Hi, Paul
When verifying qemu with the ACPI RMR feature on v5.19-rc1, the guest kernel
gets stuck for several minutes.
On 5.18, there is no such problem.
After reverting this patch, the issue is solved:
Commit 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers
from consuming CPU")
qemu cmd:
build/aarch64-softmmu/qemu-system-aarch64 -machine
virt,gic-version=3,iommu=smmuv3 \
-enable-kvm -cpu host -m 1024 \
-kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
"rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off acpi=force" \
-bios QEMU_EFI.fd
log:
InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
add-symbol-file
/home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll
0x75459000
Loading driver at 0x00075458000 EntryPoint=0x00075459058 IScsiDxe.efi
InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7AA4DE98
ProtectUefiImageCommon - 0x7AA4D040
- 0x0000000075458000 - 0x000000000003F000
SetUefiImageMemoryAttributes - 0x0000000075458000 - 0x0000000000001000
(0x0000000000004008)
SetUefiImageMemoryAttributes - 0x0000000075459000 - 0x000000000003B000
(0x0000000000020008)
SetUefiImageMemoryAttributes - 0x0000000075494000 - 0x0000000000003000
(0x0000000000004008)
InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952C8
InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952F8
InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
InstallProtocolInterface: 59324945-EC44-4C0D-B1CD-9DB139DF070C 75495348
InstallProtocolInterface: 09576E91-6D3F-11D2-8E39-00A0C969723B 754953E8
InstallProtocolInterface: 330D4706-F2A0-4E4F-A369-B66FA8D54385 7AA4D728
Not sure whether this has already been reported or fixed.
Thanks
On Sun, Jun 12, 2022 at 12:32:59AM +0800, Zhangfei Gao wrote:
> Hi, Paul
>
> When verifying qemu with acpi rmr feature on v5.19-rc1, the guest kernel
> stuck for several minutes.
Stuck for several minutes but then continues normally? Or stuck for
several minutes before you kill qemu?
And I have to ask... What happened without the ACPI RMR feature?
> And on 5.18, there is no such problem.
>
> After revert this patch, the issue solved.
> Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from
> consuming CPU)
>
>
> qemu cmd:
> build/aarch64-softmmu/qemu-system-aarch64 -machine
> virt,gic-version=3,iommu=smmuv3 \
> -enable-kvm -cpu host -m 1024 \
> -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
> "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off acpi=force" \
> -bios QEMU_EFI.fd
>
> log:
> [UEFI boot log snipped; identical to the one above]
>
>
> Not sure it is either reported or solved.
This is the first I have heard of it, so thank you for reporting it.
Do you have a way of collecting something like sysrq-t output?
Thanx, Paul
Hi, Paul
On 2022/6/12 9:36 PM, Paul E. McKenney wrote:
> On Sun, Jun 12, 2022 at 03:40:30PM +0800, [email protected] wrote:
>> Hi, Paul
>>
>> On 2022/6/12 12:59 AM, Paul E. McKenney wrote:
>>> On Sun, Jun 12, 2022 at 12:32:59AM +0800, Zhangfei Gao wrote:
>>>> Hi, Paul
>>>>
>>>> When verifying qemu with acpi rmr feature on v5.19-rc1, the guest kernel
>>>> stuck for several minutes.
>>> Stuck for several minutes but then continues normally? Or stuck for
>>> several minutes before you kill qemu?
>> qemu boot stuck for several minutes, then guest can bootup normally, just
>> slower.
>>> And I have to ask... What happened without the ACPI RMR feature?
>> If no ACPI, qemu boot quickly without stuck.
>> build/aarch64-softmmu/qemu-system-aarch64 -machine
>> virt,gic-version=3,iommu=smmuv3 \
>> -enable-kvm -cpu host -m 1024 \
>> -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
>> "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off"
>>
>> Adding acpi=force & -bios QEMU_EFI.fd, qemu boot stuck for several minutes.
>>
>> By the way, my hardware platform is aarch64.
> Thank you for the information! The problem is excessive delay rather
> than a hang, and it is configuration-dependent. Good to know!
>
>> Only change this can solve the stuck issue.
>>
>> --- a/kernel/rcu/srcutree.c
>> +++ b/kernel/rcu/srcutree.c
>> @@ -524,6 +524,10 @@ static unsigned long srcu_get_delay(struct srcu_struct *ssp)
>>  {
>>  	unsigned long jbase = SRCU_INTERVAL;
>>
>> +	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
>> +			 READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>> +		return 0;
>> +	return SRCU_INTERVAL;
> I am glad that you have a workaround for this problem, but this change
> would re-introduce the problem that commit 282d8998e997 ("srcu: Prevent
> expedited GPs and blocking readers from consuming CPU") was intended
> to fix. For one example, your change can prevent kernel live patching
> from applying a patch. So something else is needed.
Understood; that change was just to debug where the issue is.
>
> Does changing the value of SRCU_MAX_INTERVAL to (say) 3 decrease the delay
> significantly? (This is not a fix, either, but instead a debug check.)
No, that does not help.
>
> Your change always returns zero if another SRCU grace period is needed.
> Let's look at the callers of srcu_get_delay():
>
> o cleanup_srcu_struct() uses it to check whether there is an
> expedited grace period pending, leaking the srcu_struct if so.
> This should not affect boot delay. (Unless you are invoking
> init_srcu_struct() and cleanup_srcu_struct() really really
> often.)
>
> o srcu_gp_end() uses it to determine whether or not to allow
> a one-jiffy delay before invoking callbacks at the end of
> a grace period.
>
> o srcu_funnel_gp_start() uses it to determine whether or not to
> allow a one-jiffy delay before starting the process of checking
> for the end of an SRCU grace period.
>
> o try_check_zero() uses it to add an additional short delay
> (instead of a long delay) between checks of reader state.
>
> o process_srcu() uses it to calculate the long delay between
> checks of reader state.
>
> These add one-jiffy delays, except for process_srcu(), which adds a delay
> of up to 10 jiffies. Even given HZ=100 (as opposed to the HZ=1000 that
> I normally use), this requires thousands of such delays to add up to the
> several minutes that you are seeing. (In theory, the delays could also
> be due to SRCU readers, except that in that case adjusting timeouts in
> the grace-period processing would not make things go faster.)
>
> So, does acpi=force & -bios QEMU_EFI.fd add SRCU grace periods? If so,
> it would be very good make sure that this code is using SRCU efficiently.
> One way to check would be to put a printk() into synchronize_srcu(),
> though maintaining a counter and printing (say) every 1000th invocation
> might be easier on the console output.
Good idea.
>>>> And on 5.18, there is no such problem.
>>>>
>>>> After revert this patch, the issue solved.
>>>> Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from
>>>> consuming CPU)
>>>>
>>>>
>>>> qemu cmd:
>>>> build/aarch64-softmmu/qemu-system-aarch64 -machine
>>>> virt,gic-version=3,iommu=smmuv3 \
>>>> -enable-kvm -cpu host -m 1024 \
>>>> -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
>>>> "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off acpi=force" \
>>>> -bios QEMU_EFI.fd
>>>>
>>>> log:
>>>> [UEFI boot log snipped; identical to the one above]
>>>>
>>>>
>>>> Not sure it is either reported or solved.
>>> This is the first I have heard of it, so thank you for reporting it.
>>>
>>> Do you have a way of collecting something sysrq-t output?
>> Do you mean "echo t > /proc/sysrq-trigger",
>> There are too much output and kernel dump can not stop.
> OK. What other tools do you have to work out what is happening during
> temporary hangs such as this one?
>
> The question to be answered: "Is there usually at least one task waiting
> in synchronize_srcu() during these hangs, and if so, which srcu_struct
> is passed to those synchronize_srcu() calls?"
As you suggested, I added a print in __synchronize_srcu(), printing once every 1000 calls.
With acpi=force & -bios QEMU_EFI.fd
When qemu is stuck at
InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
add-symbol-file
/home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll
0x75459000
The print in __synchronize_srcu() counts from 0 to 9001.
[ 94.271350] gzf __synchronize_srcu loop=1001
....
[ 222.621659] __synchronize_srcu loop=9001
[ 222.621664] CPU: 96 PID: 2294 Comm: qemu-system-aar Not tainted
5.19.0-rc1-15071-g697f40b5235f-dirty #615
[ 222.621666] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD,
BIOS 2280-V2 CS V5.B133.01 03/25/2021
[ 222.621667] Call trace:
[ 222.621668] dump_backtrace+0xe4/0xf0
[ 222.621670] show_stack+0x20/0x70
[ 222.621672] dump_stack_lvl+0x8c/0xb8
[ 222.621674] dump_stack+0x18/0x34
[ 222.621676] __synchronize_srcu+0x120/0x128
[ 222.621678] synchronize_srcu_expedited+0x2c/0x40
[ 222.621680] kvm_swap_active_memslots+0x130/0x198
[ 222.621683] kvm_activate_memslot+0x40/0x68
[ 222.621684] kvm_set_memslot+0x184/0x3b0
[ 222.621686] __kvm_set_memory_region+0x288/0x438
[ 222.621688] kvm_set_memory_region+0x3c/0x60
[ 222.621689] kvm_vm_ioctl+0x5a0/0x13e0
[ 222.621691] __arm64_sys_ioctl+0xb0/0xf8
[ 222.621693] invoke_syscall+0x4c/0x110
[ 222.621696] el0_svc_common.constprop.0+0x68/0x128
[ 222.621698] do_el0_svc+0x34/0xc0
[ 222.621701] el0_svc+0x30/0x98
[ 222.621704] el0t_64_sync_handler+0xb8/0xc0
[ 222.621706] el0t_64_sync+0x18c/0x190
Without acpi=force, there is no print at all (at one print per 1000 calls).
Thanks
Hi, Paul
On 2022/6/12 12:59 AM, Paul E. McKenney wrote:
> On Sun, Jun 12, 2022 at 12:32:59AM +0800, Zhangfei Gao wrote:
>> Hi, Paul
>>
>> When verifying qemu with acpi rmr feature on v5.19-rc1, the guest kernel
>> stuck for several minutes.
> Stuck for several minutes but then continues normally? Or stuck for
> several minutes before you kill qemu?
qemu boot is stuck for several minutes, then the guest can boot up normally,
just slower.
>
> And I have to ask... What happened without the ACPI RMR feature?
Without ACPI, qemu boots quickly without getting stuck.
build/aarch64-softmmu/qemu-system-aarch64 -machine
virt,gic-version=3,iommu=smmuv3 \
-enable-kvm -cpu host -m 1024 \
-kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
"rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off"
Adding acpi=force & -bios QEMU_EFI.fd, qemu boot is stuck for several minutes.
By the way, my hardware platform is aarch64.
Only this change solves the stuck issue:
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -524,6 +524,10 @@ static unsigned long srcu_get_delay(struct srcu_struct *ssp)
 {
 	unsigned long jbase = SRCU_INTERVAL;

+	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
+			 READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
+		return 0;
+	return SRCU_INTERVAL;
>
>> And on 5.18, there is no such problem.
>>
>> After revert this patch, the issue solved.
>> Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from
>> consuming CPU)
>>
>>
>> qemu cmd:
>> build/aarch64-softmmu/qemu-system-aarch64 -machine
>> virt,gic-version=3,iommu=smmuv3 \
>> -enable-kvm -cpu host -m 1024 \
>> -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
>> "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off acpi=force" \
>> -bios QEMU_EFI.fd
>>
>> log:
>> [UEFI boot log snipped; identical to the one above]
>>
>>
>> Not sure it is either reported or solved.
> This is the first I have heard of it, so thank you for reporting it.
>
> Do you have a way of collecting something sysrq-t output?
Do you mean "echo t > /proc/sysrq-trigger"?
There is too much output, and the kernel dump cannot stop.
Thanks
On Sun, Jun 12, 2022 at 03:40:30PM +0800, [email protected] wrote:
> Hi, Paul
>
> > On 2022/6/12 12:59 AM, Paul E. McKenney wrote:
> > On Sun, Jun 12, 2022 at 12:32:59AM +0800, Zhangfei Gao wrote:
> > > Hi, Paul
> > >
> > > When verifying qemu with acpi rmr feature on v5.19-rc1, the guest kernel
> > > stuck for several minutes.
> > Stuck for several minutes but then continues normally? Or stuck for
> > several minutes before you kill qemu?
>
> qemu boot stuck for several minutes, then guest can bootup normally, just
> slower.
> >
> > And I have to ask... What happened without the ACPI RMR feature?
> If no ACPI, qemu boot quickly without stuck.
> build/aarch64-softmmu/qemu-system-aarch64 -machine
> virt,gic-version=3,iommu=smmuv3 \
> -enable-kvm -cpu host -m 1024 \
> -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
> "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off"
>
> Adding acpi=force & -bios QEMU_EFI.fd, qemu boot stuck for several minutes.
>
> By the way, my hardware platform is aarch64.
Thank you for the information! The problem is excessive delay rather
than a hang, and it is configuration-dependent. Good to know!
> Only change this can solve the stuck issue.
>
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -524,6 +524,10 @@ static unsigned long srcu_get_delay(struct srcu_struct *ssp)
>  {
>  	unsigned long jbase = SRCU_INTERVAL;
>
> +	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
> +			 READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> +		return 0;
> +	return SRCU_INTERVAL;
I am glad that you have a workaround for this problem, but this change
would re-introduce the problem that commit 282d8998e997 ("srcu: Prevent
expedited GPs and blocking readers from consuming CPU") was intended
to fix. For one example, your change can prevent kernel live patching
from applying a patch. So something else is needed.
Does changing the value of SRCU_MAX_INTERVAL to (say) 3 decrease the delay
significantly? (This is not a fix, either, but instead a debug check.)
Your change always returns zero if another SRCU grace period is needed.
Let's look at the callers of srcu_get_delay():
o cleanup_srcu_struct() uses it to check whether there is an
expedited grace period pending, leaking the srcu_struct if so.
This should not affect boot delay. (Unless you are invoking
init_srcu_struct() and cleanup_srcu_struct() really really
often.)
o srcu_gp_end() uses it to determine whether or not to allow
a one-jiffy delay before invoking callbacks at the end of
a grace period.
o srcu_funnel_gp_start() uses it to determine whether or not to
allow a one-jiffy delay before starting the process of checking
for the end of an SRCU grace period.
o try_check_zero() uses it to add an additional short delay
(instead of a long delay) between checks of reader state.
o process_srcu() uses it to calculate the long delay between
checks of reader state.
These add one-jiffy delays, except for process_srcu(), which adds a delay
of up to 10 jiffies. Even given HZ=100 (as opposed to the HZ=1000 that
I normally use), this requires thousands of such delays to add up to the
several minutes that you are seeing. (In theory, the delays could also
be due to SRCU readers, except that in that case adjusting timeouts in
the grace-period processing would not make things go faster.)
So, does acpi=force & -bios QEMU_EFI.fd add SRCU grace periods? If so,
it would be very good to make sure that this code is using SRCU efficiently.
One way to check would be to put a printk() into synchronize_srcu(),
though maintaining a counter and printing (say) every 1000th invocation
might be easier on the console output.
> > > And on 5.18, there is no such problem.
> > >
> > > After revert this patch, the issue solved.
> > > Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from
> > > consuming CPU)
> > >
> > >
> > > qemu cmd:
> > > build/aarch64-softmmu/qemu-system-aarch64 -machine
> > > virt,gic-version=3,iommu=smmuv3 \
> > > -enable-kvm -cpu host -m 1024 \
> > > -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
> > > "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off acpi=force" \
> > > -bios QEMU_EFI.fd
> > >
> > > log:
> > > [UEFI boot log snipped; identical to the one above]
> > >
> > >
> > > Not sure it is either reported or solved.
> > This is the first I have heard of it, so thank you for reporting it.
> >
> > Do you have a way of collecting something sysrq-t output?
> Do you mean "echo t > /proc/sysrq-trigger",
> There are too much output and kernel dump can not stop.
OK. What other tools do you have to work out what is happening during
temporary hangs such as this one?
The question to be answered: "Is there usually at least one task waiting
in synchronize_srcu() during these hangs, and if so, which srcu_struct
is passed to those synchronize_srcu() calls?"
Thanx, Paul
On Sun, Jun 12, 2022 at 10:59:30PM +0800, [email protected] wrote:
> Hi, Paul
>
> > On 2022/6/12 9:36 PM, Paul E. McKenney wrote:
> > On Sun, Jun 12, 2022 at 03:40:30PM +0800, [email protected] wrote:
> > > Hi, Paul
> > >
> > > On 2022/6/12 12:59 AM, Paul E. McKenney wrote:
> > > > On Sun, Jun 12, 2022 at 12:32:59AM +0800, Zhangfei Gao wrote:
> > > > > Hi, Paul
> > > > >
> > > > > When verifying qemu with acpi rmr feature on v5.19-rc1, the guest kernel
> > > > > stuck for several minutes.
> > > > Stuck for several minutes but then continues normally? Or stuck for
> > > > several minutes before you kill qemu?
> > > qemu boot stuck for several minutes, then guest can bootup normally, just
> > > slower.
> > > > And I have to ask... What happened without the ACPI RMR feature?
> > > If no ACPI, qemu boot quickly without stuck.
> > > build/aarch64-softmmu/qemu-system-aarch64 -machine
> > > virt,gic-version=3,iommu=smmuv3 \
> > > -enable-kvm -cpu host -m 1024 \
> > > -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
> > > "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off"
> > >
> > > Adding acpi=force & -bios QEMU_EFI.fd, qemu boot stuck for several minutes.
> > >
> > > By the way, my hardware platform is aarch64.
> > Thank you for the information! The problem is excessive delay rather
> > than a hang, and it is configuration-dependent. Good to know!
> >
> > > Only change this can solve the stuck issue.
> > >
> > > --- a/kernel/rcu/srcutree.c
> > > +++ b/kernel/rcu/srcutree.c
> > > @@ -524,6 +524,10 @@ static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> > >  {
> > >  	unsigned long jbase = SRCU_INTERVAL;
> > >
> > > +	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
> > > +			 READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > > +		return 0;
> > > +	return SRCU_INTERVAL;
> > I am glad that you have a workaround for this problem, but this change
> > would re-introduce the problem that commit 282d8998e997 ("srcu: Prevent
> > expedited GPs and blocking readers from consuming CPU") was intended
> > to fix. For one example, your change can prevent kernel live patching
> > from applying a patch. So something else is needed.
> Understand, it is just debug where has issue.
> >
> > Does changing the value of SRCU_MAX_INTERVAL to (say) 3 decrease the delay
> > significantly? (This is not a fix, either, but instead a debug check.)
> No use.
OK, that indicates that you have a very large number of invocations
of synchronize_srcu() or synchronize_srcu_expedited() instead of only
a few that take a very long time each.
> > Your change always returns zero if another SRCU grace period is needed.
> > Let's look at the callers of srcu_get_delay():
> >
> > o cleanup_srcu_struct() uses it to check whether there is an
> > expedited grace period pending, leaking the srcu_struct if so.
> > This should not affect boot delay. (Unless you are invoking
> > init_srcu_struct() and cleanup_srcu_struct() really really
> > often.)
> >
> > o srcu_gp_end() uses it to determine whether or not to allow
> > a one-jiffy delay before invoking callbacks at the end of
> > a grace period.
> >
> > o srcu_funnel_gp_start() uses it to determine whether or not to
> > allow a one-jiffy delay before starting the process of checking
> > for the end of an SRCU grace period.
> >
> > o try_check_zero() uses it to add an additional short delay
> > (instead of a long delay) between checks of reader state.
> >
> > o process_srcu() uses it to calculate the long delay between
> > checks of reader state.
> >
> > These add one-jiffy delays, except for process_srcu(), which adds a delay
> > of up to 10 jiffies. Even given HZ=100 (as opposed to the HZ=1000 that
> > I normally use), this requires thousands of such delays to add up to the
> > several minutes that you are seeing. (In theory, the delays could also
> > be due to SRCU readers, except that in that case adjusting timeouts in
> > the grace-period processing would not make things go faster.)
> >
> > So, does acpi=force & -bios QEMU_EFI.fd add SRCU grace periods? If so,
> > it would be very good make sure that this code is using SRCU efficiently.
> > One way to check would be to put a printk() into synchronize_srcu(),
> > though maintaining a counter and printing (say) every 1000th invocation
> > might be easier on the console output.
>
> good idea.
Glad you like it. ;-)
> > > > > And on 5.18, there is no such problem.
> > > > >
> > > > > After revert this patch, the issue solved.
> > > > > Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from
> > > > > consuming CPU)
> > > > >
> > > > >
> > > > > qemu cmd:
> > > > > build/aarch64-softmmu/qemu-system-aarch64 -machine
> > > > > virt,gic-version=3,iommu=smmuv3 \
> > > > > -enable-kvm -cpu host -m 1024 \
> > > > > -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
> > > > > "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off acpi=force" \
> > > > > -bios QEMU_EFI.fd
> > > > >
> > > > > log:
> > > > > [UEFI boot log snipped; identical to the one above]
> > > > >
> > > > >
> > > > > Not sure it is either reported or solved.
> > > > This is the first I have heard of it, so thank you for reporting it.
> > > >
> > > > Do you have a way of collecting something sysrq-t output?
> > > Do you mean "echo t > /proc/sysrq-trigger",
> > > There are too much output and kernel dump can not stop.
> > OK. What other tools do you have to work out what is happening during
> > temporary hangs such as this one?
> >
> > The question to be answered: "Is there usually at least one task waiting
> > in synchronize_srcu() during these hangs, and if so, which srcu_struct
> > is passed to those synchronize_srcu() calls?"
>
> As you suggested, add print in __synchronize_srcu, 1000 times print once.
>
> With acpi=force & -bios QEMU_EFI.fd
>
> When qemu stuck in
> InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
> add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll
> 0x75459000
>
> The print in __synchronize_srcu is print from 0 t0 9001.
Now that is what I call a large number of calls!
> [ 94.271350] gzf __synchronize_srcu loop=1001
> ....
>
> [ 222.621659] __synchronize_srcu loop=9001
> [ 222.621664] CPU: 96 PID: 2294 Comm: qemu-system-aar Not tainted
> 5.19.0-rc1-15071-g697f40b5235f-dirty #615
> [ 222.621666] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD, BIOS
> 2280-V2 CS V5.B133.01 03/25/2021
> [ 222.621667] Call trace:
> [ 222.621668] dump_backtrace+0xe4/0xf0
> [ 222.621670] show_stack+0x20/0x70
> [ 222.621672] dump_stack_lvl+0x8c/0xb8
> [ 222.621674] dump_stack+0x18/0x34
> [ 222.621676] __synchronize_srcu+0x120/0x128
> [ 222.621678] synchronize_srcu_expedited+0x2c/0x40
> [ 222.621680] kvm_swap_active_memslots+0x130/0x198
> [ 222.621683] kvm_activate_memslot+0x40/0x68
> [ 222.621684] kvm_set_memslot+0x184/0x3b0
> [ 222.621686] __kvm_set_memory_region+0x288/0x438
> [ 222.621688] kvm_set_memory_region+0x3c/0x60
This is KVM setting up one mapping in your IORT RMR memory, correct?
(As in I/O remapping table reserved memory regions.)
> [ 222.621689] kvm_vm_ioctl+0x5a0/0x13e0
And this ioctl() is qemu telling KVM to do so, correct?
> [ 222.621691] __arm64_sys_ioctl+0xb0/0xf8
> [ 222.621693] invoke_syscall+0x4c/0x110
> [ 222.621696] el0_svc_common.constprop.0+0x68/0x128
> [ 222.621698] do_el0_svc+0x34/0xc0
> [ 222.621701] el0_svc+0x30/0x98
> [ 222.621704] el0t_64_sync_handler+0xb8/0xc0
> [ 222.621706] el0t_64_sync+0x18c/0x190
>
>
> If no acpi=force, no print at all, 1000 times one print.
OK, this certainly explains the slowdown, both from adding the IORT RMR
and from the commit that you bisected to. The answers to a few questions
might help illuminate a path towards a fix:
Do these reserved memory regions really need to be allocated separately?
(For example, are they really all non-contiguous? If not, that is, if
there are a lot of contiguous memory regions, could you sort the IORT
by address and do one ioctl() for each set of contiguous memory regions?)
Are all of these reserved memory regions set up before init is spawned?
Are all of these reserved memory regions set up while there is only a
single vCPU up and running?
Is the SRCU grace period really needed in this case? (I freely confess
to not being all that familiar with KVM.)
Thanx, Paul
On 6/12/22 18:40, Paul E. McKenney wrote:
>> Do these reserved memory regions really need to be allocated separately?
>> (For example, are they really all non-contiguous? If not, that is, if
>> there are a lot of contiguous memory regions, could you sort the IORT
>> by address and do one ioctl() for each set of contiguous memory regions?)
>>
>> Are all of these reserved memory regions set up before init is spawned?
>>
>> Are all of these reserved memory regions set up while there is only a
>> single vCPU up and running?
>>
>> Is the SRCU grace period really needed in this case? (I freely confess
>> to not being all that familiar with KVM.)
>
> Oh, and there was a similar many-requests problem with networking many
> years ago. This was solved by adding a new syscall/ioctl()/whatever
> that permitted many requests to be presented to the kernel with a single
> system call.
>
> Could a new ioctl() be introduced that requested a large number
> of these memory regions in one go so as to make each call to
> synchronize_rcu_expedited() cover a useful fraction of your 9000+
> requests? Adding a few of the KVM guys on CC for their thoughts.
Unfortunately not. Apart from this specific case, in general the calls
to KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers
in the guest, and those writes then map to an ioctl(). Typically the
guest sets up one device at a time, and each setup step causes a
synchronize_srcu()---and an expedited one at that.
KVM has two SRCUs:
1) kvm->irq_srcu barely relies on the "sleepable" part; its readers
are very, very small, but it needs extremely fast detection of grace
periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
KVM_SET_GSI_ROUTING", 2014-05-05), which split it off from kvm->srcu.
Readers are not so frequent.
2) kvm->srcu is nastier because there are readers all the time. The
read-side critical sections are still short-ish, but they need the
sleepable part because they access user memory.
Writers are not frequent per se; the problem is they come in very large
bursts when a guest boots. And while the whole boot path overall can be
quadratic, O(n) expensive calls to synchronize_srcu() can have a larger
impact on runtime than the O(n^2) parts, as demonstrated here.
Therefore, we operated on the assumption that the callers of
synchronize_srcu_expedited() were _anyway_ busy running CPU-bound guest
code and the desire was to get past the booting phase as fast as
possible. If the guest wants to eat host CPU it can "for(;;)" as much
as it wants; therefore, as long as expedited GPs didn't eat CPU
*throughout the whole system*, a preemptible busy wait in
synchronize_srcu_expedited() was not problematic.
These assumptions did match the SRCU code when kvm->srcu and
kvm->irq_srcu were introduced (in 2009 and 2014, respectively). But
perhaps they no longer hold now that each SRCU is not as independent
as it was in those years and instead uses workqueues?
Thanks,
Paolo
On Sun, Jun 12, 2022 at 09:20:29AM -0700, Paul E. McKenney wrote:
> On Sun, Jun 12, 2022 at 10:59:30PM +0800, [email protected] wrote:
> > Hi, Paul
> >
> > On 2022/6/12 下午9:36, Paul E. McKenney wrote:
> > > On Sun, Jun 12, 2022 at 03:40:30PM +0800, [email protected] wrote:
> > > > Hi, Paul
> > > >
> > > > On 2022/6/12 上午12:59, Paul E. McKenney wrote:
> > > > > On Sun, Jun 12, 2022 at 12:32:59AM +0800, Zhangfei Gao wrote:
> > > > > > Hi, Paul
> > > > > >
> > > > > > When verifying qemu with acpi rmr feature on v5.19-rc1, the guest kernel
> > > > > > stuck for several minutes.
> > > > > Stuck for several minutes but then continues normally? Or stuck for
> > > > > several minutes before you kill qemu?
> > > > qemu boot stuck for several minutes, then guest can bootup normally, just
> > > > slower.
> > > > > And I have to ask... What happened without the ACPI RMR feature?
> > > > If no ACPI, qemu boot quickly without stuck.
> > > > build/aarch64-softmmu/qemu-system-aarch64 -machine
> > > > virt,gic-version=3,iommu=smmuv3 \
> > > > -enable-kvm -cpu host -m 1024 \
> > > > -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
> > > > "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off"
> > > >
> > > > Adding acpi=force & -bios QEMU_EFI.fd, qemu boot stuck for several minutes.
> > > >
> > > > By the way, my hardware platform is aarch64.
> > > Thank you for the information! The problem is excessive delay rather
> > > than a hang, and it is configuration-dependent. Good to know!
> > >
> > > > Only change this can solve the stuck issue.
> > > >
> > > > --- a/kernel/rcu/srcutree.c
> > > > +++ b/kernel/rcu/srcutree.c
> > > > @@ -524,6 +524,10 @@ static unsigned long srcu_get_delay(struct srcu_struct
> > > > *ssp)
> > > > {
> > > > unsigned long jbase = SRCU_INTERVAL;
> > > >
> > > > + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
> > > > READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > > > + return 0;
> > > > + return SRCU_INTERVAL;
> > > I am glad that you have a workaround for this problem, but this change
> > > would re-introduce the problem that commit 282d8998e997 ("srcu: Prevent
> > > expedited GPs and blocking readers from consuming CPU") was intended
> > > to fix. For one example, your change can prevent kernel live patching
> > > from applying a patch. So something else is needed.
> > Understand, it is just debug where has issue.
> > >
> > > Does changing the value of SRCU_MAX_INTERVAL to (say) 3 decrease the delay
> > > significantly? (This is not a fix, either, but instead a debug check.)
> > No use.
>
> OK, that indicates that you have a very large number of invocations
> of synchronize_srcu() or synchronize_srcu_expedited() instead of only
> a few that take a very long time each.
>
> > > Your change always returns zero if another SRCU grace period is needed.
> > > Let's look at the callers of srcu_get_delay():
> > >
> > > o cleanup_srcu_struct() uses it to check whether there is an
> > > expedited grace period pending, leaking the srcu_struct if so.
> > > This should not affect boot delay. (Unless you are invoking
> > > init_srcu_struct() and cleanup_srcu_struct() really really
> > > often.)
> > >
> > > o srcu_gp_end() uses it to determine whether or not to allow
> > > a one-jiffy delay before invoking callbacks at the end of
> > > a grace period.
> > >
> > > o srcu_funnel_gp_start() uses it to determine whether or not to
> > > allow a one-jiffy delay before starting the process of checking
> > > for the end of an SRCU grace period.
> > >
> > > o try_check_zero() uses it to add an additional short delay
> > > (instead of a long delay) between checks of reader state.
> > >
> > > o process_srcu() uses it to calculate the long delay between
> > > checks of reader state.
> > >
> > > These add one-jiffy delays, except for process_srcu(), which adds a delay
> > > of up to 10 jiffies. Even given HZ=100 (as opposed to the HZ=1000 that
> > > I normally use), this requires thousands of such delays to add up to the
> > > several minutes that you are seeing. (In theory, the delays could also
> > > be due to SRCU readers, except that in that case adjusting timeouts in
> > > the grace-period processing would not make things go faster.)
> > >
> > > So, does acpi=force & -bios QEMU_EFI.fd add SRCU grace periods? If so,
> > > it would be very good make sure that this code is using SRCU efficiently.
> > > One way to check would be to put a printk() into synchronize_srcu(),
> > > though maintaining a counter and printing (say) every 1000th invocation
> > > might be easier on the console output.
> >
> > good idea.
>
> Glad you like it. ;-)
>
> > > > > > And on 5.18, there is no such problem.
> > > > > >
> > > > > > After revert this patch, the issue solved.
> > > > > > Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from
> > > > > > consuming CPU)
> > > > > >
> > > > > >
> > > > > > qemu cmd:
> > > > > > build/aarch64-softmmu/qemu-system-aarch64 -machine
> > > > > > virt,gic-version=3,iommu=smmuv3 \
> > > > > > -enable-kvm -cpu host -m 1024 \
> > > > > > -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
> > > > > > "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off acpi=force" \
> > > > > > -bios QEMU_EFI.fd
> > > > > >
> > > > > > log:
> > > > > > InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
> > > > > > add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll
> > > > > > 0x75459000
> > > > > > Loading driver at 0x00075458000 EntryPoint=0x00075459058 IScsiDxe.efi
> > > > > > InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7AA4DE98
> > > > > > ProtectUefiImageCommon - 0x7AA4D040
> > > > > > - 0x0000000075458000 - 0x000000000003F000
> > > > > > SetUefiImageMemoryAttributes - 0x0000000075458000 - 0x0000000000001000
> > > > > > (0x0000000000004008)
> > > > > > SetUefiImageMemoryAttributes - 0x0000000075459000 - 0x000000000003B000
> > > > > > (0x0000000000020008)
> > > > > > SetUefiImageMemoryAttributes - 0x0000000075494000 - 0x0000000000003000
> > > > > > (0x0000000000004008)
> > > > > > InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952C8
> > > > > > InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
> > > > > > InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
> > > > > > InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952F8
> > > > > > InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
> > > > > > InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
> > > > > > InstallProtocolInterface: 59324945-EC44-4C0D-B1CD-9DB139DF070C 75495348
> > > > > > InstallProtocolInterface: 09576E91-6D3F-11D2-8E39-00A0C969723B 754953E8
> > > > > > InstallProtocolInterface: 330D4706-F2A0-4E4F-A369-B66FA8D54385 7AA4D728
> > > > > >
> > > > > >
> > > > > > Not sure it is either reported or solved.
> > > > > This is the first I have heard of it, so thank you for reporting it.
> > > > >
> > > > > Do you have a way of collecting something sysrq-t output?
> > > > Do you mean "echo t > /proc/sysrq-trigger",
> > > > There are too much output and kernel dump can not stop.
> > > OK. What other tools do you have to work out what is happening during
> > > temporary hangs such as this one?
> > >
> > > The question to be answered: "Is there usually at least one task waiting
> > > in synchronize_srcu() during these hangs, and if so, which srcu_struct
> > > is passed to those synchronize_srcu() calls?"
> >
> > As you suggested, add print in __synchronize_srcu, 1000 times print once.
> >
> > With acpi=force & -bios QEMU_EFI.fd
> >
> > When qemu stuck in
> > InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
> > add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll
> > 0x75459000
> >
> > The print in __synchronize_srcu counted from 0 to 9001.
>
> Now that is what I call a large number of calls!
>
> > [ 94.271350] gzf __synchronize_srcu loop=1001
> > ....
> >
> > [ 222.621659] __synchronize_srcu loop=9001
> > [ 222.621664] CPU: 96 PID: 2294 Comm: qemu-system-aar Not tainted
> > 5.19.0-rc1-15071-g697f40b5235f-dirty #615
> > [ 222.621666] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD, BIOS
> > 2280-V2 CS V5.B133.01 03/25/2021
> > [ 222.621667] Call trace:
> > [ 222.621668] dump_backtrace+0xe4/0xf0
> > [ 222.621670] show_stack+0x20/0x70
> > [ 222.621672] dump_stack_lvl+0x8c/0xb8
> > [ 222.621674] dump_stack+0x18/0x34
> > [ 222.621676] __synchronize_srcu+0x120/0x128
> > [ 222.621678] synchronize_srcu_expedited+0x2c/0x40
> > [ 222.621680] kvm_swap_active_memslots+0x130/0x198
> > [ 222.621683] kvm_activate_memslot+0x40/0x68
> > [ 222.621684] kvm_set_memslot+0x184/0x3b0
> > [ 222.621686] __kvm_set_memory_region+0x288/0x438
> > [ 222.621688] kvm_set_memory_region+0x3c/0x60
>
> This is KVM setting up one mapping in your IORT RMR memory, correct?
> (As in I/O remapping table reserved memory regions.)
>
> > [ 222.621689] kvm_vm_ioctl+0x5a0/0x13e0
>
> And this ioctl() is qemu telling KVM to do so, correct?
>
> > [ 222.621691] __arm64_sys_ioctl+0xb0/0xf8
> > [ 222.621693] invoke_syscall+0x4c/0x110
> > [ 222.621696] el0_svc_common.constprop.0+0x68/0x128
> > [ 222.621698] do_el0_svc+0x34/0xc0
> > [ 222.621701] el0_svc+0x30/0x98
> > [ 222.621704] el0t_64_sync_handler+0xb8/0xc0
> > [ 222.621706] el0t_64_sync+0x18c/0x190
> >
> >
> > If no acpi=force, no print at all, 1000 times one print.
>
> OK, this certainly explains the slowdown, both from adding the IORT RMR
> and from the commit that you bisected to. The answers to a few questions
> might help illuminate a path towards a fix:
>
> Do these reserved memory regions really need to be allocated separately?
> (For example, are they really all non-contiguous? If not, that is, if
> there are a lot of contiguous memory regions, could you sort the IORT
> by address and do one ioctl() for each set of contiguous memory regions?)
>
> Are all of these reserved memory regions set up before init is spawned?
>
> Are all of these reserved memory regions set up while there is only a
> single vCPU up and running?
>
> Is the SRCU grace period really needed in this case? (I freely confess
> to not being all that familiar with KVM.)
Oh, and there was a similar many-requests problem with networking many
years ago. This was solved by adding a new syscall/ioctl()/whatever
that permitted many requests to be presented to the kernel with a single
system call.
Could a new ioctl() be introduced that requested a large number
of these memory regions in one go so as to make each call to
synchronize_rcu_expedited() cover a useful fraction of your 9000+
requests? Adding a few of the KVM guys on CC for their thoughts.
Thanx, Paul
On 6/12/22 19:29, Paolo Bonzini wrote:
> On 6/12/22 18:40, Paul E. McKenney wrote:
>>> Do these reserved memory regions really need to be allocated separately?
>>> (For example, are they really all non-contiguous? If not, that is, if
>>> there are a lot of contiguous memory regions, could you sort the IORT
>>> by address and do one ioctl() for each set of contiguous memory
>>> regions?)
>>>
>>> Are all of these reserved memory regions set up before init is spawned?
>>>
>>> Are all of these reserved memory regions set up while there is only a
>>> single vCPU up and running?
>>>
>>> Is the SRCU grace period really needed in this case? (I freely confess
>>> to not being all that familiar with KVM.)
>>
>> Oh, and there was a similar many-requests problem with networking many
>> years ago. This was solved by adding a new syscall/ioctl()/whatever
>> that permitted many requests to be presented to the kernel with a single
>> system call.
>>
>> Could a new ioctl() be introduced that requested a large number
>> of these memory regions in one go so as to make each call to
>> synchronize_rcu_expedited() cover a useful fraction of your 9000+
>> requests? Adding a few of the KVM guys on CC for their thoughts.
Another question: how much can call_srcu() callbacks pile up these days?
I've always been a bit wary of letting userspace do an arbitrary
number of allocations that can only be freed after a grace period, but
perhaps there's a way to query SRCU and apply some backpressure?
Paolo
On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
> On 6/12/22 18:40, Paul E. McKenney wrote:
> > > Do these reserved memory regions really need to be allocated separately?
> > > (For example, are they really all non-contiguous? If not, that is, if
> > > there are a lot of contiguous memory regions, could you sort the IORT
> > > by address and do one ioctl() for each set of contiguous memory regions?)
> > >
> > > Are all of these reserved memory regions set up before init is spawned?
> > >
> > > Are all of these reserved memory regions set up while there is only a
> > > single vCPU up and running?
> > >
> > > Is the SRCU grace period really needed in this case? (I freely confess
> > > to not being all that familiar with KVM.)
> >
> > Oh, and there was a similar many-requests problem with networking many
> > years ago. This was solved by adding a new syscall/ioctl()/whatever
> > that permitted many requests to be presented to the kernel with a single
> > system call.
> >
> > Could a new ioctl() be introduced that requested a large number
> > of these memory regions in one go so as to make each call to
> > synchronize_rcu_expedited() cover a useful fraction of your 9000+
> > requests? Adding a few of the KVM guys on CC for their thoughts.
>
> Unfortunately not. Apart from this specific case, in general the calls to
> KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
> guest, and those writes then map to a ioctl. Typically the guest sets up a
> device at a time, and each setup step causes a synchronize_srcu()---and
> expedited at that.
I was afraid of something like that...
> KVM has two SRCUs:
>
> 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
> that are very very small, but it needs extremely fast detection of grace
> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
> KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are
> not so frequent.
>
> 2) kvm->srcu is nastier because there are readers all the time. The
> read-side critical section are still short-ish, but they need the sleepable
> part because they access user memory.
Which one of these two is in play in this case?
> Writers are not frequent per se; the problem is they come in very large
> bursts when a guest boots. And while the whole boot path overall can be
> quadratic, O(n) expensive calls to synchronize_srcu() can have a larger
> impact on runtime than the O(n^2) parts, as demonstrated here.
>
> Therefore, we operated on the assumption that the callers of
> synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code
> and the desire was to get past the booting phase as fast as possible. If
> the guest wants to eat host CPU it can "for(;;)" as much as it wants;
> therefore, as long as expedited GPs didn't eat CPU *throughout the whole
> system*, a preemptable busy wait in synchronize_srcu_expedited() were not
> problematic.
>
> This assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu
> were was introduced (respectively in 2009 and 2014). But perhaps they do
> not hold anymore now that each SRCU is not as independent as it used to be
> in those years, and instead they use workqueues instead?
The problem was not internal to SRCU, but rather due to the fact
that kernel live patching (KLP) had problems with the CPU-bound tasks
resulting from repeated synchronize_rcu_expedited() invocations. So I
added heuristics to get the occasional sleep in there for KLP's benefit.
Perhaps these heuristics need to be less aggressive about adding sleep.
These heuristics have these aspects:
1. The longer readers persist in an expedited SRCU grace period,
the longer the wait between successive checks of the reader
state. Roughly speaking, we wait as long as the grace period
has currently been in effect, capped at ten jiffies.
2. SRCU grace periods have several phases. We reset so that each
phase starts by not waiting (new phase, new set of readers,
so don't penalize this set for the sins of the previous set).
But once we get to the point of adding delay, we add the
delay based on the beginning of the full grace period.
Right now, the checking for grace-period length does not allow for the
possibility that a grace period might start just before the jiffies
counter gets incremented (because I didn't realize that anyone cared),
so that is one possible thing to change. I can also allow more no-delay
checks per SRCU grace-period phase.
Zhangfei, does something like the patch shown below help?
Additional adjustments are likely needed to avoid re-breaking KLP,
but we have to start somewhere...
Thanx, Paul
------------------------------------------------------------------------
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 50ba70f019dea..6a354368ac1d1 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
#define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending.
#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers.
-#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase consecutive no-delay instances.
+#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances.
#define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances.
/*
@@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
*/
static unsigned long srcu_get_delay(struct srcu_struct *ssp)
{
+ unsigned long gpstart;
+ unsigned long j;
unsigned long jbase = SRCU_INTERVAL;
if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
jbase = 0;
- if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
- jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
+ if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
+ j = jiffies - 1;
+ gpstart = READ_ONCE(ssp->srcu_gp_start);
+ if (time_after(j, gpstart))
+ jbase += j - gpstart;
+ }
if (!jbase) {
WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
On Sun, Jun 12, 2022 at 07:47:10PM +0200, Paolo Bonzini wrote:
> On 6/12/22 19:29, Paolo Bonzini wrote:
> > On 6/12/22 18:40, Paul E. McKenney wrote:
> > > > Do these reserved memory regions really need to be allocated separately?
> > > > (For example, are they really all non-contiguous? If not, that is, if
> > > > there are a lot of contiguous memory regions, could you sort the IORT
> > > > by address and do one ioctl() for each set of contiguous memory
> > > > regions?)
> > > >
> > > > Are all of these reserved memory regions set up before init is spawned?
> > > >
> > > > Are all of these reserved memory regions set up while there is only a
> > > > single vCPU up and running?
> > > >
> > > > Is the SRCU grace period really needed in this case? (I freely confess
> > > > to not being all that familiar with KVM.)
> > >
> > > Oh, and there was a similar many-requests problem with networking many
> > > years ago. This was solved by adding a new syscall/ioctl()/whatever
> > > that permitted many requests to be presented to the kernel with a single
> > > system call.
> > >
> > > Could a new ioctl() be introduced that requested a large number
> > > of these memory regions in one go so as to make each call to
> > > synchronize_rcu_expedited() cover a useful fraction of your 9000+
> > > requests? Adding a few of the KVM guys on CC for their thoughts.
>
> Another question: how much can call_srcu() callbacks pile up these days?
> I've always been a bit wary of letting userspace do an arbitrary number of
> allocations that can only be freed after a grace period, but perhaps there's
> a way to query SRCU and apply some backpressure?
They can pile up as much as ever, especially if you have long-duration
sleeping readers.
But you could do the occasional srcu_barrier() to wait for all the
preceding ones to get done. Maybe every 1000th call_srcu() or similar?
Thanx, Paul
On 6/12/22 20:49, Paul E. McKenney wrote:
>>
>> 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
>> that are very very small, but it needs extremely fast detection of grace
>> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
>> KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are
>> not so frequent.
>>
>> 2) kvm->srcu is nastier because there are readers all the time. The
>> read-side critical section are still short-ish, but they need the sleepable
>> part because they access user memory.
>
> Which one of these two is in play in this case?
The latter, kvm->srcu; though at boot time both are hammered on quite a
bit (and then essentially not at all).
For the one involved it's still pretty rare for readers to sleep, but it
cannot be excluded. Most critical sections are short, I'd guess in the
thousands of clock cycles, but I can add some instrumentation tomorrow
(or anyway before Tuesday).
> The problem was not internal to SRCU, but rather due to the fact
> that kernel live patching (KLP) had problems with the CPU-bound tasks
> resulting from repeated synchronize_rcu_expedited() invocations.
I see. Perhaps only add to the back-to-back counter if the
synchronize_srcu_expedited() takes longer than a jiffy? This would
indirectly check whether synchronize_srcu_expedited() readers are
actually blocking. KVM uses synchronize_srcu_expedited() because it
expects it to take very little time (again, I'll get hard numbers asap).
Paolo
On Sun, Jun 12, 2022 at 09:23:14PM +0200, Paolo Bonzini wrote:
> On 6/12/22 20:49, Paul E. McKenney wrote:
> > >
> > > 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
> > > that are very very small, but it needs extremely fast detection of grace
> > > periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
> > > KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are
> > > not so frequent.
> > >
> > > 2) kvm->srcu is nastier because there are readers all the time. The
> > > read-side critical section are still short-ish, but they need the sleepable
> > > part because they access user memory.
> >
> > Which one of these two is in play in this case?
>
> The latter, kvm->srcu; though at boot time both are hammered on quite a bit
> (and then essentially not at all).
>
> For the one involved it's still pretty rare for readers to sleep, but it
> cannot be excluded. Most critical sections are short, I'd guess in the
> thousands of clock cycles but I can add some instrumentation tomorrow (or
> anyway before Tuesday).
And in any case, readers can be preempted.
> > The problem was not internal to SRCU, but rather due to the fact
> > that kernel live patching (KLP) had problems with the CPU-bound tasks
> > resulting from repeated synchronize_rcu_expedited() invocations.
>
> I see. Perhaps only add to the back-to-back counter if the
> synchronize_srcu_expedited() takes longer than a jiffy? This would
> indirectly check if syncronize_srcu_expedited() readers are actually
> blocking. KVM uses syncronize_srcu_expedited() because it expects it to
> take very little (again I'll get hard numbers asap).
This is in effect what the patch in my previous email does. In current
mainline, it waits for up to a jiffy before switching to sleep mode,
but with the patch it waits for between one and two jiffies before making
that switch.
Using call_srcu() with the occasional srcu_barrier() would of course be
faster still, but perhaps more complex.
Thanx, Paul
Hi, Paul
On 2022/6/13 上午2:49, Paul E. McKenney wrote:
> On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
>> On 6/12/22 18:40, Paul E. McKenney wrote:
>>>> Do these reserved memory regions really need to be allocated separately?
>>>> (For example, are they really all non-contiguous? If not, that is, if
>>>> there are a lot of contiguous memory regions, could you sort the IORT
>>>> by address and do one ioctl() for each set of contiguous memory regions?)
>>>>
>>>> Are all of these reserved memory regions set up before init is spawned?
>>>>
>>>> Are all of these reserved memory regions set up while there is only a
>>>> single vCPU up and running?
>>>>
>>>> Is the SRCU grace period really needed in this case? (I freely confess
>>>> to not being all that familiar with KVM.)
>>> Oh, and there was a similar many-requests problem with networking many
>>> years ago. This was solved by adding a new syscall/ioctl()/whatever
>>> that permitted many requests to be presented to the kernel with a single
>>> system call.
>>>
>>> Could a new ioctl() be introduced that requested a large number
>>> of these memory regions in one go so as to make each call to
>>> synchronize_rcu_expedited() cover a useful fraction of your 9000+
>>> requests? Adding a few of the KVM guys on CC for their thoughts.
>> Unfortunately not. Apart from this specific case, in general the calls to
>> KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
>> guest, and those writes then map to a ioctl. Typically the guest sets up a
>> device at a time, and each setup step causes a synchronize_srcu()---and
>> expedited at that.
> I was afraid of something like that...
>
>> KVM has two SRCUs:
>>
>> 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
>> that are very very small, but it needs extremely fast detection of grace
>> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
>> KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are
>> not so frequent.
>>
>> 2) kvm->srcu is nastier because there are readers all the time. The
>> read-side critical section are still short-ish, but they need the sleepable
>> part because they access user memory.
> Which one of these two is in play in this case?
>
>> Writers are not frequent per se; the problem is they come in very large
>> bursts when a guest boots. And while the whole boot path overall can be
>> quadratic, O(n) expensive calls to synchronize_srcu() can have a larger
>> impact on runtime than the O(n^2) parts, as demonstrated here.
>>
>> Therefore, we operated on the assumption that the callers of
>> synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code
>> and the desire was to get past the booting phase as fast as possible. If
>> the guest wants to eat host CPU it can "for(;;)" as much as it wants;
>> therefore, as long as expedited GPs didn't eat CPU *throughout the whole
>> system*, a preemptable busy wait in synchronize_srcu_expedited() were not
>> problematic.
>>
>> This assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu
>> were was introduced (respectively in 2009 and 2014). But perhaps they do
>> not hold anymore now that each SRCU is not as independent as it used to be
>> in those years, and instead they use workqueues instead?
> The problem was not internal to SRCU, but rather due to the fact
> that kernel live patching (KLP) had problems with the CPU-bound tasks
> resulting from repeated synchronize_rcu_expedited() invocations. So I
> added heuristics to get the occasional sleep in there for KLP's benefit.
> Perhaps these heuristics need to be less aggressive about adding sleep.
>
> These heuristics have these aspects:
>
> 1. The longer readers persist in an expedited SRCU grace period,
> the longer the wait between successive checks of the reader
> state. Roughly speaking, we wait as long as the grace period
> has currently been in effect, capped at ten jiffies.
>
> 2. SRCU grace periods have several phases. We reset so that each
> phase starts by not waiting (new phase, new set of readers,
> so don't penalize this set for the sins of the previous set).
> But once we get to the point of adding delay, we add the
> delay based on the beginning of the full grace period.
>
> Right now, the checking for grace-period length does not allow for the
> possibility that a grace period might start just before the jiffies
> counter gets incremented (because I didn't realize that anyone cared),
> so that is one possible thing to change. I can also allow more no-delay
> checks per SRCU grace-period phase.
>
> Zhangfei, does something like the patch shown below help?
>
> Additional adjustments are likely needed to avoid re-breaking KLP,
> but we have to start somewhere...
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 50ba70f019dea..6a354368ac1d1 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>
> #define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending.
> #define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers.
> -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase consecutive no-delay instances.
> +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances.
> #define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances.
>
> /*
> @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> */
> static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> {
> + unsigned long gpstart;
> + unsigned long j;
> unsigned long jbase = SRCU_INTERVAL;
>
> if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> jbase = 0;
> - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> + j = jiffies - 1;
> + gpstart = READ_ONCE(ssp->srcu_gp_start);
> + if (time_after(j, gpstart))
> + jbase += j - gpstart;
> + }
> if (!jbase) {
> WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
Unfortunately, this patch does not help.
I then re-added the debug info.
During the qemu boot:
[ 232.997667] __synchronize_srcu loop=1000
[ 361.094493] __synchronize_srcu loop=9000
[ 361.094501] Call trace:
[ 361.094502] dump_backtrace+0xe4/0xf0
[ 361.094505] show_stack+0x20/0x70
[ 361.094507] dump_stack_lvl+0x8c/0xb8
[ 361.094509] dump_stack+0x18/0x34
[ 361.094511] __synchronize_srcu+0x120/0x128
[ 361.094514] synchronize_srcu_expedited+0x2c/0x40
[ 361.094515] kvm_swap_active_memslots+0x130/0x198
[ 361.094519] kvm_activate_memslot+0x40/0x68
[ 361.094520] kvm_set_memslot+0x2f8/0x3b0
[ 361.094523] __kvm_set_memory_region+0x2e4/0x438
[ 361.094524] kvm_set_memory_region+0x78/0xb8
[ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0
[ 361.094528] __arm64_sys_ioctl+0xb0/0xf8
[ 361.094530] invoke_syscall+0x4c/0x110
[ 361.094533] el0_svc_common.constprop.0+0x68/0x128
[ 361.094536] do_el0_svc+0x34/0xc0
[ 361.094538] el0_svc+0x30/0x98
[ 361.094541] el0t_64_sync_handler+0xb8/0xc0
[ 361.094544] el0t_64_sync+0x18c/0x190
[ 363.942817] kvm_set_memory_region loop=6000
On Mon, Jun 13, 2022 at 11:04:39AM +0800, [email protected] wrote:
> Hi, Paul
>
> On 2022/6/13 上午2:49, Paul E. McKenney wrote:
> > On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
> > > On 6/12/22 18:40, Paul E. McKenney wrote:
> > > > > Do these reserved memory regions really need to be allocated separately?
> > > > > (For example, are they really all non-contiguous? If not, that is, if
> > > > > there are a lot of contiguous memory regions, could you sort the IORT
> > > > > by address and do one ioctl() for each set of contiguous memory regions?)
> > > > >
> > > > > Are all of these reserved memory regions set up before init is spawned?
> > > > >
> > > > > Are all of these reserved memory regions set up while there is only a
> > > > > single vCPU up and running?
> > > > >
> > > > > Is the SRCU grace period really needed in this case? (I freely confess
> > > > > to not being all that familiar with KVM.)
> > > > Oh, and there was a similar many-requests problem with networking many
> > > > years ago. This was solved by adding a new syscall/ioctl()/whatever
> > > > that permitted many requests to be presented to the kernel with a single
> > > > system call.
> > > >
> > > > Could a new ioctl() be introduced that requested a large number
> > > > of these memory regions in one go so as to make each call to
> > > > synchronize_rcu_expedited() cover a useful fraction of your 9000+
> > > > requests? Adding a few of the KVM guys on CC for their thoughts.
> > > Unfortunately not. Apart from this specific case, in general the calls to
> > > KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
> > > guest, and those writes then map to an ioctl. Typically the guest sets up a
> > > device at a time, and each setup step causes a synchronize_srcu()---and
> > > expedited at that.
> > I was afraid of something like that...
> >
> > > KVM has two SRCUs:
> > >
> > > 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
> > > that are very very small, but it needs extremely fast detection of grace
> > > periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
> > > KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are
> > > not so frequent.
> > >
> > > 2) kvm->srcu is nastier because there are readers all the time. The
> > > read-side critical sections are still short-ish, but they need the sleepable
> > > part because they access user memory.
> > Which one of these two is in play in this case?
> >
> > > Writers are not frequent per se; the problem is they come in very large
> > > bursts when a guest boots. And while the whole boot path overall can be
> > > quadratic, O(n) expensive calls to synchronize_srcu() can have a larger
> > > impact on runtime than the O(n^2) parts, as demonstrated here.
> > >
> > > Therefore, we operated on the assumption that the callers of
> > > synchronize_srcu_expedited() were _anyway_ busy running CPU-bound guest code
> > > and the desire was to get past the booting phase as fast as possible. If
> > > the guest wants to eat host CPU it can "for(;;)" as much as it wants;
> > > therefore, as long as expedited GPs didn't eat CPU *throughout the whole
> > > system*, a preemptable busy wait in synchronize_srcu_expedited() was not
> > > problematic.
> > >
> > > These assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu
> > > were introduced (in 2009 and 2014, respectively). But perhaps they no
> > > longer hold now that each SRCU is not as independent as it used to be in
> > > those years and instead uses workqueues?
> > The problem was not internal to SRCU, but rather due to the fact
> > that kernel live patching (KLP) had problems with the CPU-bound tasks
> > resulting from repeated synchronize_rcu_expedited() invocations. So I
> > added heuristics to get the occasional sleep in there for KLP's benefit.
> > Perhaps these heuristics need to be less aggressive about adding sleep.
> >
> > These heuristics have these aspects:
> >
> > 1. The longer readers persist in an expedited SRCU grace period,
> > the longer the wait between successive checks of the reader
> > state. Roughly speaking, we wait as long as the grace period
> > has currently been in effect, capped at ten jiffies.
> >
> > 2. SRCU grace periods have several phases. We reset so that each
> > phase starts by not waiting (new phase, new set of readers,
> > so don't penalize this set for the sins of the previous set).
> > But once we get to the point of adding delay, we add the
> > delay based on the beginning of the full grace period.
> >
> > Right now, the checking for grace-period length does not allow for the
> > possibility that a grace period might start just before the jiffies
> > counter gets incremented (because I didn't realize that anyone cared),
> > so that is one possible thing to change. I can also allow more no-delay
> > checks per SRCU grace-period phase.
> >
> > Zhangfei, does something like the patch shown below help?
> >
> > Additional adjustments are likely needed to avoid re-breaking KLP,
> > but we have to start somewhere...
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index 50ba70f019dea..6a354368ac1d1 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> > #define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending.
> > #define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers.
> > -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase consecutive no-delay instances.
> > +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances.
> > #define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances.
> > /*
> > @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> > */
> > static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> > {
> > + unsigned long gpstart;
> > + unsigned long j;
> > unsigned long jbase = SRCU_INTERVAL;
> > if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > jbase = 0;
> > - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> > - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> > + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> > + j = jiffies - 1;
> > + gpstart = READ_ONCE(ssp->srcu_gp_start);
> > + if (time_after(j, gpstart))
> > + jbase += j - gpstart;
> > + }
> > if (!jbase) {
> > WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> > if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> Unfortunately, this patch does not help.
>
> I then re-added the debug info.
>
> During the qemu boot
> [ 232.997667] __synchronize_srcu loop=1000
>
> [ 361.094493] __synchronize_srcu loop=9000
> [ 361.094501] Call trace:
> [ 361.094502] dump_backtrace+0xe4/0xf0
> [ 361.094505] show_stack+0x20/0x70
> [ 361.094507] dump_stack_lvl+0x8c/0xb8
> [ 361.094509] dump_stack+0x18/0x34
> [ 361.094511] __synchronize_srcu+0x120/0x128
> [ 361.094514] synchronize_srcu_expedited+0x2c/0x40
> [ 361.094515] kvm_swap_active_memslots+0x130/0x198
> [ 361.094519] kvm_activate_memslot+0x40/0x68
> [ 361.094520] kvm_set_memslot+0x2f8/0x3b0
> [ 361.094523] __kvm_set_memory_region+0x2e4/0x438
> [ 361.094524] kvm_set_memory_region+0x78/0xb8
> [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0
> [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8
> [ 361.094530] invoke_syscall+0x4c/0x110
> [ 361.094533] el0_svc_common.constprop.0+0x68/0x128
> [ 361.094536] do_el0_svc+0x34/0xc0
> [ 361.094538] el0_svc+0x30/0x98
> [ 361.094541] el0t_64_sync_handler+0xb8/0xc0
> [ 361.094544] el0t_64_sync+0x18c/0x190
> [ 363.942817] kvm_set_memory_region loop=6000
Huh.
One possibility is that the "if (!jbase)" block needs to be nested
within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block.
One additional debug is to apply the patch below on top of the one you
just now kindly tested, then use whatever debug technique you wish to
work out what fraction of the time during that critical interval that
srcu_get_delay() returns non-zero.
Other thoughts?
Thanx, Paul
On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote:
> On Mon, Jun 13, 2022 at 11:04:39AM +0800, [email protected] wrote:
> > Hi, Paul
> >
> > On 2022/6/13 上午2:49, Paul E. McKenney wrote:
> > > On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
> > > > On 6/12/22 18:40, Paul E. McKenney wrote:
> > > > > > Do these reserved memory regions really need to be allocated separately?
> > > > > > (For example, are they really all non-contiguous? If not, that is, if
> > > > > > there are a lot of contiguous memory regions, could you sort the IORT
> > > > > > by address and do one ioctl() for each set of contiguous memory regions?)
> > > > > >
> > > > > > Are all of these reserved memory regions set up before init is spawned?
> > > > > >
> > > > > > Are all of these reserved memory regions set up while there is only a
> > > > > > single vCPU up and running?
> > > > > >
> > > > > > Is the SRCU grace period really needed in this case? (I freely confess
> > > > > > to not being all that familiar with KVM.)
> > > > > Oh, and there was a similar many-requests problem with networking many
> > > > > years ago. This was solved by adding a new syscall/ioctl()/whatever
> > > > > that permitted many requests to be presented to the kernel with a single
> > > > > system call.
> > > > >
> > > > > Could a new ioctl() be introduced that requested a large number
> > > > > of these memory regions in one go so as to make each call to
> > > > > synchronize_rcu_expedited() cover a useful fraction of your 9000+
> > > > > requests? Adding a few of the KVM guys on CC for their thoughts.
> > > > Unfortunately not. Apart from this specific case, in general the calls to
> > > > KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
> > > > guest, and those writes then map to an ioctl. Typically the guest sets up a
> > > > device at a time, and each setup step causes a synchronize_srcu()---and
> > > > expedited at that.
> > > I was afraid of something like that...
> > >
> > > > KVM has two SRCUs:
> > > >
> > > > 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
> > > > that are very very small, but it needs extremely fast detection of grace
> > > > periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
> > > > KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are
> > > > not so frequent.
> > > >
> > > > 2) kvm->srcu is nastier because there are readers all the time. The
> > > > read-side critical sections are still short-ish, but they need the sleepable
> > > > part because they access user memory.
> > > Which one of these two is in play in this case?
> > >
> > > > Writers are not frequent per se; the problem is they come in very large
> > > > bursts when a guest boots. And while the whole boot path overall can be
> > > > quadratic, O(n) expensive calls to synchronize_srcu() can have a larger
> > > > impact on runtime than the O(n^2) parts, as demonstrated here.
> > > >
> > > > Therefore, we operated on the assumption that the callers of
> > > > synchronize_srcu_expedited() were _anyway_ busy running CPU-bound guest code
> > > > and the desire was to get past the booting phase as fast as possible. If
> > > > the guest wants to eat host CPU it can "for(;;)" as much as it wants;
> > > > therefore, as long as expedited GPs didn't eat CPU *throughout the whole
> > > > system*, a preemptable busy wait in synchronize_srcu_expedited() was not
> > > > problematic.
> > > >
> > > > These assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu
> > > > were introduced (in 2009 and 2014, respectively). But perhaps they no
> > > > longer hold now that each SRCU is not as independent as it used to be in
> > > > those years and instead uses workqueues?
> > > The problem was not internal to SRCU, but rather due to the fact
> > > that kernel live patching (KLP) had problems with the CPU-bound tasks
> > > resulting from repeated synchronize_rcu_expedited() invocations. So I
> > > added heuristics to get the occasional sleep in there for KLP's benefit.
> > > Perhaps these heuristics need to be less aggressive about adding sleep.
> > >
> > > These heuristics have these aspects:
> > >
> > > 1. The longer readers persist in an expedited SRCU grace period,
> > > the longer the wait between successive checks of the reader
> > > state. Roughly speaking, we wait as long as the grace period
> > > has currently been in effect, capped at ten jiffies.
> > >
> > > 2. SRCU grace periods have several phases. We reset so that each
> > > phase starts by not waiting (new phase, new set of readers,
> > > so don't penalize this set for the sins of the previous set).
> > > But once we get to the point of adding delay, we add the
> > > delay based on the beginning of the full grace period.
> > >
> > > Right now, the checking for grace-period length does not allow for the
> > > possibility that a grace period might start just before the jiffies
> > > counter gets incremented (because I didn't realize that anyone cared),
> > > so that is one possible thing to change. I can also allow more no-delay
> > > checks per SRCU grace-period phase.
> > >
> > > Zhangfei, does something like the patch shown below help?
> > >
> > > Additional adjustments are likely needed to avoid re-breaking KLP,
> > > but we have to start somewhere...
> > >
> > > Thanx, Paul
> > >
> > > ------------------------------------------------------------------------
> > >
> > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > > index 50ba70f019dea..6a354368ac1d1 100644
> > > --- a/kernel/rcu/srcutree.c
> > > +++ b/kernel/rcu/srcutree.c
> > > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> > > #define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending.
> > > #define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers.
> > > -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase consecutive no-delay instances.
> > > +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances.
> > > #define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances.
> > > /*
> > > @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> > > */
> > > static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> > > {
> > > + unsigned long gpstart;
> > > + unsigned long j;
> > > unsigned long jbase = SRCU_INTERVAL;
> > > if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > > jbase = 0;
> > > - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> > > - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> > > + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> > > + j = jiffies - 1;
> > > + gpstart = READ_ONCE(ssp->srcu_gp_start);
> > > + if (time_after(j, gpstart))
> > > + jbase += j - gpstart;
> > > + }
> > > if (!jbase) {
> > > WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> > > if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> > Unfortunately, this patch does not help.
> >
> > I then re-added the debug info.
> >
> > During the qemu boot
> > [ 232.997667] __synchronize_srcu loop=1000
> >
> > [ 361.094493] __synchronize_srcu loop=9000
> > [ 361.094501] Call trace:
> > [ 361.094502] dump_backtrace+0xe4/0xf0
> > [ 361.094505] show_stack+0x20/0x70
> > [ 361.094507] dump_stack_lvl+0x8c/0xb8
> > [ 361.094509] dump_stack+0x18/0x34
> > [ 361.094511] __synchronize_srcu+0x120/0x128
> > [ 361.094514] synchronize_srcu_expedited+0x2c/0x40
> > [ 361.094515] kvm_swap_active_memslots+0x130/0x198
> > [ 361.094519] kvm_activate_memslot+0x40/0x68
> > [ 361.094520] kvm_set_memslot+0x2f8/0x3b0
> > [ 361.094523] __kvm_set_memory_region+0x2e4/0x438
> > [ 361.094524] kvm_set_memory_region+0x78/0xb8
> > [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0
> > [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8
> > [ 361.094530] invoke_syscall+0x4c/0x110
> > [ 361.094533] el0_svc_common.constprop.0+0x68/0x128
> > [ 361.094536] do_el0_svc+0x34/0xc0
> > [ 361.094538] el0_svc+0x30/0x98
> > [ 361.094541] el0t_64_sync_handler+0xb8/0xc0
> > [ 361.094544] el0t_64_sync+0x18c/0x190
> > [ 363.942817] kvm_set_memory_region loop=6000
>
> Huh.
>
> One possibility is that the "if (!jbase)" block needs to be nested
> within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block.
And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the
above change reduces the overhead by more than an order of magnitude.
Except that the overhead of the series is far less than one second,
not the several minutes that you are seeing. So the per-call overhead
decreases from about 17 microseconds to a bit more than one microsecond.
I could imagine an extra order of magnitude if you are running HZ=100
instead of the HZ=1000 that I am running. But that only gets up to a
few seconds.
> One additional debug is to apply the patch below on top of the one you
> just now kindly tested, then use whatever debug technique you wish to
> work out what fraction of the time during that critical interval that
> srcu_get_delay() returns non-zero.
So I am very interested in the above debug. ;-)
Other thoughts?
Thanx, Paul
Hi, Paul
On 2022/6/13 下午12:16, Paul E. McKenney wrote:
> On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote:
>> On Mon, Jun 13, 2022 at 11:04:39AM +0800, [email protected] wrote:
>>> Hi, Paul
>>>
>>> On 2022/6/13 上午2:49, Paul E. McKenney wrote:
>>>> On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
>>>>> On 6/12/22 18:40, Paul E. McKenney wrote:
>>>>>>> Do these reserved memory regions really need to be allocated separately?
>>>>>>> (For example, are they really all non-contiguous? If not, that is, if
>>>>>>> there are a lot of contiguous memory regions, could you sort the IORT
>>>>>>> by address and do one ioctl() for each set of contiguous memory regions?)
>>>>>>>
>>>>>>> Are all of these reserved memory regions set up before init is spawned?
>>>>>>>
>>>>>>> Are all of these reserved memory regions set up while there is only a
>>>>>>> single vCPU up and running?
>>>>>>>
>>>>>>> Is the SRCU grace period really needed in this case? (I freely confess
>>>>>>> to not being all that familiar with KVM.)
>>>>>> Oh, and there was a similar many-requests problem with networking many
>>>>>> years ago. This was solved by adding a new syscall/ioctl()/whatever
>>>>>> that permitted many requests to be presented to the kernel with a single
>>>>>> system call.
>>>>>>
>>>>>> Could a new ioctl() be introduced that requested a large number
>>>>>> of these memory regions in one go so as to make each call to
>>>>>> synchronize_rcu_expedited() cover a useful fraction of your 9000+
>>>>>> requests? Adding a few of the KVM guys on CC for their thoughts.
>>>>> Unfortunately not. Apart from this specific case, in general the calls to
>>>>> KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
>>>>> guest, and those writes then map to an ioctl. Typically the guest sets up a
>>>>> device at a time, and each setup step causes a synchronize_srcu()---and
>>>>> expedited at that.
>>>> I was afraid of something like that...
>>>>
>>>>> KVM has two SRCUs:
>>>>>
>>>>> 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
>>>>> that are very very small, but it needs extremely fast detection of grace
>>>>> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
>>>>> KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are
>>>>> not so frequent.
>>>>>
>>>>> 2) kvm->srcu is nastier because there are readers all the time. The
>>>>> read-side critical sections are still short-ish, but they need the sleepable
>>>>> part because they access user memory.
>>>> Which one of these two is in play in this case?
>>>>
>>>>> Writers are not frequent per se; the problem is they come in very large
>>>>> bursts when a guest boots. And while the whole boot path overall can be
>>>>> quadratic, O(n) expensive calls to synchronize_srcu() can have a larger
>>>>> impact on runtime than the O(n^2) parts, as demonstrated here.
>>>>>
>>>>> Therefore, we operated on the assumption that the callers of
>>>>> synchronize_srcu_expedited() were _anyway_ busy running CPU-bound guest code
>>>>> and the desire was to get past the booting phase as fast as possible. If
>>>>> the guest wants to eat host CPU it can "for(;;)" as much as it wants;
>>>>> therefore, as long as expedited GPs didn't eat CPU *throughout the whole
>>>>> system*, a preemptable busy wait in synchronize_srcu_expedited() was not
>>>>> problematic.
>>>>>
>>>>> These assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu
>>>>> were introduced (in 2009 and 2014, respectively). But perhaps they no
>>>>> longer hold now that each SRCU is not as independent as it used to be in
>>>>> those years and instead uses workqueues?
>>>> The problem was not internal to SRCU, but rather due to the fact
>>>> that kernel live patching (KLP) had problems with the CPU-bound tasks
>>>> resulting from repeated synchronize_rcu_expedited() invocations. So I
>>>> added heuristics to get the occasional sleep in there for KLP's benefit.
>>>> Perhaps these heuristics need to be less aggressive about adding sleep.
>>>>
>>>> These heuristics have these aspects:
>>>>
>>>> 1. The longer readers persist in an expedited SRCU grace period,
>>>> the longer the wait between successive checks of the reader
>>>> state. Roughly speaking, we wait as long as the grace period
>>>> has currently been in effect, capped at ten jiffies.
>>>>
>>>> 2. SRCU grace periods have several phases. We reset so that each
>>>> phase starts by not waiting (new phase, new set of readers,
>>>> so don't penalize this set for the sins of the previous set).
>>>> But once we get to the point of adding delay, we add the
>>>> delay based on the beginning of the full grace period.
>>>>
>>>> Right now, the checking for grace-period length does not allow for the
>>>> possibility that a grace period might start just before the jiffies
>>>> counter gets incremented (because I didn't realize that anyone cared),
>>>> so that is one possible thing to change. I can also allow more no-delay
>>>> checks per SRCU grace-period phase.
>>>>
>>>> Zhangfei, does something like the patch shown below help?
>>>>
>>>> Additional adjustments are likely needed to avoid re-breaking KLP,
>>>> but we have to start somewhere...
>>>>
>>>> Thanx, Paul
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>>>> index 50ba70f019dea..6a354368ac1d1 100644
>>>> --- a/kernel/rcu/srcutree.c
>>>> +++ b/kernel/rcu/srcutree.c
>>>> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>>>> #define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending.
>>>> #define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers.
>>>> -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase consecutive no-delay instances.
>>>> +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances.
>>>> #define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances.
>>>> /*
>>>> @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>>>> */
>>>> static unsigned long srcu_get_delay(struct srcu_struct *ssp)
>>>> {
>>>> + unsigned long gpstart;
>>>> + unsigned long j;
>>>> unsigned long jbase = SRCU_INTERVAL;
>>>> if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>>> jbase = 0;
>>>> - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
>>>> - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
>>>> + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
>>>> + j = jiffies - 1;
>>>> + gpstart = READ_ONCE(ssp->srcu_gp_start);
>>>> + if (time_after(j, gpstart))
>>>> + jbase += j - gpstart;
>>>> + }
>>>> if (!jbase) {
>>>> WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
>>>> if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
>>> Unfortunately, this patch does not help.
>>>
>>> I then re-added the debug info.
>>>
>>> During the qemu boot
>>> [ 232.997667] __synchronize_srcu loop=1000
>>>
>>> [ 361.094493] __synchronize_srcu loop=9000
>>> [ 361.094501] Call trace:
>>> [ 361.094502] dump_backtrace+0xe4/0xf0
>>> [ 361.094505] show_stack+0x20/0x70
>>> [ 361.094507] dump_stack_lvl+0x8c/0xb8
>>> [ 361.094509] dump_stack+0x18/0x34
>>> [ 361.094511] __synchronize_srcu+0x120/0x128
>>> [ 361.094514] synchronize_srcu_expedited+0x2c/0x40
>>> [ 361.094515] kvm_swap_active_memslots+0x130/0x198
>>> [ 361.094519] kvm_activate_memslot+0x40/0x68
>>> [ 361.094520] kvm_set_memslot+0x2f8/0x3b0
>>> [ 361.094523] __kvm_set_memory_region+0x2e4/0x438
>>> [ 361.094524] kvm_set_memory_region+0x78/0xb8
>>> [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0
>>> [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8
>>> [ 361.094530] invoke_syscall+0x4c/0x110
>>> [ 361.094533] el0_svc_common.constprop.0+0x68/0x128
>>> [ 361.094536] do_el0_svc+0x34/0xc0
>>> [ 361.094538] el0_svc+0x30/0x98
>>> [ 361.094541] el0t_64_sync_handler+0xb8/0xc0
>>> [ 361.094544] el0t_64_sync+0x18c/0x190
>>> [ 363.942817] kvm_set_memory_region loop=6000
>> Huh.
>>
>> One possibility is that the "if (!jbase)" block needs to be nested
>> within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block.
I tested this diff and it does NOT help:
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 50ba70f019de..36286a4b74e6 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
 #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
 #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
-#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
+#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
 #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.

 /*
@@ -522,16 +522,23 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
  */
 static unsigned long srcu_get_delay(struct srcu_struct *ssp)
 {
+	unsigned long gpstart;
+	unsigned long j;
 	unsigned long jbase = SRCU_INTERVAL;

 	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
 		jbase = 0;
-	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
-		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
-	if (!jbase) {
-		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
-		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
-			jbase = 1;
+	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
+		j = jiffies - 1;
+		gpstart = READ_ONCE(ssp->srcu_gp_start);
+		if (time_after(j, gpstart))
+			jbase += j - gpstart;
+
+		if (!jbase) {
+			WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
+			if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
+				jbase = 1;
+		}
 	}
> And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the
> above change reduces the overhead by more than an order of magnitude.
> Except that the overhead of the series is far less than one second,
> not the several minutes that you are seeing. So the per-call overhead
> decreases from about 17 microseconds to a bit more than one microsecond.
>
> I could imagine an extra order of magnitude if you are running HZ=100
> instead of the HZ=1000 that I am running. But that only gets up to a
> few seconds.
>
>> One additional debug is to apply the patch below on top of the one you
Apply which patch below?
>> just now kindly tested, then use whatever debug technique you wish to
>> work out what fraction of the time during that critical interval that
>> srcu_get_delay() returns non-zero.
Sorry, I am confused; there is no patch attached, right?
Should I just measure when srcu_get_delay() returns non-zero?
By the way, the issue seems to be related only to qemu ACPI, not to the
RMR feature.
Tested with: https://github.com/qemu/qemu/tree/stable-6.1
It looks like it is caused by too many kvm_region_add & kvm_region_del
calls when acpi=force.
Without acpi, kvm_region_add/del does not print (it prints once per 1000
calls).
With acpi=force, during qemu boot:
kvm_region_add region_add = 1000
kvm_region_del region_del = 1000
kvm_region_add region_add = 2000
kvm_region_del region_del = 2000
kvm_region_add region_add = 3000
kvm_region_del region_del = 3000
kvm_region_add region_add = 4000
kvm_region_del region_del = 4000
kvm_region_add region_add = 5000
kvm_region_del region_del = 5000
kvm_region_add region_add = 6000
kvm_region_del region_del = 6000
kvm_region_add/kvm_region_del ->
kvm_set_phys_mem->
kvm_set_user_memory_region->
kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem)
[ 361.094493] __synchronize_srcu loop=9000
[ 361.094501] Call trace:
[ 361.094502] dump_backtrace+0xe4/0xf0
[ 361.094505] show_stack+0x20/0x70
[ 361.094507] dump_stack_lvl+0x8c/0xb8
[ 361.094509] dump_stack+0x18/0x34
[ 361.094511] __synchronize_srcu+0x120/0x128
[ 361.094514] synchronize_srcu_expedited+0x2c/0x40
[ 361.094515] kvm_swap_active_memslots+0x130/0x198
[ 361.094519] kvm_activate_memslot+0x40/0x68
[ 361.094520] kvm_set_memslot+0x2f8/0x3b0
[ 361.094523] __kvm_set_memory_region+0x2e4/0x438
[ 361.094524] kvm_set_memory_region+0x78/0xb8
[ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0
[ 361.094528] __arm64_sys_ioctl+0xb0/0xf8
[ 361.094530] invoke_syscall+0x4c/0x110
[ 361.094533] el0_svc_common.constprop.0+0x68/0x128
[ 361.094536] do_el0_svc+0x34/0xc0
[ 361.094538] el0_svc+0x30/0x98
[ 361.094541] el0t_64_sync_handler+0xb8/0xc0
[ 361.094544] el0t_64_sync+0x18c/0x190
[ 363.942817] kvm_set_memory_region loop=6000
On Mon, Jun 13, 2022 at 02:55:47PM +0800, [email protected] wrote:
> Hi, Paul
>
> > On 2022/6/13 12:16 PM, Paul E. McKenney wrote:
> > On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote:
> > > On Mon, Jun 13, 2022 at 11:04:39AM +0800, [email protected] wrote:
> > > > Hi, Paul
> > > >
> > > > > On 2022/6/13 2:49 AM, Paul E. McKenney wrote:
> > > > > On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
> > > > > > On 6/12/22 18:40, Paul E. McKenney wrote:
> > > > > > > > Do these reserved memory regions really need to be allocated separately?
> > > > > > > > (For example, are they really all non-contiguous? If not, that is, if
> > > > > > > > there are a lot of contiguous memory regions, could you sort the IORT
> > > > > > > > by address and do one ioctl() for each set of contiguous memory regions?)
> > > > > > > >
> > > > > > > > Are all of these reserved memory regions set up before init is spawned?
> > > > > > > >
> > > > > > > > Are all of these reserved memory regions set up while there is only a
> > > > > > > > single vCPU up and running?
> > > > > > > >
> > > > > > > > Is the SRCU grace period really needed in this case? (I freely confess
> > > > > > > > to not being all that familiar with KVM.)
> > > > > > > Oh, and there was a similar many-requests problem with networking many
> > > > > > > years ago. This was solved by adding a new syscall/ioctl()/whatever
> > > > > > > that permitted many requests to be presented to the kernel with a single
> > > > > > > system call.
> > > > > > >
> > > > > > > Could a new ioctl() be introduced that requested a large number
> > > > > > > of these memory regions in one go so as to make each call to
> > > > > > > synchronize_rcu_expedited() cover a useful fraction of your 9000+
> > > > > > > requests? Adding a few of the KVM guys on CC for their thoughts.
> > > > > > Unfortunately not. Apart from this specific case, in general the calls to
> > > > > > KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
> > > > > > guest, and those writes then map to an ioctl. Typically the guest sets up one
> > > > > > device at a time, and each setup step causes a synchronize_srcu()---and
> > > > > > expedited at that.
> > > > > I was afraid of something like that...
> > > > >
> > > > > > KVM has two SRCUs:
> > > > > >
> > > > > > 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
> > > > > > that are very short, but it needs extremely fast detection of grace
> > > > > > periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
> > > > > > KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are
> > > > > > not so frequent.
> > > > > >
> > > > > > 2) kvm->srcu is nastier because there are readers all the time. The
> > > > > > read-side critical sections are still short-ish, but they need the sleepable
> > > > > > part because they access user memory.
> > > > > Which one of these two is in play in this case?
> > > > >
> > > > > > Writers are not frequent per se; the problem is they come in very large
> > > > > > bursts when a guest boots. And while the whole boot path overall can be
> > > > > > quadratic, O(n) expensive calls to synchronize_srcu() can have a larger
> > > > > > impact on runtime than the O(n^2) parts, as demonstrated here.
> > > > > >
> > > > > > Therefore, we operated on the assumption that the callers of
> > > > > > synchronize_srcu_expedited() were _anyway_ busy running CPU-bound guest code
> > > > > > and the desire was to get past the booting phase as fast as possible. If
> > > > > > the guest wants to eat host CPU it can "for(;;)" as much as it wants;
> > > > > > therefore, as long as expedited GPs didn't eat CPU *throughout the whole
> > > > > > system*, a preemptable busy wait in synchronize_srcu_expedited() was not
> > > > > > problematic.
> > > > > >
> > > > > > These assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu
> > > > > > were introduced (in 2009 and 2014, respectively). But perhaps they do
> > > > > > not hold anymore, now that each SRCU is not as independent as it used to
> > > > > > be in those years and uses workqueues instead?
> > > > > The problem was not internal to SRCU, but rather due to the fact
> > > > > that kernel live patching (KLP) had problems with the CPU-bound tasks
> > > > > resulting from repeated synchronize_rcu_expedited() invocations. So I
> > > > > added heuristics to get the occasional sleep in there for KLP's benefit.
> > > > > Perhaps these heuristics need to be less aggressive about adding sleep.
> > > > >
> > > > > These heuristics have these aspects:
> > > > >
> > > > > 1. The longer readers persist in an expedited SRCU grace period,
> > > > > the longer the wait between successive checks of the reader
> > > > > state. Roughly speaking, we wait as long as the grace period
> > > > > has currently been in effect, capped at ten jiffies.
> > > > >
> > > > > 2. SRCU grace periods have several phases. We reset so that each
> > > > > phase starts by not waiting (new phase, new set of readers,
> > > > > so don't penalize this set for the sins of the previous set).
> > > > > But once we get to the point of adding delay, we add the
> > > > > delay based on the beginning of the full grace period.
> > > > >
> > > > > Right now, the checking for grace-period length does not allow for the
> > > > > possibility that a grace period might start just before the jiffies
> > > > > counter gets incremented (because I didn't realize that anyone cared),
> > > > > so that is one possible thing to change. I can also allow more no-delay
> > > > > checks per SRCU grace-period phase.
> > > > >
> > > > > Zhangfei, does something like the patch shown below help?
> > > > >
> > > > > Additional adjustments are likely needed to avoid re-breaking KLP,
> > > > > but we have to start somewhere...
> > > > >
> > > > > Thanx, Paul
> > > > >
> > > > > ------------------------------------------------------------------------
> > > > >
> > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > > > > index 50ba70f019dea..6a354368ac1d1 100644
> > > > > --- a/kernel/rcu/srcutree.c
> > > > > +++ b/kernel/rcu/srcutree.c
> > > > > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> > > > > #define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending.
> > > > > #define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers.
> > > > > -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase consecutive no-delay instances.
> > > > > +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances.
> > > > > #define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances.
> > > > > /*
> > > > > @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> > > > > */
> > > > > static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> > > > > {
> > > > > + unsigned long gpstart;
> > > > > + unsigned long j;
> > > > > unsigned long jbase = SRCU_INTERVAL;
> > > > > if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > > > > jbase = 0;
> > > > > - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> > > > > - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> > > > > + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> > > > > + j = jiffies - 1;
> > > > > + gpstart = READ_ONCE(ssp->srcu_gp_start);
> > > > > + if (time_after(j, gpstart))
> > > > > + jbase += j - gpstart;
> > > > > + }
> > > > > if (!jbase) {
> > > > > WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> > > > > if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> > > > Unfortunately, this patch does not help.
> > > >
> > > > I then re-added the debug info.
> > > >
> > > > During QEMU boot:
> > > > [ 232.997667] __synchronize_srcu loop=1000
> > > >
> > > > [ 361.094493] __synchronize_srcu loop=9000
> > > > [ ... backtrace snipped; identical to the one above ... ]
> > > > [ 363.942817] kvm_set_memory_region loop=6000
> > > Huh.
> > >
> > > One possibility is that the "if (!jbase)" block needs to be nested
> > > within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block.
>
> I tested this diff, and it did NOT help:
>
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 50ba70f019de..36286a4b74e6 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>
>  #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
>  #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
> -#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
> +#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
>  #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
>
>  /*
> @@ -522,16 +522,23 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>   */
>  static unsigned long srcu_get_delay(struct srcu_struct *ssp)
>  {
> +	unsigned long gpstart;
> +	unsigned long j;
>  	unsigned long jbase = SRCU_INTERVAL;
>
>  	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>  		jbase = 0;
> -	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> -		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> -	if (!jbase) {
> -		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> -		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> -			jbase = 1;
> +	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> +		j = jiffies - 1;
> +		gpstart = READ_ONCE(ssp->srcu_gp_start);
> +		if (time_after(j, gpstart))
> +			jbase += j - gpstart;
> +
> +		if (!jbase) {
> +			WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> +			if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> +				jbase = 1;
> +		}
>  	}
That is in fact what I was intending you to test, thank you. As you
say, unfortunately it did not help.
Could you please test removing the "if (!jbase)" block entirely?
> > And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the
> > above change reduces the overhead by more than an order of magnitude.
> > Except that the overhead of the series is far less than one second,
> > not the several minutes that you are seeing. So the per-call overhead
> > decreases from about 17 microseconds to a bit more than one microsecond.
> >
> > I could imagine an extra order of magnitude if you are running HZ=100
> > instead of the HZ=1000 that I am running. But that only gets up to a
> > few seconds.
One possible reason for the difference would be if your code has
SRCU readers.
Could you please tell me the value of CONFIG_HZ on your system?
Also the value of CONFIG_PREEMPTION?
> > > One additional debug is to apply the patch below on top of the one you
> apply the patch below?
> > > just now kindly tested, then use whatever debug technique you wish to
> > > work out what fraction of the time during that critical interval that
> > > srcu_get_delay() returns non-zero.
> Sorry, I am confused; there is no patch, right?
Apologies, my omission.
> Just measure when srcu_get_delay() returns non-zero?
Exactly, please!
> By the way, the issue seems to be related only to QEMU ACPI, not to the
> RMR feature.
> Test with: https://github.com/qemu/qemu/tree/stable-6.1
>
> It looks like it is caused by too many kvm_region_add & kvm_region_del
> calls when acpi=force. Without ACPI, kvm_region_add/del is not printed
> (the counters print once per 1000 calls).
>
> [ ... kvm_region_add/del counts, call chain, and backtrace snipped;
> identical to the log above ... ]
Good to know, thank you!
Thanx, Paul
On 2022/6/13 8:18 PM, Paul E. McKenney wrote:
> On Mon, Jun 13, 2022 at 02:55:47PM +0800, [email protected] wrote:
>> Hi, Paul
>>
>> On 2022/6/13 12:16 PM, Paul E. McKenney wrote:
>>> [ ... earlier discussion, the srcu_get_delay() patch, and the debug
>>> backtrace snipped; quoted in full above ... ]
>>>> Huh.
>>>>
>>>> One possibility is that the "if (!jbase)" block needs to be nested
>>>> within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block.
>> I tested this diff, and it did NOT help:
>>
>> [ ... srcu_get_delay() diff snipped; same as quoted above ... ]
> That is in fact what I was intending you to test, thank you. As you
> say, unfortunately it did not help.
>
> Could you please test removing the "if (!jbase)" block entirely?
Removing the "if (!jbase)" block is much faster. I did not measure it
precisely, but QEMU (with a debug-build EFI) appears to boot normally.
From the log timestamps:
[ 114.624713] __synchronize_srcu loop=1000
[ 124.157011] __synchronize_srcu loop=9000
Comparing the methods, the timestamps differ:
5.19-rc1:
[ 94.271350] __synchronize_srcu loop=1001
[ 222.621659] __synchronize_srcu loop=9001
With your first diff:
[ 232.997667] __synchronize_srcu loop=1000
[ 361.094493] __synchronize_srcu loop=9000
With the "if (!jbase)" block removed:
[ 114.624713] __synchronize_srcu loop=1000
[ 124.157011] __synchronize_srcu loop=9000
5.18 method:
+	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
+		return 0;
+	return SRCU_INTERVAL;
[ 68.938297] __synchronize_srcu loop=1000
[ 74.598480] __synchronize_srcu loop=9000
>
>>> [ ... expedited-GP overhead numbers snipped; quoted above ... ]
> One possible reason for the difference would be if your code has
> SRCU readers.
>
> Could you please tell me the value of CONFIG_HZ on your system?
> Also the value of CONFIG_PREEMPTION?
I am using arch/arm64/configs/defconfig
make defconfig
CONFIG_PREEMPTION=y
CONFIG_HZ_250=y
Thanks
>
>>>> One additional debug is to apply the patch below on top of the one you
>> apply the patch below?
>>>> just now kindly tested, then use whatever debug technique you wish to
>>>> work out what fraction of the time during that critical interval that
>>>> srcu_get_delay() returns non-zero.
>> Sorry, I am confused; there is no patch, right?
> Apologies, my omission.
>
>> Just measure when srcu_get_delay() returns non-zero?
> Exactly, please!
>
>> [ ... ACPI memslot findings, call chain, and backtrace snipped; quoted
>> in full above ... ]
> Good to know, thank you!
>
> Thanx, Paul
On Mon, Jun 13, 2022 at 09:23:50PM +0800, [email protected] wrote:
>
>
> On 2022/6/13 8:18 PM, Paul E. McKenney wrote:
> > On Mon, Jun 13, 2022 at 02:55:47PM +0800, [email protected] wrote:
> > > Hi, Paul
> > >
> > > On 2022/6/13 12:16 PM, Paul E. McKenney wrote:
> > > > On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote:
> > > > > On Mon, Jun 13, 2022 at 11:04:39AM +0800, [email protected] wrote:
> > > > > > Hi, Paul
> > > > > >
> > > > > > On 2022/6/13 2:49 AM, Paul E. McKenney wrote:
> > > > > > > On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
> > > > > > > > On 6/12/22 18:40, Paul E. McKenney wrote:
> > > > > > > > > > Do these reserved memory regions really need to be allocated separately?
> > > > > > > > > > (For example, are they really all non-contiguous? If not, that is, if
> > > > > > > > > > there are a lot of contiguous memory regions, could you sort the IORT
> > > > > > > > > > by address and do one ioctl() for each set of contiguous memory regions?)
> > > > > > > > > >
> > > > > > > > > > Are all of these reserved memory regions set up before init is spawned?
> > > > > > > > > >
> > > > > > > > > > Are all of these reserved memory regions set up while there is only a
> > > > > > > > > > single vCPU up and running?
> > > > > > > > > >
> > > > > > > > > > Is the SRCU grace period really needed in this case? (I freely confess
> > > > > > > > > > to not being all that familiar with KVM.)
> > > > > > > > > Oh, and there was a similar many-requests problem with networking many
> > > > > > > > > years ago. This was solved by adding a new syscall/ioctl()/whatever
> > > > > > > > > that permitted many requests to be presented to the kernel with a single
> > > > > > > > > system call.
> > > > > > > > >
> > > > > > > > > Could a new ioctl() be introduced that requested a large number
> > > > > > > > > of these memory regions in one go so as to make each call to
> > > > > > > > > synchronize_rcu_expedited() cover a useful fraction of your 9000+
> > > > > > > > > requests? Adding a few of the KVM guys on CC for their thoughts.
> > > > > > > > Unfortunately not. Apart from this specific case, in general the calls to
> > > > > > > > KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
> > > > > > > > guest, and those writes then map to an ioctl. Typically the guest sets up one
> > > > > > > > device at a time, and each setup step causes a synchronize_srcu()---and
> > > > > > > > expedited at that.
> > > > > > > I was afraid of something like that...
> > > > > > >
> > > > > > > > KVM has two SRCUs:
> > > > > > > >
> > > > > > > > 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
> > > > > > > > that are very short, but it needs extremely fast detection of grace
> > > > > > > > periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
> > > > > > > > KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are
> > > > > > > > not so frequent.
> > > > > > > >
> > > > > > > > 2) kvm->srcu is nastier because there are readers all the time. The
> > > > > > > > read-side critical sections are still short-ish, but they need the sleepable
> > > > > > > > part because they access user memory.
> > > > > > > Which one of these two is in play in this case?
> > > > > > >
> > > > > > > > Writers are not frequent per se; the problem is they come in very large
> > > > > > > > bursts when a guest boots. And while the whole boot path overall can be
> > > > > > > > quadratic, O(n) expensive calls to synchronize_srcu() can have a larger
> > > > > > > > impact on runtime than the O(n^2) parts, as demonstrated here.
> > > > > > > >
> > > > > > > > Therefore, we operated on the assumption that the callers of
> > > > > > > > synchronize_srcu_expedited() were _anyway_ busy running CPU-bound guest code
> > > > > > > > and the desire was to get past the booting phase as fast as possible. If
> > > > > > > > the guest wants to eat host CPU it can "for(;;)" as much as it wants;
> > > > > > > > therefore, as long as expedited GPs didn't eat CPU *throughout the whole
> > > > > > > > system*, a preemptable busy wait in synchronize_srcu_expedited() was not
> > > > > > > > problematic.
> > > > > > > >
> > > > > > > > These assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu
> > > > > > > > were introduced (respectively in 2009 and 2014). But perhaps they do
> > > > > > > > not hold anymore now that each SRCU is not as independent as it used to be
> > > > > > > > in those years, and instead uses workqueues?
> > > > > > > The problem was not internal to SRCU, but rather due to the fact
> > > > > > > that kernel live patching (KLP) had problems with the CPU-bound tasks
> > > > > > > resulting from repeated synchronize_rcu_expedited() invocations. So I
> > > > > > > added heuristics to get the occasional sleep in there for KLP's benefit.
> > > > > > > Perhaps these heuristics need to be less aggressive about adding sleep.
> > > > > > >
> > > > > > > These heuristics have these aspects:
> > > > > > >
> > > > > > > 1. The longer readers persist in an expedited SRCU grace period,
> > > > > > > the longer the wait between successive checks of the reader
> > > > > > > state. Roughly speaking, we wait as long as the grace period
> > > > > > > has currently been in effect, capped at ten jiffies.
> > > > > > >
> > > > > > > 2. SRCU grace periods have several phases. We reset so that each
> > > > > > > phase starts by not waiting (new phase, new set of readers,
> > > > > > > so don't penalize this set for the sins of the previous set).
> > > > > > > But once we get to the point of adding delay, we add the
> > > > > > > delay based on the beginning of the full grace period.
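Heuristic 1 above can be sketched as a small standalone C helper (the name `sketch_check_delay` and the exact formula are illustrative, not the kernel's code): the per-check wait tracks how long the grace period has currently been running, capped at ten jiffies.

```c
/* Illustrative sketch of heuristic 1: the wait between successive
 * checks of reader state grows with the age of the grace period,
 * capped at ten jiffies. Not the kernel implementation. */
#define SKETCH_MAX_INTERVAL 10 /* cap, in jiffies */

static unsigned long sketch_check_delay(unsigned long now_jiffies,
                                        unsigned long gp_start)
{
    unsigned long age = now_jiffies - gp_start; /* how long the GP has run */

    return age < SKETCH_MAX_INTERVAL ? age : SKETCH_MAX_INTERVAL;
}
```

A young grace period gets near-zero waits; a long-running one settles at the ten-jiffy cap.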
> > > > > > >
> > > > > > > Right now, the checking for grace-period length does not allow for the
> > > > > > > possibility that a grace period might start just before the jiffies
> > > > > > > counter gets incremented (because I didn't realize that anyone cared),
> > > > > > > so that is one possible thing to change. I can also allow more no-delay
> > > > > > > checks per SRCU grace-period phase.
> > > > > > >
> > > > > > > Zhangfei, does something like the patch shown below help?
> > > > > > >
> > > > > > > Additional adjustments are likely needed to avoid re-breaking KLP,
> > > > > > > but we have to start somewhere...
> > > > > > >
> > > > > > > Thanx, Paul
> > > > > > >
> > > > > > > ------------------------------------------------------------------------
> > > > > > >
> > > > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > > > > > > index 50ba70f019dea..6a354368ac1d1 100644
> > > > > > > --- a/kernel/rcu/srcutree.c
> > > > > > > +++ b/kernel/rcu/srcutree.c
> > > > > > > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> > > > > > > 
> > > > > > >  #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
> > > > > > >  #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
> > > > > > > -#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
> > > > > > > +#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
> > > > > > >  #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
> > > > > > > 
> > > > > > >  /*
> > > > > > > @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> > > > > > >   */
> > > > > > >  static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> > > > > > >  {
> > > > > > > +	unsigned long gpstart;
> > > > > > > +	unsigned long j;
> > > > > > >  	unsigned long jbase = SRCU_INTERVAL;
> > > > > > > 
> > > > > > >  	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > > > > > >  		jbase = 0;
> > > > > > > -	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> > > > > > > -		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> > > > > > > +	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> > > > > > > +		j = jiffies - 1;
> > > > > > > +		gpstart = READ_ONCE(ssp->srcu_gp_start);
> > > > > > > +		if (time_after(j, gpstart))
> > > > > > > +			jbase += j - gpstart;
> > > > > > > +	}
> > > > > > >  	if (!jbase) {
> > > > > > >  		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> > > > > > >  		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
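The key change in the patch is charging the grace period for `(jiffies - 1) - gp_start` rather than `jiffies - gp_start`, so a grace period that started just before a jiffies tick is not immediately counted as one jiffy old. A standalone sketch of that age computation (hypothetical helper name; counter wrap ignored, which the real code handles via `time_after()`):

```c
/* Sketch of the patched age computation in srcu_get_delay(): subtract
 * one jiffy before comparing, so a GP that began just before the tick
 * contributes no delay. Hypothetical standalone helper; ignores
 * counter wrap, which the kernel handles with time_after(). */
static unsigned long sketch_gp_age(unsigned long now_jiffies,
                                   unsigned long gp_start)
{
    unsigned long j = now_jiffies - 1;

    return j > gp_start ? j - gp_start : 0;
}
```

With the old `jiffies - gp_start` form, a GP started at jiffy 100 and observed at jiffy 101 already counts as one jiffy old; this sketch returns 0 for that case.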
> > > > > > Unfortunately, this patch does not help.
> > > > > >
> > > > > > So I re-added the debug info.
> > > > > >
> > > > > > During the qemu boot
> > > > > > [ 232.997667] __synchronize_srcu loop=1000
> > > > > >
> > > > > > [ 361.094493] __synchronize_srcu loop=9000
> > > > > > [ 361.094501] Call trace:
> > > > > > [ 361.094502] dump_backtrace+0xe4/0xf0
> > > > > > [ 361.094505] show_stack+0x20/0x70
> > > > > > [ 361.094507] dump_stack_lvl+0x8c/0xb8
> > > > > > [ 361.094509] dump_stack+0x18/0x34
> > > > > > [ 361.094511] __synchronize_srcu+0x120/0x128
> > > > > > [ 361.094514] synchronize_srcu_expedited+0x2c/0x40
> > > > > > [ 361.094515] kvm_swap_active_memslots+0x130/0x198
> > > > > > [ 361.094519] kvm_activate_memslot+0x40/0x68
> > > > > > [ 361.094520] kvm_set_memslot+0x2f8/0x3b0
> > > > > > [ 361.094523] __kvm_set_memory_region+0x2e4/0x438
> > > > > > [ 361.094524] kvm_set_memory_region+0x78/0xb8
> > > > > > [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0
> > > > > > [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8
> > > > > > [ 361.094530] invoke_syscall+0x4c/0x110
> > > > > > [ 361.094533] el0_svc_common.constprop.0+0x68/0x128
> > > > > > [ 361.094536] do_el0_svc+0x34/0xc0
> > > > > > [ 361.094538] el0_svc+0x30/0x98
> > > > > > [ 361.094541] el0t_64_sync_handler+0xb8/0xc0
> > > > > > [ 361.094544] el0t_64_sync+0x18c/0x190
> > > > > > [ 363.942817] kvm_set_memory_region loop=6000
> > > > > Huh.
> > > > >
> > > > > One possibility is that the "if (!jbase)" block needs to be nested
> > > > > within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block.
> > > I tested this diff and it did NOT help:
> > >
> > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > > index 50ba70f019de..36286a4b74e6 100644
> > > --- a/kernel/rcu/srcutree.c
> > > +++ b/kernel/rcu/srcutree.c
> > > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> > > 
> > >  #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
> > >  #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
> > > -#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
> > > +#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
> > >  #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
> > > 
> > >  /*
> > > @@ -522,16 +522,23 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> > >   */
> > >  static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> > >  {
> > > +	unsigned long gpstart;
> > > +	unsigned long j;
> > >  	unsigned long jbase = SRCU_INTERVAL;
> > > 
> > >  	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > >  		jbase = 0;
> > > -	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> > > -		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> > > -	if (!jbase) {
> > > -		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> > > -		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> > > -			jbase = 1;
> > > +	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> > > +		j = jiffies - 1;
> > > +		gpstart = READ_ONCE(ssp->srcu_gp_start);
> > > +		if (time_after(j, gpstart))
> > > +			jbase += j - gpstart;
> > > +
> > > +		if (!jbase) {
> > > +			WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> > > +			if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> > > +				jbase = 1;
> > > +		}
> > >  	}
> > That is in fact what I was intending you to test, thank you. As you
> > say, unfortunately it did not help.
> >
> > Could you please test removing the "if (!jbase)" block entirely?
> Removing the "if (!jbase)" block is much faster; I did not measure it
> precisely, but qemu (with the debug version of EFI) seems to boot normally.
>
> From log timestamp:
> [ 114.624713] __synchronize_srcu loop=1000
> [ 124.157011] __synchronize_srcu loop=9000
>
> The timestamps differ across the several methods:
>
> 5.19-rc1
> [ 94.271350] __synchronize_srcu loop=1001
> [ 222.621659] __synchronize_srcu loop=9001
>
>
> With your first diff:
> [ 232.997667] __synchronize_srcu loop=1000
> [ 361.094493] __synchronize_srcu loop=9000
>
> Remove "if (!jbase)" block
> [ 114.624713] __synchronize_srcu loop=1000
> [ 124.157011] __synchronize_srcu loop=9000
>
>
> 5.18 method
> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> + return 0;
> + return SRCU_INTERVAL;
>
> [ 68.938297] __synchronize_srcu loop=1000
> [ 74.598480] __synchronize_srcu loop=9000
Thank you for the information!
What happens if you keep that "if (!jbase)" block, but set the
value of the SRCU_MAX_NODELAY_PHASE macro very large, say 1000000?
This would be too large for KLP, but my hope is that there is a value
of SRCU_MAX_NODELAY_PHASE that works for everyone. But first, does
this help at all? ;-)
> > > > And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the
> > > > above change reduces the overhead by more than an order of magnitude.
> > > > Except that the overhead of the series is far less than one second,
> > > > not the several minutes that you are seeing. So the per-call overhead
> > > > decreases from about 17 microseconds to a bit more than one microsecond.
> > > >
> > > > I could imagine an extra order of magnitude if you are running HZ=100
> > > > instead of the HZ=1000 that I am running. But that only gets up to a
> > > > few seconds.
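The HZ dependence Paul mentions is simple arithmetic: a delay expressed in jiffies is HZ-dependent in wall-clock terms. A quick sketch (illustrative helper name; assumes HZ divides 1000 evenly):

```c
/* One jiffy is 1000/HZ milliseconds, so the same jiffy-count delay is
 * 10x longer in wall-clock time at HZ=100 than at HZ=1000, and 4x
 * longer at HZ=250 (the reporter's configuration). Illustrative only;
 * assumes HZ divides 1000 evenly. */
static unsigned int sketch_jiffies_to_ms(unsigned int njiffies, unsigned int hz)
{
    return njiffies * (1000u / hz);
}
```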
> > One possible reason for the difference would be if your code has
> > SRCU readers.
> >
> > Could you please tell me the value of CONFIG_HZ on your system?
> > Also the value of CONFIG_PREEMPTION?
> I am using arch/arm64/configs/defconfig
> make defconfig
> CONFIG_PREEMPTION=y
> CONFIG_HZ_250=y
Thank you again!
And if there is a good value of SRCU_MAX_NODELAY_PHASE, it might depend
on HZ. And who knows what all else...
Thanx, Paul
> Thanks
>
> >
> > > > > One additional debug is to apply the patch below on top of the one you
> > > apply the patch below?
> > > > > just now kindly tested, then use whatever debug technique you wish to
> > > > > work out what fraction of the time during that critical interval that
> > > > > srcu_get_delay() returns non-zero.
> > > Sorry, I am confused, no patch right?
> > Apologies, my omission.
> >
> > > Just measure when srcu_get_delay() returns non-zero?
> > Exactly, please!
> >
> > > By the way, the issue should only be related to qemu acpi, not to the
> > > rmr feature.
> > > Test with: https://github.com/qemu/qemu/tree/stable-6.1
> > >
> > > It looks like it is caused by too many kvm_region_add & kvm_region_del
> > > calls if acpi=force. Without acpi, kvm_region_add/del is not printed
> > > at all (printed once per 1000 calls).
> > >
> > > If with acpi=force,
> > > During qemu boot
> > > kvm_region_add region_add = 1000
> > > kvm_region_del region_del = 1000
> > > kvm_region_add region_add = 2000
> > > kvm_region_del region_del = 2000
> > > kvm_region_add region_add = 3000
> > > kvm_region_del region_del = 3000
> > > kvm_region_add region_add = 4000
> > > kvm_region_del region_del = 4000
> > > kvm_region_add region_add = 5000
> > > kvm_region_del region_del = 5000
> > > kvm_region_add region_add = 6000
> > > kvm_region_del region_del = 6000
> > >
> > > kvm_region_add/kvm_region_del ->
> > > kvm_set_phys_mem->
> > > kvm_set_user_memory_region->
> > > kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem)
> > >
> > > [ 361.094493] __synchronize_srcu loop=9000
> > > [ 361.094501] Call trace:
> > > [ 361.094502] dump_backtrace+0xe4/0xf0
> > > [ 361.094505] show_stack+0x20/0x70
> > > [ 361.094507] dump_stack_lvl+0x8c/0xb8
> > > [ 361.094509] dump_stack+0x18/0x34
> > > [ 361.094511] __synchronize_srcu+0x120/0x128
> > > [ 361.094514] synchronize_srcu_expedited+0x2c/0x40
> > > [ 361.094515] kvm_swap_active_memslots+0x130/0x198
> > > [ 361.094519] kvm_activate_memslot+0x40/0x68
> > > [ 361.094520] kvm_set_memslot+0x2f8/0x3b0
> > > [ 361.094523] __kvm_set_memory_region+0x2e4/0x438
> > > [ 361.094524] kvm_set_memory_region+0x78/0xb8
> > > [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0
> > > [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8
> > > [ 361.094530] invoke_syscall+0x4c/0x110
> > > [ 361.094533] el0_svc_common.constprop.0+0x68/0x128
> > > [ 361.094536] do_el0_svc+0x34/0xc0
> > > [ 361.094538] el0_svc+0x30/0x98
> > > [ 361.094541] el0t_64_sync_handler+0xb8/0xc0
> > > [ 361.094544] el0t_64_sync+0x18c/0x190
> > > [ 363.942817] kvm_set_memory_region loop=6000
> > Good to know, thank you!
> >
> > Thanx, Paul
>
Hi,
On 6/14/2022 2:25 AM, Shameerali Kolothum Thodi wrote:
>
>
>> -----Original Message-----
>> From: Paul E. McKenney [mailto:[email protected]]
>> Sent: 13 June 2022 15:59
>> To: [email protected]
>> Cc: Paolo Bonzini <[email protected]>; Zhangfei Gao
>> <[email protected]>; [email protected];
>> [email protected]; Lai Jiangshan <[email protected]>; Josh Triplett
>> <[email protected]>; Mathieu Desnoyers
>> <[email protected]>; Matthew Wilcox <[email protected]>;
>> Shameerali Kolothum Thodi <[email protected]>;
>> [email protected]; Auger Eric <[email protected]>
>> Subject: Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and
>> blocking readers from consuming CPU) cause qemu boot slow
>>
>> On Mon, Jun 13, 2022 at 09:23:50PM +0800, [email protected] wrote:
>>> [...]
>> Thank you for the information!
>>
>> What happens if you keep that "if (!jbase)" block, but set the
>> value of the SRCU_MAX_NODELAY_PHASE macro very large, say 1000000?
>
> From the setup I have, this is almost the same as with the previous logic (without
> the "if (!jbase)"). In both cases, I think we are not close to 5.18, but definitely much
> better compared to 5.19-rc1.
>
> The numbers from my test setup(CONFIG_HZ_250, CONFIG_PREEMPTION=y),
>
> Guest boot time (using 'time'):
>
> 5.18-rc4 based ~8sec
>
> 5.19-rc1 ~2m43sec
>
> 5.19-rc1+fix1 ~19sec
>
> 5.19-rc1-fix2 ~19sec
>
If you try the below diff on top of either 5.19-rc1+fix1 or 5.19-rc1-fix2,
does it show any difference in boot time?
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct *ssp, struct srcu_node *snp
  */
 static void srcu_gp_end(struct srcu_struct *ssp)
 {
-	unsigned long cbdelay;
+	unsigned long cbdelay = 1;
 	bool cbs;
 	bool last_lvl;
 	int cpu;
@@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
 	spin_lock_irq_rcu_node(ssp);
 	idx = rcu_seq_state(ssp->srcu_gp_seq);
 	WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
-	cbdelay = !!srcu_get_delay(ssp);
+	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
+		cbdelay = 0;
+
 	WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
Thanks
Neeraj
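Neeraj's diff effectively replaces the `srcu_get_delay()` call with a direct check of whether an expedited grace period has been requested. A standalone sketch of that decision (field accesses flattened into parameters; `ULONG_CMP_LT` modeled as a wrap-safe signed comparison):

```c
/* Sketch of the cbdelay decision in the diff above: callbacks are
 * delayed by one jiffy unless an expedited GP beyond the current
 * sequence number has been requested. Standalone illustration; the
 * kernel reads these fields from srcu_struct with READ_ONCE(). */
#define SKETCH_ULONG_CMP_LT(a, b) ((long)((a) - (b)) < 0) /* wrap-safe */

static unsigned long sketch_cb_delay(unsigned long gp_seq,
                                     unsigned long gp_seq_needed_exp)
{
    return SKETCH_ULONG_CMP_LT(gp_seq, gp_seq_needed_exp) ? 0 : 1;
}
```

The wrap-safe comparison matters because the sequence counters are free-running and may wrap past ULONG_MAX.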
> I will wait for Zhangfei to confirm this on his setup, especially the difference
> compared to 5.18.
>
> Thanks,
> Shameer
>
>> [...]
On 2022/6/14 8:19 PM, Neeraj Upadhyay wrote:
>
>>
>> 5.18-rc4 based ~8sec
>>
>> 5.19-rc1 ~2m43sec
>>
>> 5.19-rc1+fix1 ~19sec
>>
>> 5.19-rc1-fix2 ~19sec
>>
>
> If you try below diff on top of either 5.19-rc1+fix1 or 5.19-rc1-fix2
> ; does it show any difference in boot time?
>
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct *ssp, struct srcu_node *snp
>   */
>  static void srcu_gp_end(struct srcu_struct *ssp)
>  {
> -       unsigned long cbdelay;
> +       unsigned long cbdelay = 1;
>         bool cbs;
>         bool last_lvl;
>         int cpu;
> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
>         spin_lock_irq_rcu_node(ssp);
>         idx = rcu_seq_state(ssp->srcu_gp_seq);
>         WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
> -       cbdelay = !!srcu_get_delay(ssp);
> +       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> +               cbdelay = 0;
> +
>         WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
>
Test here:
qemu: https://github.com/qemu/qemu/tree/stable-6.1
kernel:
https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test
(in case the test patches are not clear, they are pushed to the git tree)
Hardware: aarch64
1. 5.18-rc6
real 0m8.402s
user 0m3.015s
sys 0m1.102s
2. 5.19-rc1
real 2m41.433s
user 0m3.097s
sys 0m1.177s
3. 5.19-rc1 + fix1 from Paul
real 2m43.404s
user 0m2.880s
sys 0m1.214s
4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
real 0m15.262s
user 0m3.003s
sys 0m1.033s
When building the kernel at the same time, the load time becomes longer.
5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
real 0m15.215s
user 0m2.942s
sys 0m1.172s
6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end
real 1m23.936s
user 0m2.969s
sys 0m1.181s
More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5
Thanks
On 6/14/2022 7:33 PM, [email protected] wrote:
>
>
> On 2022/6/14 8:19 PM, Neeraj Upadhyay wrote:
>>
>>>
>>> 5.18-rc4 based ~8sec
>>>
>>> 5.19-rc1 ~2m43sec
>>>
>>> 5.19-rc1+fix1 ~19sec
>>>
>>> 5.19-rc1-fix2 ~19sec
>>>
>>
>> If you try below diff on top of either 5.19-rc1+fix1 or 5.19-rc1-fix2
>> ; does it show any difference in boot time?
>>
>> --- a/kernel/rcu/srcutree.c
>> +++ b/kernel/rcu/srcutree.c
>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct
>> srcu_struct *ssp, struct srcu_node *snp
>> */
>> static void srcu_gp_end(struct srcu_struct *ssp)
>> {
>> - unsigned long cbdelay;
>> + unsigned long cbdelay = 1;
>> bool cbs;
>> bool last_lvl;
>> int cpu;
>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
>> spin_lock_irq_rcu_node(ssp);
>> idx = rcu_seq_state(ssp->srcu_gp_seq);
>> WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
>> - cbdelay = !!srcu_get_delay(ssp);
>> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
>> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>> + cbdelay = 0;
>> +
>> WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
>>
> Test here:
> qemu: https://github.com/qemu/qemu/tree/stable-6.1
> kernel:
> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test
> (in case the test patches are not clear, they are pushed to the git tree)
>
> Hardware: aarch64
>
> 1. 5.18-rc6
> real 0m8.402s
> user 0m3.015s
> sys 0m1.102s
>
> 2. 5.19-rc1
> real 2m41.433s
> user 0m3.097s
> sys 0m1.177s
>
> 3. 5.19-rc1 + fix1 from Paul
> real 2m43.404s
> user 0m2.880s
> sys 0m1.214s
>
> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
> real 0m15.262s
> user 0m3.003s
> sys 0m1.033s
>
> When building the kernel at the same time, the load time becomes longer.
>
> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
> real 0m15.215s
> user 0m2.942s
> sys 0m1.172s
>
> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end
> real 1m23.936s
> user 0m2.969s
> sys 0m1.181s
>
Thanks for this data. Can you please also share the below test combos?
7. 5.19-rc1 + fix5: fix2 + Neeraj's change of srcu_gp_end
8. 5.19-rc1 + fix6: fix3 + Neeraj's change of srcu_gp_end
Thanks
Neeraj
> More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5
>
> Thanks
>
On Tue, Jun 14, 2022 at 10:03:35PM +0800, [email protected] wrote:
>
>
> On 2022/6/14 8:19 PM, Neeraj Upadhyay wrote:
> >
> > >
> > > 5.18-rc4 based ~8sec
> > >
> > > 5.19-rc1 ~2m43sec
> > >
> > > 5.19-rc1+fix1 ~19sec
> > >
> > > 5.19-rc1-fix2 ~19sec
> > >
> >
> > If you try below diff on top of either 5.19-rc1+fix1 or 5.19-rc1-fix2 ;
> > does it show any difference in boot time?
> >
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct
> > *ssp, struct srcu_node *snp
> > */
> > static void srcu_gp_end(struct srcu_struct *ssp)
> > {
> > - unsigned long cbdelay;
> > + unsigned long cbdelay = 1;
> > bool cbs;
> > bool last_lvl;
> > int cpu;
> > @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
> > spin_lock_irq_rcu_node(ssp);
> > idx = rcu_seq_state(ssp->srcu_gp_seq);
> > WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
> > - cbdelay = !!srcu_get_delay(ssp);
> > + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
> > READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > + cbdelay = 0;
> > +
> > WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
Thank you both for the testing and the proposed fix!
> Test here:
> qemu: https://github.com/qemu/qemu/tree/stable-6.1
> kernel:
> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test
> (in case the test patches are not clear, they are pushed to the git tree)
>
> Hardware: aarch64
>
> 1. 5.18-rc6
> real 0m8.402s
> user 0m3.015s
> sys 0m1.102s
>
> 2. 5.19-rc1
> real 2m41.433s
> user 0m3.097s
> sys 0m1.177s
>
> 3. 5.19-rc1 + fix1 from Paul
> real 2m43.404s
> user 0m2.880s
> sys 0m1.214s
>
> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
> real 0m15.262s
> user 0m3.003s
> sys 0m1.033s
>
> When building the kernel at the same time, the load time becomes longer.
>
> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
> real 0m15.215s
> user 0m2.942s
> sys 0m1.172s
>
> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end
> real 1m23.936s
> user 0m2.969s
> sys 0m1.181s
And thank you for the testing!
Could you please try fix3 + Neeraj's change of srcu_gp_end?
That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of
srcu_gp_end.
Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot
times start rising? This is probably best done by starting with
SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run
until boot time becomes slow, followed by a binary search between the
last two values. (The idea is to bias the search so that fast boot
times are the common case.)
> More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5
And thank you for these details.
Thanx, Paul
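The search procedure Paul describes above can be sketched as a small driver. This is a hypothetical harness, not from the thread: `measure` stands in for an actual kernel rebuild and timed guest boot with the given SRCU_MAX_NODELAY_PHASE value, and the 30-second "slow" cutoff is an assumed threshold.

```python
def find_threshold(measure, start=100_000, slow=30.0):
    """Bracket the smallest 'fast' SRCU_MAX_NODELAY_PHASE value.

    measure(phase) -> boot time in seconds with SRCU_MAX_NODELAY_PHASE=phase.
    Returns (last_slow, first_fast), biased so fast boots are the common case.
    """
    # Phase 1: divide by ten until the boot becomes slow.
    lo = hi = start
    while lo >= 1 and measure(lo) < slow:
        hi = lo           # last known-fast value
        lo //= 10         # next candidate, ten times smaller
    # Phase 2: binary search between the last fast and first slow values.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if measure(mid) < slow:
            hi = mid
        else:
            lo = mid
    return lo, hi
```

With the boot times reported later in the thread (fast at 1000, slow at 100), this would take roughly four coarse runs plus about ten bisection runs.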
On 2022/6/14 10:14 PM, Neeraj Upadhyay wrote:
>
>
> On 6/14/2022 7:33 PM, [email protected] wrote:
>>
>>
>> On 2022/6/14 8:19 PM, Neeraj Upadhyay wrote:
>>>
>>>>
>>>> 5.18-rc4 based ~8sec
>>>>
>>>> 5.19-rc1 ~2m43sec
>>>>
>>>> 5.19-rc1+fix1 ~19sec
>>>>
>>>> 5.19-rc1-fix2 ~19sec
>>>>
>>>
>>> If you try below diff on top of either 5.19-rc1+fix1 or
>>> 5.19-rc1-fix2 ; does it show any difference in boot time?
>>>
>>> --- a/kernel/rcu/srcutree.c
>>> +++ b/kernel/rcu/srcutree.c
>>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct
>>> srcu_struct *ssp, struct srcu_node *snp
>>> */
>>> static void srcu_gp_end(struct srcu_struct *ssp)
>>> {
>>> - unsigned long cbdelay;
>>> + unsigned long cbdelay = 1;
>>> bool cbs;
>>> bool last_lvl;
>>> int cpu;
>>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
>>> spin_lock_irq_rcu_node(ssp);
>>> idx = rcu_seq_state(ssp->srcu_gp_seq);
>>> WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
>>> - cbdelay = !!srcu_get_delay(ssp);
>>> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
>>> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>> + cbdelay = 0;
>>> +
>>> WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
>>>
>> Test here:
>> qemu: https://github.com/qemu/qemu/tree/stable-6.1
>> kernel:
>> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test
>> (in case the test patches are not clear, they are pushed to the git tree)
>>
>> Hardware: aarch64
>>
>> 1. 5.18-rc6
>> real 0m8.402s
>> user 0m3.015s
>> sys 0m1.102s
>>
>> 2. 5.19-rc1
>> real 2m41.433s
>> user 0m3.097s
>> sys 0m1.177s
>>
>> 3. 5.19-rc1 + fix1 from Paul
>> real 2m43.404s
>> user 0m2.880s
>> sys 0m1.214s
>>
>> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
>> real 0m15.262s
>> user 0m3.003s
>> sys 0m1.033s
>>
>> When building the kernel at the same time, the load time becomes longer.
>>
>> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
>> real 0m15.215s
>> user 0m2.942s
>> sys 0m1.172s
>>
>> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end
>> real 1m23.936s
>> user 0m2.969s
>> sys 0m1.181s
>>
>
7. 5.19-rc1 + fix5: fix4 + Remove "if (!jbase)" block
real 0m11.418s
user 0m3.031s
sys 0m1.067s
8. 5.19-rc1 + fix 6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000
real 0m11.154s ~12s
user 0m2.919s
sys 0m1.064s
Thanks
> Thanks for this data. Can you please also share the below test combos?
>
> 7. 5.19-rc1 + fix5: fix2 + Neeraj's change of srcu_gp_end
>
>
> 8. 5.19-rc1 + fix6: fix3 + Neeraj's change of srcu_gp_end
>
>
> Thanks
> Neeraj
>
>> More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5
>>
>> Thanks
>>
On Mon, 13 Jun 2022 07:55:47 +0100,
"[email protected]" <[email protected]> wrote:
>
> Hi, Paul
>
> On 2022/6/13 12:16 PM, Paul E. McKenney wrote:
> > On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote:
> >> On Mon, Jun 13, 2022 at 11:04:39AM +0800, [email protected] wrote:
> >>> Hi, Paul
> >>>
> >>> On 2022/6/13 2:49 AM, Paul E. McKenney wrote:
> >>>> On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
> >>>>> On 6/12/22 18:40, Paul E. McKenney wrote:
> >>>>>>> Do these reserved memory regions really need to be allocated separately?
> >>>>>>> (For example, are they really all non-contiguous? If not, that is, if
> >>>>>>> there are a lot of contiguous memory regions, could you sort the IORT
> >>>>>>> by address and do one ioctl() for each set of contiguous memory regions?)
> >>>>>>>
> >>>>>>> Are all of these reserved memory regions set up before init is spawned?
> >>>>>>>
> >>>>>>> Are all of these reserved memory regions set up while there is only a
> >>>>>>> single vCPU up and running?
> >>>>>>>
> >>>>>>> Is the SRCU grace period really needed in this case? (I freely confess
> >>>>>>> to not being all that familiar with KVM.)
> >>>>>> Oh, and there was a similar many-requests problem with networking many
> >>>>>> years ago. This was solved by adding a new syscall/ioctl()/whatever
> >>>>>> that permitted many requests to be presented to the kernel with a single
> >>>>>> system call.
> >>>>>>
> >>>>>> Could a new ioctl() be introduced that requested a large number
> >>>>>> of these memory regions in one go so as to make each call to
> >>>>>> synchronize_rcu_expedited() cover a useful fraction of your 9000+
> >>>>>> requests? Adding a few of the KVM guys on CC for their thoughts.
> >>>>> Unfortunately not. Apart from this specific case, in general the calls to
> >>>>> KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
> >>>>> guest, and those writes then map to an ioctl. Typically the guest sets up a
> >>>>> device at a time, and each setup step causes a synchronize_srcu()---and
> >>>>> expedited at that.
> >>>> I was afraid of something like that...
> >>>>
> >>>>> KVM has two SRCUs:
> >>>>>
> >>>>> 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
> >>>>> that are very very small, but it needs extremely fast detection of grace
> >>>>> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
> >>>>> KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are
> >>>>> not so frequent.
> >>>>>
> >>>>> 2) kvm->srcu is nastier because there are readers all the time. The
> >>>>> read-side critical section are still short-ish, but they need the sleepable
> >>>>> part because they access user memory.
> >>>> Which one of these two is in play in this case?
> >>>>
> >>>>> Writers are not frequent per se; the problem is they come in very large
> >>>>> bursts when a guest boots. And while the whole boot path overall can be
> >>>>> quadratic, O(n) expensive calls to synchronize_srcu() can have a larger
> >>>>> impact on runtime than the O(n^2) parts, as demonstrated here.
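To make Paolo's complexity argument above concrete, here is a toy calculation; both per-operation costs are assumptions chosen only to illustrate the shape of the argument, not measurements from this thread:

```python
# n memslot updates at boot; each one triggers a synchronize_srcu_expedited().
n = 9_000

# Assumed costs (illustrative): a grace period that sleeps for roughly one
# jiffy at HZ=100 costs ~10 ms, while one step of the O(n^2) bookkeeping in
# the boot path costs ~50 ns.
gp_wait = 10e-3
step = 50e-9

linear_cost = n * gp_wait        # O(n) expensive calls: 90 s
quadratic_cost = n * n * step    # O(n^2) cheap steps:  ~4 s

# The O(n) term dominates, matching the "minutes of boot time" symptom.
print(linear_cost, quadratic_cost)
```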
> >>>>>
> >>>>> Therefore, we operated on the assumption that the callers of
> >>>>> synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code
> >>>>> and the desire was to get past the booting phase as fast as possible. If
> >>>>> the guest wants to eat host CPU it can "for(;;)" as much as it wants;
> >>>>> therefore, as long as expedited GPs didn't eat CPU *throughout the whole
> >>>>> system*, a preemptable busy wait in synchronize_srcu_expedited() were not
> >>>>> problematic.
> >>>>>
> >>>>> These assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu
> >>>>> were introduced (respectively in 2009 and 2014). But perhaps they do
> >>>>> not hold anymore, now that each SRCU is not as independent as it used to be
> >>>>> in those years and uses workqueues instead?
> >>>> The problem was not internal to SRCU, but rather due to the fact
> >>>> that kernel live patching (KLP) had problems with the CPU-bound tasks
> >>>> resulting from repeated synchronize_rcu_expedited() invocations. So I
> >>>> added heuristics to get the occasional sleep in there for KLP's benefit.
> >>>> Perhaps these heuristics need to be less aggressive about adding sleep.
> >>>>
> >>>> These heuristics have these aspects:
> >>>>
> >>>> 1. The longer readers persist in an expedited SRCU grace period,
> >>>> the longer the wait between successive checks of the reader
> >>>> state. Roughly speaking, we wait as long as the grace period
> >>>> has currently been in effect, capped at ten jiffies.
> >>>>
> >>>> 2. SRCU grace periods have several phases. We reset so that each
> >>>> phase starts by not waiting (new phase, new set of readers,
> >>>> so don't penalize this set for the sins of the previous set).
> >>>> But once we get to the point of adding delay, we add the
> >>>> delay based on the beginning of the full grace period.
> >>>>
> >>>> Right now, the checking for grace-period length does not allow for the
> >>>> possibility that a grace period might start just before the jiffies
> >>>> counter gets incremented (because I didn't realize that anyone cared),
> >>>> so that is one possible thing to change. I can also allow more no-delay
> >>>> checks per SRCU grace-period phase.
> >>>>
> >>>> Zhangfei, does something like the patch shown below help?
> >>>>
> >>>> Additional adjustments are likely needed to avoid re-breaking KLP,
> >>>> but we have to start somewhere...
> >>>>
> >>>> Thanx, Paul
> >>>>
> >>>> ------------------------------------------------------------------------
> >>>>
> >>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> >>>> index 50ba70f019dea..6a354368ac1d1 100644
> >>>> --- a/kernel/rcu/srcutree.c
> >>>> +++ b/kernel/rcu/srcutree.c
> >>>> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> >>>> #define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending.
> >>>> #define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers.
> >>>> -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase consecutive no-delay instances.
> >>>> +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances.
> >>>> #define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances.
> >>>> /*
> >>>> @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> >>>> */
> >>>> static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> >>>> {
> >>>> + unsigned long gpstart;
> >>>> + unsigned long j;
> >>>> unsigned long jbase = SRCU_INTERVAL;
> >>>> if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> >>>> jbase = 0;
> >>>> - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> >>>> - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> >>>> + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> >>>> + j = jiffies - 1;
> >>>> + gpstart = READ_ONCE(ssp->srcu_gp_start);
> >>>> + if (time_after(j, gpstart))
> >>>> + jbase += j - gpstart;
> >>>> + }
> >>>> if (!jbase) {
> >>>> WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> >>>> if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> >>> Unfortunately, this patch does not help.
> >>>
> >>> Then re-add the debug info.
> >>>
> >>> During the qemu boot
> >>> [ 232.997667] __synchronize_srcu loop=1000
> >>>
> >>> [ 361.094493] __synchronize_srcu loop=9000
> >>> [ 361.094501] Call trace:
> >>> [ 361.094502] dump_backtrace+0xe4/0xf0
> >>> [ 361.094505] show_stack+0x20/0x70
> >>> [ 361.094507] dump_stack_lvl+0x8c/0xb8
> >>> [ 361.094509] dump_stack+0x18/0x34
> >>> [ 361.094511] __synchronize_srcu+0x120/0x128
> >>> [ 361.094514] synchronize_srcu_expedited+0x2c/0x40
> >>> [ 361.094515] kvm_swap_active_memslots+0x130/0x198
> >>> [ 361.094519] kvm_activate_memslot+0x40/0x68
> >>> [ 361.094520] kvm_set_memslot+0x2f8/0x3b0
> >>> [ 361.094523] __kvm_set_memory_region+0x2e4/0x438
> >>> [ 361.094524] kvm_set_memory_region+0x78/0xb8
> >>> [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0
> >>> [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8
> >>> [ 361.094530] invoke_syscall+0x4c/0x110
> >>> [ 361.094533] el0_svc_common.constprop.0+0x68/0x128
> >>> [ 361.094536] do_el0_svc+0x34/0xc0
> >>> [ 361.094538] el0_svc+0x30/0x98
> >>> [ 361.094541] el0t_64_sync_handler+0xb8/0xc0
> >>> [ 361.094544] el0t_64_sync+0x18c/0x190
> >>> [ 363.942817] kvm_set_memory_region loop=6000
> >> Huh.
> >>
> >> One possibility is that the "if (!jbase)" block needs to be nested
> >> within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block.
>
> I tested this diff and it does NOT help.
>
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 50ba70f019de..36286a4b74e6 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>
>  #define SRCU_INTERVAL           1       // Base delay if no expedited GPs pending.
>  #define SRCU_MAX_INTERVAL       10      // Maximum incremental delay from slow readers.
> -#define SRCU_MAX_NODELAY_PHASE  1       // Maximum per-GP-phase consecutive no-delay instances.
> +#define SRCU_MAX_NODELAY_PHASE  3       // Maximum per-GP-phase consecutive no-delay instances.
>  #define SRCU_MAX_NODELAY        100     // Maximum consecutive no-delay instances.
>
>  /*
> @@ -522,16 +522,23 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>   */
>  static unsigned long srcu_get_delay(struct srcu_struct *ssp)
>  {
> +       unsigned long gpstart;
> +       unsigned long j;
>         unsigned long jbase = SRCU_INTERVAL;
>
>         if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>                 jbase = 0;
> -       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> -               jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> -       if (!jbase) {
> -               WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> -               if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> -                       jbase = 1;
> +       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> +               j = jiffies - 1;
> +               gpstart = READ_ONCE(ssp->srcu_gp_start);
> +               if (time_after(j, gpstart))
> +                       jbase += j - gpstart;
> +
> +               if (!jbase) {
> +                       WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> +                       if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> +                               jbase = 1;
> +               }
>         }
>
> > And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the
> > above change reduces the overhead by more than an order of magnitude.
> > Except that the overhead of the series is far less than one second,
> > not the several minutes that you are seeing. So the per-call overhead
> > decreases from about 17 microseconds to a bit more than one microsecond.
> >
> > I could imagine an extra order of magnitude if you are running HZ=100
> > instead of the HZ=1000 that I am running. But that only gets up to a
> > few seconds.
> >
> >> One additional debug is to apply the patch below on top of the one you
> apply the patch below?
> >> just now kindly tested, then use whatever debug technique you wish to
> >> work out what fraction of the time during that critical interval that
> >> srcu_get_delay() returns non-zero.
> Sorry, I am confused, no patch right?
> Just measure srcu_get_delay return to non-zero?
>
>
> By the way, the issue should only be related to QEMU ACPI, not
> related to the RMR feature.
No, this also occurs if you supply the guest's EFI with an empty set
of persistent variables. EFI goes and zeroes it, which results in a
read-only memslot write access being taken to userspace, the memslot
being unmapped from the guest, QEMU doing a little dance, and
eventually restoring the memslot back to the guest. Rinse, repeat.
Do that one byte at a time over 64MB, and your boot time for EFI only
goes from 39s to 3m50s (that's on a speed-challenged Synquacer box),
which completely kills the "deploy a new VM" use case.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
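A back-of-the-envelope check of the numbers Marc reports above (39 s vs 3 m 50 s): if each affected grace period absorbs roughly one extra jiffy, the implied number of slowed-down synchronize_srcu() calls lands in the same ballpark as the thousands of memslot updates seen in the traces. The jiffy length below assumes HZ=100, which is a guess, not something stated in the thread.

```python
fast_boot = 39.0               # seconds, EFI-only boot before the regression
slow_boot = 3 * 60 + 50        # 3m50s after it
extra = slow_boot - fast_boot  # 191 s of added delay

jiffy = 0.010                  # assumed HZ=100 on this box (guess)
implied_calls = extra / jiffy  # grace periods absorbing ~1 extra jiffy each

print(round(implied_calls))    # ~19100, same order as the traced memslot churn
```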
On Mon, 13 Jun 2022 16:02:30 +0100,
Shameerali Kolothum Thodi <[email protected]> wrote:
>
>
>
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]]
> > Sent: 13 June 2022 07:56
> > To: [email protected]
> > Cc: Paolo Bonzini <[email protected]>; Zhangfei Gao
> > <[email protected]>; [email protected];
> > [email protected]; Lai Jiangshan <[email protected]>; Josh Triplett
> > <[email protected]>; Mathieu Desnoyers
> > <[email protected]>; Matthew Wilcox <[email protected]>;
> > Shameerali Kolothum Thodi <[email protected]>;
> > [email protected]; Auger Eric <[email protected]>
> > Subject: Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and
> > blocking readers from consuming CPU) cause qemu boot slow
> >
>
> > By the way, the issue should only be related to QEMU ACPI, not related
> > to the RMR feature.
> > Test with: https://github.com/qemu/qemu/tree/stable-6.1
> >
> > It looks like it is caused by too many kvm_region_add & kvm_region_del
> > calls if acpi=force.
>
> Based on the setup I have, I think it has nothing to do with the Guest
> kernel booting with ACPI per se (i.e., acpi=force on the QEMU kernel cmd
> line). It is more to do with QEMU having the "-bios QEMU_EFI.fd",
> which sets up pflash devices, resulting in a large number of pflash
> read/write calls (before the Guest kernel even boots) which in turn seem
> to be triggering the below kvm_region_add/del calls.
Indeed, this is all about memslots being added/removed at an alarming
rate (well, not more alarming today than yesterday, but QEMU's flash
emulation is... interesting).
M.
--
Without deviation from the norm, progress is not possible.
On 2022/6/14 10:17 PM, Paul E. McKenney wrote:
> On Tue, Jun 14, 2022 at 10:03:35PM +0800, [email protected] wrote:
>>
>> On 2022/6/14 8:19 PM, Neeraj Upadhyay wrote:
>>>> 5.18-rc4 based ~8sec
>>>>
>>>> 5.19-rc1 ~2m43sec
>>>>
>>>> 5.19-rc1+fix1 ~19sec
>>>>
>>>> 5.19-rc1-fix2 ~19sec
>>>>
>>> If you try below diff on top of either 5.19-rc1+fix1 or 5.19-rc1-fix2 ;
>>> does it show any difference in boot time?
>>>
>>> --- a/kernel/rcu/srcutree.c
>>> +++ b/kernel/rcu/srcutree.c
>>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct
>>> *ssp, struct srcu_node *snp
>>> */
>>> static void srcu_gp_end(struct srcu_struct *ssp)
>>> {
>>> - unsigned long cbdelay;
>>> + unsigned long cbdelay = 1;
>>> bool cbs;
>>> bool last_lvl;
>>> int cpu;
>>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
>>> spin_lock_irq_rcu_node(ssp);
>>> idx = rcu_seq_state(ssp->srcu_gp_seq);
>>> WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
>>> - cbdelay = !!srcu_get_delay(ssp);
>>> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
>>> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>> + cbdelay = 0;
>>> +
>>> WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
> Thank you both for the testing and the proposed fix!
>
>> Test here:
>> qemu: https://github.com/qemu/qemu/tree/stable-6.1
>> kernel:
>> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test
>> (in case the test patches are not clear, they are pushed to the git tree)
>>
>> Hardware: aarch64
>>
>> 1. 5.18-rc6
>> real 0m8.402s
>> user 0m3.015s
>> sys 0m1.102s
>>
>> 2. 5.19-rc1
>> real 2m41.433s
>> user 0m3.097s
>> sys 0m1.177s
>>
>> 3. 5.19-rc1 + fix1 from Paul
>> real 2m43.404s
>> user 0m2.880s
>> sys 0m1.214s
>>
>> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
>> real 0m15.262s
>> user 0m3.003s
>> sys 0m1.033s
>>
>> When building the kernel at the same time, the load time becomes longer.
>>
>> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
>> real 0m15.215s
>> user 0m2.942s
>> sys 0m1.172s
>>
>> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end
>> real 1m23.936s
>> user 0m2.969s
>> sys 0m1.181s
> And thank you for the testing!
>
> Could you please try fix3 + Neeraj's change of srcu_gp_end?
>
> That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of
> srcu_gp_end.
>
> Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot
> times start rising? This is probably best done by starting with
> SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run
> until boot time becomes slow, followed by a binary search between the
> last two values. (The idea is to bias the search so that fast boot
> times are the common case.)
With SRCU_MAX_NODELAY_PHASE 100, boot becomes slower.
8. 5.19-rc1 + fix6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000
real 0m11.154s ~12s
user 0m2.919s
sys 0m1.064s
9. 5.19-rc1 + fix7: fix4 + SRCU_MAX_NODELAY_PHASE 10000
real 0m11.258s
user 0m3.113s
sys 0m1.073s
10. 5.19-rc1 + fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
real 0m30.053s ~ 32s
user 0m2.827s
sys 0m1.161s
By the way, if building the kernel on the board at the same time (consuming
memory), the time becomes much longer.
real 1m2.763s
11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
real 0m11.443s
user 0m3.022s
sys 0m1.052s
Thanks
>
>> More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5
> And thank you for these details.
>
> Thanx, Paul
Hi,
On 6/15/2022 2:33 PM, [email protected] wrote:
>
>
> On 2022/6/14 10:17 PM, Paul E. McKenney wrote:
>> On Tue, Jun 14, 2022 at 10:03:35PM +0800, [email protected] wrote:
>>>
>>> On 2022/6/14 8:19 PM, Neeraj Upadhyay wrote:
>>>>> 5.18-rc4 based ~8sec
>>>>>
>>>>> 5.19-rc1 ~2m43sec
>>>>>
>>>>> 5.19-rc1+fix1 ~19sec
>>>>>
>>>>> 5.19-rc1-fix2 ~19sec
>>>>>
>>>> If you try below diff on top of either 5.19-rc1+fix1 or 5.19-rc1-fix2 ;
>>>> does it show any difference in boot time?
>>>>
>>>> --- a/kernel/rcu/srcutree.c
>>>> +++ b/kernel/rcu/srcutree.c
>>>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct
>>>> srcu_struct
>>>> *ssp, struct srcu_node *snp
>>>> */
>>>> static void srcu_gp_end(struct srcu_struct *ssp)
>>>> {
>>>> - unsigned long cbdelay;
>>>> + unsigned long cbdelay = 1;
>>>> bool cbs;
>>>> bool last_lvl;
>>>> int cpu;
>>>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
>>>> spin_lock_irq_rcu_node(ssp);
>>>> idx = rcu_seq_state(ssp->srcu_gp_seq);
>>>> WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
>>>> - cbdelay = !!srcu_get_delay(ssp);
>>>> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
>>>> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>>> + cbdelay = 0;
>>>> +
>>>> WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
>> Thank you both for the testing and the proposed fix!
>>
>>> Test here:
>>> qemu: https://github.com/qemu/qemu/tree/stable-6.1
>>> kernel:
>>> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test
>>>
>>> (in case the test patches are not clear, they are pushed to the git tree)
>>>
>>> Hardware: aarch64
>>>
>>> 1. 5.18-rc6
>>> real 0m8.402s
>>> user 0m3.015s
>>> sys 0m1.102s
>>>
>>> 2. 5.19-rc1
>>> real 2m41.433s
>>> user 0m3.097s
>>> sys 0m1.177s
>>>
>>> 3. 5.19-rc1 + fix1 from Paul
>>> real 2m43.404s
>>> user 0m2.880s
>>> sys 0m1.214s
>>>
>>> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
>>> real 0m15.262s
>>> user 0m3.003s
>>> sys 0m1.033s
>>>
>>> When building the kernel at the same time, the load time becomes longer.
>>>
>>> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
>>> real 0m15.215s
>>> user 0m2.942s
>>> sys 0m1.172s
>>>
>>> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end
>>> real 1m23.936s
>>> user 0m2.969s
>>> sys 0m1.181s
>> And thank you for the testing!
>>
>> Could you please try fix3 + Neeraj's change of srcu_gp_end?
>>
>> That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of
>> srcu_gp_end.
>>
>> Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot
>> times start rising? This is probably best done by starting with
>> SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run
>> until boot time becomes slow, followed by a binary search between the
>> last two values. (The idea is to bias the search so that fast boot
>> times are the common case.)
>
> With SRCU_MAX_NODELAY_PHASE 100, boot becomes slower.
>
>
> 8. 5.19-rc1 + fix6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000
>
> real 0m11.154s ~12s
>
> user 0m2.919s
>
> sys 0m1.064s
>
>
>
> 9. 5.19-rc1 + fix7: fix4 + SRCU_MAX_NODELAY_PHASE 10000
>
> real 0m11.258s
>
> user 0m3.113s
>
> sys 0m1.073s
>
>
>
> 10. 5.19-rc1 + fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
>
> real 0m30.053s ~ 32s
>
> user 0m2.827s
>
> sys 0m1.161s
>
>
>
> By the way, if building the kernel on the board at the same time (consuming
> memory), the time becomes much longer.
>
> real 1m2.763s
>
>
>
> 11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
>
> real 0m11.443s
>
> user 0m3.022s
>
> sys 0m1.052s
>
>
This is useful data, thanks! Did you get a chance to check between 100 and
1000, to narrow down further at which point (it doesn't need to be an exact
value) you start seeing degradation, for example 250, 500, ...?
Is it also possible to try experiments 10 and 11 with the below diff?
What the diff does is call srcu_get_delay() only once in
try_check_zero() (and not on every loop iteration); it also retries
with a different delay for the extra iteration, which is done when
srcu_get_delay(ssp) returns 0.
Once we have this data, can you also try changing
SRCU_RETRY_CHECK_LONG_DELAY to 100 on top of the below diff?
#define SRCU_RETRY_CHECK_LONG_DELAY 100
-------------------------------------------------------------------------
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 6a354368ac1d..3aff2f3e99ab 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -620,6 +620,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
* we repeatedly block for 1-millisecond time periods.
*/
#define SRCU_RETRY_CHECK_DELAY 5
+#define SRCU_RETRY_CHECK_LONG_DELAY 5
/*
* Start an SRCU grace period.
@@ -927,12 +928,17 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
*/
static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount)
{
+ unsigned long curdelay;
+ curdelay = !srcu_get_delay(ssp);
for (;;) {
if (srcu_readers_active_idx_check(ssp, idx))
return true;
- if (--trycount + !srcu_get_delay(ssp) <= 0)
+ if (--trycount + curdelay <= 0)
return false;
- udelay(SRCU_RETRY_CHECK_DELAY);
+ if (trycount)
+ udelay(SRCU_RETRY_CHECK_DELAY);
+ else
+ udelay(SRCU_RETRY_CHECK_LONG_DELAY);
}
}
Thanks
Neeraj
> Thanks
>
>>
>>> More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5
>> And thank you for these details.
>>
>> Thanx, Paul
>
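The control flow of the try_check_zero() change in Neeraj's diff above can be modeled in userspace to see how the delay budget behaves. This is a simulation sketch, not kernel code: srcu_get_delay() and the readers are replaced by simple parameters, and SRCU_RETRY_CHECK_LONG_DELAY uses the 100 µs variant Neeraj asks to try.

```python
SRCU_RETRY_CHECK_DELAY = 5         # microseconds, as in the diff
SRCU_RETRY_CHECK_LONG_DELAY = 100  # the variant to be tested next

def try_check_zero(idle_after, trycount, expedited):
    """Model of the patched retry loop.

    idle_after: check iteration at which readers have drained.
    expedited:  models srcu_get_delay(ssp) == 0 (sampled once, up front).
    Returns (success, total microseconds spent in udelay()).
    """
    curdelay = 1 if expedited else 0   # mirrors: curdelay = !srcu_get_delay(ssp)
    spent = 0
    checks = 0
    while True:
        if checks >= idle_after:       # srcu_readers_active_idx_check() passes
            return True, spent
        trycount -= 1
        if trycount + curdelay <= 0:
            return False, spent
        # The last permitted retry (trycount == 0) uses the long delay.
        spent += SRCU_RETRY_CHECK_DELAY if trycount else SRCU_RETRY_CHECK_LONG_DELAY
        checks += 1
```

With expedited GPs the loop gets one extra check, paid for with the long delay; non-expedited callers give up one iteration sooner.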
On 6/15/2022 4:20 PM, Paolo Bonzini wrote:
> On 6/15/22 12:40, Neeraj Upadhyay wrote:
>>
>> This is useful data, thanks! Did you get chance to check between 100
>> and 1000, to narrow down further, from which point (does need to be
>> exact value) between 100 and 1000, you start seeing degradation at,
>> for ex. 250, 500 , ...?
>>
>> Is it also possible to try experiment 10 and 11 with below diff.
>> What I have done in below diff is, call srcu_get_delay() only once
>> in try_check_zero() (and not for every loop iteration); also
>> retry with a different delay for the extra iteration which is done
>> when srcu_get_delay(ssp) returns 0.
>>
>> Once we have this data, can you also try by changing
>> SRCU_RETRY_CHECK_LONG_DELAY to 100, on top of below diff.
>>
>> #define SRCU_RETRY_CHECK_LONG_DELAY 100
>
> Is there any data that you would like me to gather from the KVM side,
> for example with respect to how much it takes to do synchronize_srcu()
> on an unpatched kernel, or the duration of the read-sides?
>
Hi Paolo,
Thanks! I think the synchronize_srcu() time and the read-side duration will
help. Given that changing the SRCU_MAX_NODELAY_PHASE value has an impact
on boot time (and the impact of SRCU_MAX_NODELAY_PHASE depends on the
SRCU read-side duration), this information will be helpful
to get a better understanding of the scenario.
Thanks
Neeraj
> Thanks,
>
> Paolo
>
On 6/15/22 12:40, Neeraj Upadhyay wrote:
>
> This is useful data, thanks! Did you get chance to check between 100 and
> 1000, to narrow down further, from which point (does need to be exact
> value) between 100 and 1000, you start seeing degradation at, for ex.
> 250, 500 , ...?
>
> Is it also possible to try experiment 10 and 11 with below diff.
> What I have done in below diff is, call srcu_get_delay() only once
> in try_check_zero() (and not for every loop iteration); also
> retry with a different delay for the extra iteration which is done
> when srcu_get_delay(ssp) returns 0.
>
> Once we have this data, can you also try by changing
> SRCU_RETRY_CHECK_LONG_DELAY to 100, on top of below diff.
>
> #define SRCU_RETRY_CHECK_LONG_DELAY 100
Is there any data that you would like me to gather from the KVM side,
for example how long it takes to do synchronize_srcu()
on an unpatched kernel, or the duration of the read sides?
Thanks,
Paolo
On 2022/6/15 6:40 PM, Neeraj Upadhyay wrote:
> Hi,
>
> On 6/15/2022 2:33 PM, [email protected] wrote:
>>
>>
>> On 2022/6/14 10:17 PM, Paul E. McKenney wrote:
>>> On Tue, Jun 14, 2022 at 10:03:35PM +0800, [email protected]
>>> wrote:
>>>>
>>>> On 2022/6/14 8:19 PM, Neeraj Upadhyay wrote:
>>>>>> 5.18-rc4 based ~8sec
>>>>>>
>>>>>> 5.19-rc1 ~2m43sec
>>>>>>
>>>>>> 5.19-rc1+fix1 ~19sec
>>>>>>
>>>>>> 5.19-rc1-fix2 ~19sec
>>>>>>
>>>>> If you try below diff on top of either 5.19-rc1+fix1 or
>>>>> 5.19-rc1-fix2 ;
>>>>> does it show any difference in boot time?
>>>>>
>>>>> --- a/kernel/rcu/srcutree.c
>>>>> +++ b/kernel/rcu/srcutree.c
>>>>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct
>>>>> srcu_struct
>>>>> *ssp, struct srcu_node *snp
>>>>> */
>>>>> static void srcu_gp_end(struct srcu_struct *ssp)
>>>>> {
>>>>> - unsigned long cbdelay;
>>>>> + unsigned long cbdelay = 1;
>>>>> bool cbs;
>>>>> bool last_lvl;
>>>>> int cpu;
>>>>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
>>>>> spin_lock_irq_rcu_node(ssp);
>>>>> idx = rcu_seq_state(ssp->srcu_gp_seq);
>>>>> WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
>>>>> - cbdelay = !!srcu_get_delay(ssp);
>>>>> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
>>>>> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>>>> + cbdelay = 0;
>>>>> +
>>>>> WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
>>> Thank you both for the testing and the proposed fix!
>>>
>>>> Test here:
>>>> qemu: https://github.com/qemu/qemu/tree/stable-6.1
>>>> kernel:
>>>> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test
>>>>
>>>> (in case test patch not clear, push in git tree)
>>>>
>>>> Hardware: aarch64
>>>>
>>>> 1. 5.18-rc6
>>>> real 0m8.402s
>>>> user 0m3.015s
>>>> sys 0m1.102s
>>>>
>>>> 2. 5.19-rc1
>>>> real 2m41.433s
>>>> user 0m3.097s
>>>> sys 0m1.177s
>>>>
>>>> 3. 5.19-rc1 + fix1 from Paul
>>>> real 2m43.404s
>>>> user 0m2.880s
>>>> sys 0m1.214s
>>>>
>>>> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
>>>> real 0m15.262s
>>>> user 0m3.003s
>>>> sys 0m1.033s
>>>>
>>>> When build kernel in the meantime, load time become longer.
>>>>
>>>> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
>>>> real 0m15.215s
>>>> user 0m2.942s
>>>> sys 0m1.172s
>>>>
>>>> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end
>>>> real 1m23.936s
>>>> user 0m2.969s
>>>> sys 0m1.181s
>>> And thank you for the testing!
>>>
>>> Could you please try fix3 + Neeraj's change of srcu_gp_end?
>>>
>>> That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of
>>> srcu_gp_end.
>>>
>>> Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot
>>> times start rising? This is probably best done by starting with
>>> SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run
>>> until boot time becomes slow, followed by a binary search between the
>>> last two values. (The idea is to bias the search so that fast boot
>>> times are the common case.)
>>
>> SRCU_MAX_NODELAY_PHASE 100 becomes slower.
>>
>>
>> 8. 5.19-rc1 + fix6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000
>>
>> real 0m11.154s ~12s
>>
>> user 0m2.919s
>>
>> sys 0m1.064s
>>
>>
>>
>> 9. 5.19-rc1 + fix7: fix4 + SRCU_MAX_NODELAY_PHASE 10000
>>
>> real 0m11.258s
>>
>> user 0m3.113s
>>
>> sys 0m1.073s
>>
>>
>>
>> 10. 5.19-rc1 + fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
>>
>> real 0m30.053s ~ 32s
>>
>> user 0m2.827s
>>
>> sys 0m1.161s
>>
>>
>>
>> By the way, if build kernel on the board in the meantime (using
>> memory), time become much longer.
>>
>> real 1m2.763s
>>
>>
>>
>> 11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
>>
>> real 0m11.443s
>>
>> user 0m3.022s
>>
>> sys 0m1.052s
>>
>>
>
> This is useful data, thanks! Did you get chance to check between 100
> and 1000, to narrow down further, from which point (does need to be
> exact value) between 100 and 1000, you start seeing degradation at,
> for ex. 250, 500 , ...?
>
> Is it also possible to try experiment 10 and 11 with below diff.
> What I have done in below diff is, call srcu_get_delay() only once
> in try_check_zero() (and not for every loop iteration); also
> retry with a different delay for the extra iteration which is done
> when srcu_get_delay(ssp) returns 0.
>
> Once we have this data, can you also try by changing
> SRCU_RETRY_CHECK_LONG_DELAY to 100, on top of below diff.
>
> #define SRCU_RETRY_CHECK_LONG_DELAY 100
>
> -------------------------------------------------------------------------
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 6a354368ac1d..3aff2f3e99ab 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -620,6 +620,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
> * we repeatedly block for 1-millisecond time periods.
> */
> #define SRCU_RETRY_CHECK_DELAY 5
> +#define SRCU_RETRY_CHECK_LONG_DELAY 5
>
> /*
> * Start an SRCU grace period.
> @@ -927,12 +928,17 @@ static void srcu_funnel_gp_start(struct
> srcu_struct *ssp, struct srcu_data *sdp,
> */
> static bool try_check_zero(struct srcu_struct *ssp, int idx, int
> trycount)
> {
> + unsigned long curdelay;
> + curdelay = !srcu_get_delay(ssp);
> for (;;) {
> if (srcu_readers_active_idx_check(ssp, idx))
> return true;
> - if (--trycount + !srcu_get_delay(ssp) <= 0)
> + if (--trycount + curdelay <= 0)
> return false;
> - udelay(SRCU_RETRY_CHECK_DELAY);
> + if (trycount)
> + udelay(SRCU_RETRY_CHECK_DELAY);
> + else
> + udelay(SRCU_RETRY_CHECK_LONG_DELAY);
> }
> }
>
11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
real 0m11.443s
user 0m3.022s
sys 0m1.052s
fix10: fix4 + SRCU_MAX_NODELAY_PHASE 500
real 0m11.401s
user 0m2.798s
sys 0m1.328s
fix11: fix4 + SRCU_MAX_NODELAY_PHASE 250
real 0m15.748s
user 0m2.781s
sys 0m1.294s
fix12: fix4 + SRCU_MAX_NODELAY_PHASE 200
real 0m20.704s ~21s
user 0m2.954s
sys 0m1.226s
fix13: fix4 + SRCU_MAX_NODELAY_PHASE 150
real 0m25.151s
user 0m2.980s
sys 0m1.256s
fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
real 0m30.053s ~ 32s
user 0m2.827s
sys 0m1.161s
fix14: fix4 + SRCU_MAX_NODELAY_PHASE 100 + SRCU_RETRY_CHECK_LONG_DELAY 5
real 0m19.263s
user 0m3.018s
sys 0m1.211s
fix15: fix4 + SRCU_MAX_NODELAY_PHASE 100 +
SRCU_RETRY_CHECK_LONG_DELAY 100
real 0m9.347s
user 0m3.132s
sys 0m1.041s
And Shameer suggests this method, which decreases the region_add/del count
from 6000+ to 200+; it also works on 5.19-rc1:
Make the EFI flash image file
$ dd if=/dev/zero of=flash0.img bs=1M count=64
$ dd if=./QEMU_EFI-2022.fd of=flash0.img conv=notrunc
$ dd if=/dev/zero of=flash1.img bs=1M count=64
Include the line below instead of "-bios QEMU_EFI.fd" in the QEMU command line.
-pflash flash0.img -pflash flash1.img \
Thanks
>
> Thanks
> Neeraj
>
>> Thanks
>>
>>>
>>>> More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5
>>> And thank you for these details.
>>>
>>> Thanx, Paul
>>
Hi,
On 6/18/2022 8:37 AM, [email protected] wrote:
>
>
> On 2022/6/15 6:40 PM, Neeraj Upadhyay wrote:
>> Hi,
>>
>> On 6/15/2022 2:33 PM, [email protected] wrote:
>>>
>>>
>>> On 2022/6/14 10:17 PM, Paul E. McKenney wrote:
>>>> On Tue, Jun 14, 2022 at 10:03:35PM +0800, [email protected]
>>>> wrote:
>>>>>
>>>>> On 2022/6/14 8:19 PM, Neeraj Upadhyay wrote:
>>>>>>> 5.18-rc4 based ~8sec
>>>>>>>
>>>>>>> 5.19-rc1 ~2m43sec
>>>>>>>
>>>>>>> 5.19-rc1+fix1 ~19sec
>>>>>>>
>>>>>>> 5.19-rc1-fix2 ~19sec
>>>>>>>
>>>>>> If you try below diff on top of either 5.19-rc1+fix1 or
>>>>>> 5.19-rc1-fix2 ;
>>>>>> does it show any difference in boot time?
>>>>>>
>>>>>> --- a/kernel/rcu/srcutree.c
>>>>>> +++ b/kernel/rcu/srcutree.c
>>>>>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct
>>>>>> srcu_struct
>>>>>> *ssp, struct srcu_node *snp
>>>>>> */
>>>>>> static void srcu_gp_end(struct srcu_struct *ssp)
>>>>>> {
>>>>>> - unsigned long cbdelay;
>>>>>> + unsigned long cbdelay = 1;
>>>>>> bool cbs;
>>>>>> bool last_lvl;
>>>>>> int cpu;
>>>>>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
>>>>>> spin_lock_irq_rcu_node(ssp);
>>>>>> idx = rcu_seq_state(ssp->srcu_gp_seq);
>>>>>> WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
>>>>>> - cbdelay = !!srcu_get_delay(ssp);
>>>>>> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
>>>>>> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>>>>> + cbdelay = 0;
>>>>>> +
>>>>>> WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
>>>> Thank you both for the testing and the proposed fix!
>>>>
>>>>> Test here:
>>>>> qemu: https://github.com/qemu/qemu/tree/stable-6.1
>>>>> kernel:
>>>>> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test
>>>>>
>>>>> (in case test patch not clear, push in git tree)
>>>>>
>>>>> Hardware: aarch64
>>>>>
>>>>> 1. 5.18-rc6
>>>>> real 0m8.402s
>>>>> user 0m3.015s
>>>>> sys 0m1.102s
>>>>>
>>>>> 2. 5.19-rc1
>>>>> real 2m41.433s
>>>>> user 0m3.097s
>>>>> sys 0m1.177s
>>>>>
>>>>> 3. 5.19-rc1 + fix1 from Paul
>>>>> real 2m43.404s
>>>>> user 0m2.880s
>>>>> sys 0m1.214s
>>>>>
>>>>> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
>>>>> real 0m15.262s
>>>>> user 0m3.003s
>>>>> sys 0m1.033s
>>>>>
>>>>> When build kernel in the meantime, load time become longer.
>>>>>
>>>>> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
>>>>> real 0m15.215s
>>>>> user 0m2.942s
>>>>> sys 0m1.172s
>>>>>
>>>>> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end
>>>>> real 1m23.936s
>>>>> user 0m2.969s
>>>>> sys 0m1.181s
>>>> And thank you for the testing!
>>>>
>>>> Could you please try fix3 + Neeraj's change of srcu_gp_end?
>>>>
>>>> That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of
>>>> srcu_gp_end.
>>>>
>>>> Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot
>>>> times start rising? This is probably best done by starting with
>>>> SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run
>>>> until boot time becomes slow, followed by a binary search between the
>>>> last two values. (The idea is to bias the search so that fast boot
>>>> times are the common case.)
>>>
>>> SRCU_MAX_NODELAY_PHASE 100 becomes slower.
>>>
>>>
>>> 8. 5.19-rc1 + fix6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000
>>>
>>> real 0m11.154s ~12s
>>>
>>> user 0m2.919s
>>>
>>> sys 0m1.064s
>>>
>>>
>>>
>>> 9. 5.19-rc1 + fix7: fix4 + SRCU_MAX_NODELAY_PHASE 10000
>>>
>>> real 0m11.258s
>>>
>>> user 0m3.113s
>>>
>>> sys 0m1.073s
>>>
>>>
>>>
>>> 10. 5.19-rc1 + fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
>>>
>>> real 0m30.053s ~ 32s
>>>
>>> user 0m2.827s
>>>
>>> sys 0m1.161s
>>>
>>>
>>>
>>> By the way, if build kernel on the board in the meantime (using
>>> memory), time become much longer.
>>>
>>> real 1m2.763s
>>>
>>>
>>>
>>> 11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
>>>
>>> real 0m11.443s
>>>
>>> user 0m3.022s
>>>
>>> sys 0m1.052s
>>>
>>>
>>
>> This is useful data, thanks! Did you get chance to check between 100
>> and 1000, to narrow down further, from which point (does need to be
>> exact value) between 100 and 1000, you start seeing degradation at,
>> for ex. 250, 500 , ...?
>>
>> Is it also possible to try experiment 10 and 11 with below diff.
>> What I have done in below diff is, call srcu_get_delay() only once
>> in try_check_zero() (and not for every loop iteration); also
>> retry with a different delay for the extra iteration which is done
>> when srcu_get_delay(ssp) returns 0.
>>
>> Once we have this data, can you also try by changing
>> SRCU_RETRY_CHECK_LONG_DELAY to 100, on top of below diff.
>>
>> #define SRCU_RETRY_CHECK_LONG_DELAY 100
>>
>> -------------------------------------------------------------------------
>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>> index 6a354368ac1d..3aff2f3e99ab 100644
>> --- a/kernel/rcu/srcutree.c
>> +++ b/kernel/rcu/srcutree.c
>> @@ -620,6 +620,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>> * we repeatedly block for 1-millisecond time periods.
>> */
>> #define SRCU_RETRY_CHECK_DELAY 5
>> +#define SRCU_RETRY_CHECK_LONG_DELAY 5
>>
>> /*
>> * Start an SRCU grace period.
>> @@ -927,12 +928,17 @@ static void srcu_funnel_gp_start(struct
>> srcu_struct *ssp, struct srcu_data *sdp,
>> */
>> static bool try_check_zero(struct srcu_struct *ssp, int idx, int
>> trycount)
>> {
>> + unsigned long curdelay;
>> + curdelay = !srcu_get_delay(ssp);
>> for (;;) {
>> if (srcu_readers_active_idx_check(ssp, idx))
>> return true;
>> - if (--trycount + !srcu_get_delay(ssp) <= 0)
>> + if (--trycount + curdelay <= 0)
>> return false;
>> - udelay(SRCU_RETRY_CHECK_DELAY);
>> + if (trycount)
>> + udelay(SRCU_RETRY_CHECK_DELAY);
>> + else
>> + udelay(SRCU_RETRY_CHECK_LONG_DELAY);
>> }
>> }
>>
>
> 11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
> real 0m11.443s
> user 0m3.022s
> sys 0m1.052s
>
> fix10: fix4 + SRCU_MAX_NODELAY_PHASE 500
>
> real 0m11.401s
> user 0m2.798s
> sys 0m1.328s
>
>
> fix11: fix4 + SRCU_MAX_NODELAY_PHASE 250
>
> real 0m15.748s
> user 0m2.781s
> sys 0m1.294s
>
>
> fix12: fix4 + SRCU_MAX_NODELAY_PHASE 200
>
> real 0m20.704s 21
> user 0m2.954s
> sys 0m1.226s
>
> fix13: fix4 + SRCU_MAX_NODELAY_PHASE 150
>
> real 0m25.151s
> user 0m2.980s
> sys 0m1.256s
>
>
> fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
> real 0m30.053s ~ 32s
> user 0m2.827s
> sys 0m1.161s
>
>
> fix14: fix4 + SRCU_MAX_NODELAY_PHASE 100 + SRCU_RETRY_CHECK_LONG_DELAY 5
>
> real 0m19.263s
> user 0m3.018s
> sys 0m1.211s
>
>
>
> fix15: fix4 + SRCU_MAX_NODELAY_PHASE 100 +
> SRCU_RETRY_CHECK_LONG_DELAY 100
>
> real 0m9.347s
> user 0m3.132s
> sys 0m1.041s
>
>
Thanks. From the data and the experiments done, it looks to me that we get
timings comparable to 5.18-rc4 when we retry without sleeping for a
duration close to 4-5 ms, which is close to one tick period at the
configured HZ (as it is 250). Is it possible to try the configurations
below on top of fix15?
If possible, can you try with both HZ_1000 and HZ_250?
As multiple fixes are getting combined in the experiments, for clarity,
please also share the diff of srcutree.c (on top of the baseline) for all
experiments.
16. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) +
(long delay scaled to 1 jiffy)
#define SRCU_MAX_NODELAY_TRY_CHECK_PHASE 10
#define SRCU_MAX_NODELAY_PHASE (SRCU_MAX_NODELAY_TRY_CHECK_PHASE * 2)
#define SRCU_RETRY_CHECK_LONG_DELAY \
(USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE)
17. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) +
(long delay scaled to 2 jiffy)
#define SRCU_RETRY_CHECK_LONG_DELAY \
(2 * USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE)
18. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) +
(long delay scaled to 1/2 jiffy)
#define SRCU_RETRY_CHECK_LONG_DELAY \
(USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE / 2)
Thanks
Neeraj
> And Shameer suggests this method, to decrease region_add/del time from
> 6000+ to 200+, also works on 5.19-rc1
>
> Make the EFI flash image file
> $ dd if=/dev/zero of=flash0.img bs=1M count=64
> $ dd if=./QEMU_EFI-2022.fd of=flash0.img conv=notrunc
> $ dd if=/dev/zero of=flash1.img bs=1M count=64
> Include the below line instead of "-bios QEMU_EFI.fd" in Qemu cmd line.
> -pflash flash0.img -pflash flash1.img \
>
Thanks for sharing this workaround info!
>
>
> Thanks
>
>>
>> Thanks
>> Neeraj
>>
>>> Thanks
>>>
>>>>
>>>>> More test details: https://docs.qq.com/doc/DRXdKalFPTVlUbFN5
>>>> And thank you for these details.
>>>>
>>>> Thanx, Paul
>>>
>
On 2022/6/20 3:50 PM, Neeraj Upadhyay wrote:
> Hi,
>
>
> On 6/18/2022 8:37 AM, [email protected] wrote:
>>
>>
>> On 2022/6/15 6:40 PM, Neeraj Upadhyay wrote:
>>> Hi,
>>>
>>> On 6/15/2022 2:33 PM, [email protected] wrote:
>>>>
>>>>
>>>> On 2022/6/14 10:17 PM, Paul E. McKenney wrote:
>>>>> On Tue, Jun 14, 2022 at 10:03:35PM +0800, [email protected]
>>>>> wrote:
>>>>>>
>>>>>> On 2022/6/14 8:19 PM, Neeraj Upadhyay wrote:
>>>>>>>> 5.18-rc4 based ~8sec
>>>>>>>>
>>>>>>>> 5.19-rc1 ~2m43sec
>>>>>>>>
>>>>>>>> 5.19-rc1+fix1 ~19sec
>>>>>>>>
>>>>>>>> 5.19-rc1-fix2 ~19sec
>>>>>>>>
>>>>>>> If you try below diff on top of either 5.19-rc1+fix1 or
>>>>>>> 5.19-rc1-fix2 ;
>>>>>>> does it show any difference in boot time?
>>>>>>>
>>>>>>> --- a/kernel/rcu/srcutree.c
>>>>>>> +++ b/kernel/rcu/srcutree.c
>>>>>>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct
>>>>>>> srcu_struct
>>>>>>> *ssp, struct srcu_node *snp
>>>>>>> */
>>>>>>> static void srcu_gp_end(struct srcu_struct *ssp)
>>>>>>> {
>>>>>>> - unsigned long cbdelay;
>>>>>>> + unsigned long cbdelay = 1;
>>>>>>> bool cbs;
>>>>>>> bool last_lvl;
>>>>>>> int cpu;
>>>>>>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct
>>>>>>> *ssp)
>>>>>>> spin_lock_irq_rcu_node(ssp);
>>>>>>> idx = rcu_seq_state(ssp->srcu_gp_seq);
>>>>>>> WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
>>>>>>> - cbdelay = !!srcu_get_delay(ssp);
>>>>>>> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
>>>>>>> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>>>>>> + cbdelay = 0;
>>>>>>> +
>>>>>>> WRITE_ONCE(ssp->srcu_last_gp_end,
>>>>>>> ktime_get_mono_fast_ns());
>>>>> Thank you both for the testing and the proposed fix!
>>>>>
>>>>>> Test here:
>>>>>> qemu: https://github.com/qemu/qemu/tree/stable-6.1
>>>>>> kernel:
>>>>>> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test
>>>>>>
>>>>>> (in case test patch not clear, push in git tree)
>>>>>>
>>>>>> Hardware: aarch64
>>>>>>
>>>>>> 1. 5.18-rc6
>>>>>> real 0m8.402s
>>>>>> user 0m3.015s
>>>>>> sys 0m1.102s
>>>>>>
>>>>>> 2. 5.19-rc1
>>>>>> real 2m41.433s
>>>>>> user 0m3.097s
>>>>>> sys 0m1.177s
>>>>>>
>>>>>> 3. 5.19-rc1 + fix1 from Paul
>>>>>> real 2m43.404s
>>>>>> user 0m2.880s
>>>>>> sys 0m1.214s
>>>>>>
>>>>>> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
>>>>>> real 0m15.262s
>>>>>> user 0m3.003s
>>>>>> sys 0m1.033s
>>>>>>
>>>>>> When build kernel in the meantime, load time become longer.
>>>>>>
>>>>>> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
>>>>>> real 0m15.215s
>>>>>> user 0m2.942s
>>>>>> sys 0m1.172s
>>>>>>
>>>>>> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end
>>>>>> real 1m23.936s
>>>>>> user 0m2.969s
>>>>>> sys 0m1.181s
>>>>> And thank you for the testing!
>>>>>
>>>>> Could you please try fix3 + Neeraj's change of srcu_gp_end?
>>>>>
>>>>> That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of
>>>>> srcu_gp_end.
>>>>>
>>>>> Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot
>>>>> times start rising? This is probably best done by starting with
>>>>> SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run
>>>>> until boot time becomes slow, followed by a binary search between the
>>>>> last two values. (The idea is to bias the search so that fast boot
>>>>> times are the common case.)
>>>>
>>>> SRCU_MAX_NODELAY_PHASE 100 becomes slower.
>>>>
>>>>
>>>> 8. 5.19-rc1 + fix6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000
>>>>
>>>> real 0m11.154s ~12s
>>>>
>>>> user 0m2.919s
>>>>
>>>> sys 0m1.064s
>>>>
>>>>
>>>>
>>>> 9. 5.19-rc1 + fix7: fix4 + SRCU_MAX_NODELAY_PHASE 10000
>>>>
>>>> real 0m11.258s
>>>>
>>>> user 0m3.113s
>>>>
>>>> sys 0m1.073s
>>>>
>>>>
>>>>
>>>> 10. 5.19-rc1 + fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
>>>>
>>>> real 0m30.053s ~ 32s
>>>>
>>>> user 0m2.827s
>>>>
>>>> sys 0m1.161s
>>>>
>>>>
>>>>
>>>> By the way, if build kernel on the board in the meantime (using
>>>> memory), time become much longer.
>>>>
>>>> real 1m2.763s
>>>>
>>>>
>>>>
>>>> 11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
>>>>
>>>> real 0m11.443s
>>>>
>>>> user 0m3.022s
>>>>
>>>> sys 0m1.052s
>>>>
>>>>
>>>
>>> This is useful data, thanks! Did you get chance to check between 100
>>> and 1000, to narrow down further, from which point (does need to be
>>> exact value) between 100 and 1000, you start seeing degradation at,
>>> for ex. 250, 500 , ...?
>>>
>>> Is it also possible to try experiment 10 and 11 with below diff.
>>> What I have done in below diff is, call srcu_get_delay() only once
>>> in try_check_zero() (and not for every loop iteration); also
>>> retry with a different delay for the extra iteration which is done
>>> when srcu_get_delay(ssp) returns 0.
>>>
>>> Once we have this data, can you also try by changing
>>> SRCU_RETRY_CHECK_LONG_DELAY to 100, on top of below diff.
>>>
>>> #define SRCU_RETRY_CHECK_LONG_DELAY 100
>>>
>>> -------------------------------------------------------------------------
>>>
>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>>> index 6a354368ac1d..3aff2f3e99ab 100644
>>> --- a/kernel/rcu/srcutree.c
>>> +++ b/kernel/rcu/srcutree.c
>>> @@ -620,6 +620,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>>> * we repeatedly block for 1-millisecond time periods.
>>> */
>>> #define SRCU_RETRY_CHECK_DELAY 5
>>> +#define SRCU_RETRY_CHECK_LONG_DELAY 5
>>>
>>> /*
>>> * Start an SRCU grace period.
>>> @@ -927,12 +928,17 @@ static void srcu_funnel_gp_start(struct
>>> srcu_struct *ssp, struct srcu_data *sdp,
>>> */
>>> static bool try_check_zero(struct srcu_struct *ssp, int idx, int
>>> trycount)
>>> {
>>> + unsigned long curdelay;
>>> + curdelay = !srcu_get_delay(ssp);
>>> for (;;) {
>>> if (srcu_readers_active_idx_check(ssp, idx))
>>> return true;
>>> - if (--trycount + !srcu_get_delay(ssp) <= 0)
>>> + if (--trycount + curdelay <= 0)
>>> return false;
>>> - udelay(SRCU_RETRY_CHECK_DELAY);
>>> + if (trycount)
>>> + udelay(SRCU_RETRY_CHECK_DELAY);
>>> + else
>>> + udelay(SRCU_RETRY_CHECK_LONG_DELAY);
>>> }
>>> }
>>>
>>
>> 11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
>> real 0m11.443s
>> user 0m3.022s
>> sys 0m1.052s
>>
>> fix10: fix4 + SRCU_MAX_NODELAY_PHASE 500
>>
>> real 0m11.401s
>> user 0m2.798s
>> sys 0m1.328s
>>
>>
>> fix11: fix4 + SRCU_MAX_NODELAY_PHASE 250
>>
>> real 0m15.748s
>> user 0m2.781s
>> sys 0m1.294s
>>
>>
>> fix12: fix4 + SRCU_MAX_NODELAY_PHASE 200
>>
>> real 0m20.704s 21
>> user 0m2.954s
>> sys 0m1.226s
>>
>> fix13: fix4 + SRCU_MAX_NODELAY_PHASE 150
>>
>> real 0m25.151s
>> user 0m2.980s
>> sys 0m1.256s
>>
>>
>> fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
>> real 0m30.053s ~ 32s
>> user 0m2.827s
>> sys 0m1.161s
>>
>>
>> fix14: fix4 + SRCU_MAX_NODELAY_PHASE 100 + SRCU_RETRY_CHECK_LONG_DELAY 5
>>
>> real 0m19.263s
>> user 0m3.018s
>> sys 0m1.211s
>>
>>
>>
>> fix15: fix4 + SRCU_MAX_NODELAY_PHASE 100 +
>> SRCU_RETRY_CHECK_LONG_DELAY 100
>>
>> real 0m9.347s
>> user 0m3.132s
>> sys 0m1.041s
>>
>>
>
> Thanks. From the data and experiments done, looks to me that we get
> comparable (to 5.18-rc4 ) timings, when we retry without sleep for
> time duration close to 4-5 ms, which could be closer to the configured
> HZ (as it is 250)? Is it possible to try below configuration on top
> of fix15?
> If possible can you try with both HZ_1000 and HZ_250?
> As multiple fixes are getting combined in experiments, for clarity,
> please also share the diff of srcutree.c (on top of baseline) for all
> experiments.
>
> 16. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) +
> (long delay scaled to 1 jiffy)
>
>
> #define SRCU_MAX_NODELAY_TRY_CHECK_PHASE 10
> #define SRCU_MAX_NODELAY_PHASE (SRCU_MAX_NODELAY_TRY_CHECK_PHASE * 2)
> #define SRCU_RETRY_CHECK_LONG_DELAY \
> (USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE)
>
>
> 17. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) +
> (long delay scaled to 2 jiffy)
>
> #define SRCU_RETRY_CHECK_LONG_DELAY \
> (2 * USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE)
>
> 18. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) +
> (long delay scaled to 1/2 jiffy)
>
> #define SRCU_RETRY_CHECK_LONG_DELAY \
> (USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE / 2)
fix16: fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero()
calls) + (long delay scaled to 1 jiffy)
real 0m10.120s
user 0m3.885s
sys 0m1.040s
fix17: fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero()
calls) + (long delay scaled to 2 jiffy)
real 0m9.851s
user 0m3.886s
sys 0m1.011s
fix18: fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero()
calls) + (long delay scaled to 1/2 jiffy)
real 0m9.741s
user 0m3.837s
sys 0m1.060s
Code pushed to
https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test