Date: Wed, 15 Jun 2022 09:29:11 +0100
Message-ID: <8735g649ew.wl-maz@kernel.org>
From: Marc Zyngier <maz@kernel.org>
To: "zhangfei.gao@foxmail.com" <zhangfei.gao@foxmail.com>,
    "Paul E. McKenney" <paulmck@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
    Zhangfei Gao <zhangfei.gao@linaro.org>,
    "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
    rcu@vger.kernel.org,
    Lai Jiangshan <jiangshanlai@gmail.com>,
    Josh Triplett <josh@joshtriplett.org>,
    Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
    Matthew Wilcox <willy@infradead.org>,
    Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>,
    mtosatti@redhat.com,
    Auger Eric <eric.auger@redhat.com>
Subject: Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow
References: <20220611165956.GO1790663@paulmck-ThinkPad-P17-Gen-1>
	<20220612133631.GQ1790663@paulmck-ThinkPad-P17-Gen-1>
	<20220612162029.GR1790663@paulmck-ThinkPad-P17-Gen-1>
	<20220612164002.GA1242564@paulmck-ThinkPad-P17-Gen-1>
	<560f7d27-fe38-0db9-834a-50dda5fa6157@redhat.com>
	<20220612184944.GU1790663@paulmck-ThinkPad-P17-Gen-1>
	<20220613035711.GY1790663@paulmck-ThinkPad-P17-Gen-1>
	<20220613041652.GA3976000@paulmck-ThinkPad-P17-Gen-1>

On Mon, 13 Jun 2022 07:55:47 +0100,
"zhangfei.gao@foxmail.com" <zhangfei.gao@foxmail.com> wrote:
> 
> Hi, Paul
> 
> On 2022/6/13 12:16 PM, Paul E. McKenney wrote:
> > On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote:
> >> On Mon, Jun 13, 2022 at 11:04:39AM +0800, zhangfei.gao@foxmail.com wrote:
> >>> Hi, Paul
> >>> 
> >>> On 2022/6/13 2:49 AM, Paul E. McKenney wrote:
> >>>> On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
> >>>>> On 6/12/22 18:40, Paul E. McKenney wrote:
> >>>>>>> Do these reserved memory regions really need to be allocated separately?
> >>>>>>> (For example, are they really all non-contiguous?  If not, that is, if
> >>>>>>> there are a lot of contiguous memory regions, could you sort the IORT
> >>>>>>> by address and do one ioctl() for each set of contiguous memory regions?)
> >>>>>>> 
> >>>>>>> Are all of these reserved memory regions set up before init is spawned?
> >>>>>>> 
> >>>>>>> Are all of these reserved memory regions set up while there is only a
> >>>>>>> single vCPU up and running?
> >>>>>>> 
> >>>>>>> Is the SRCU grace period really needed in this case?  (I freely confess
> >>>>>>> to not being all that familiar with KVM.)
> >>>>>> Oh, and there was a similar many-requests problem with networking many
> >>>>>> years ago.  This was solved by adding a new syscall/ioctl()/whatever
> >>>>>> that permitted many requests to be presented to the kernel with a single
> >>>>>> system call.
> >>>>>> 
> >>>>>> Could a new ioctl() be introduced that requested a large number
> >>>>>> of these memory regions in one go so as to make each call to
> >>>>>> synchronize_rcu_expedited() cover a useful fraction of your 9000+
> >>>>>> requests?  Adding a few of the KVM guys on CC for their thoughts.
> >>>>> Unfortunately not.  Apart from this specific case, in general the calls to
> >>>>> KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
> >>>>> guest, and those writes then map to an ioctl.  Typically the guest sets up
> >>>>> one device at a time, and each setup step causes a synchronize_srcu()---and
> >>>>> an expedited one at that.
> >>>> I was afraid of something like that...
> >>>> 
> >>>>> KVM has two SRCUs:
> >>>>> 
> >>>>> 1) kvm->irq_srcu hardly relies on the "sleepable" part; it has readers
> >>>>> that are very, very small, but it needs extremely fast detection of grace
> >>>>> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
> >>>>> KVM_SET_GSI_ROUTING", 2014-05-05), which split it off kvm->srcu.  Readers
> >>>>> are not so frequent.
> >>>>> 
> >>>>> 2) kvm->srcu is nastier because there are readers all the time.  The
> >>>>> read-side critical sections are still short-ish, but they need the
> >>>>> sleepable part because they access user memory.
> >>>> Which one of these two is in play in this case?
> >>>> 
> >>>>> Writers are not frequent per se; the problem is that they come in very
> >>>>> large bursts when a guest boots.  And while the whole boot path overall
> >>>>> can be quadratic, O(n) expensive calls to synchronize_srcu() can have a
> >>>>> larger impact on runtime than the O(n^2) parts, as demonstrated here.
> >>>>> 
> >>>>> Therefore, we operated on the assumption that the callers of
> >>>>> synchronize_srcu_expedited() were _anyway_ busy running CPU-bound guest
> >>>>> code and the desire was to get past the booting phase as fast as possible.
> >>>>> If the guest wants to eat host CPU it can "for(;;)" as much as it wants;
> >>>>> therefore, as long as expedited GPs didn't eat CPU *throughout the whole
> >>>>> system*, a preemptible busy wait in synchronize_srcu_expedited() was not
> >>>>> problematic.
> >>>>> 
> >>>>> These assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu
> >>>>> were introduced (in 2009 and 2014 respectively).  But perhaps they do not
> >>>>> hold anymore, now that each SRCU is not as independent as it used to be in
> >>>>> those years and they use workqueues instead?
> >>>> The problem was not internal to SRCU, but rather due to the fact
> >>>> that kernel live patching (KLP) had problems with the CPU-bound tasks
> >>>> resulting from repeated synchronize_rcu_expedited() invocations.  So I
> >>>> added heuristics to get the occasional sleep in there for KLP's benefit.
> >>>> Perhaps these heuristics need to be less aggressive about adding sleep.
> >>>> 
> >>>> These heuristics have these aspects:
> >>>> 
> >>>> 1.	The longer readers persist in an expedited SRCU grace period,
> >>>>	the longer the wait between successive checks of the reader
> >>>>	state.  Roughly speaking, we wait as long as the grace period
> >>>>	has currently been in effect, capped at ten jiffies.
> >>>> 
> >>>> 2.	SRCU grace periods have several phases.  We reset so that each
> >>>>	phase starts by not waiting (new phase, new set of readers,
> >>>>	so don't penalize this set for the sins of the previous set).
> >>>>	But once we get to the point of adding delay, we add the
> >>>>	delay based on the beginning of the full grace period.
> >>>> 
> >>>> Right now, the checking for grace-period length does not allow for the
> >>>> possibility that a grace period might start just before the jiffies
> >>>> counter gets incremented (because I didn't realize that anyone cared),
> >>>> so that is one possible thing to change.  I can also allow more no-delay
> >>>> checks per SRCU grace-period phase.
> >>>> 
> >>>> Zhangfei, does something like the patch shown below help?
> >>>> 
> >>>> Additional adjustments are likely needed to avoid re-breaking KLP,
> >>>> but we have to start somewhere...
> >>>> 
> >>>> 							Thanx, Paul
> >>>> 
> >>>> ------------------------------------------------------------------------
> >>>> 
> >>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> >>>> index 50ba70f019dea..6a354368ac1d1 100644
> >>>> --- a/kernel/rcu/srcutree.c
> >>>> +++ b/kernel/rcu/srcutree.c
> >>>> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> >>>>  #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
> >>>>  #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
> >>>> -#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
> >>>> +#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
> >>>>  #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
> >>>> 
> >>>>  /*
> >>>> @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> >>>>   */
> >>>>  static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> >>>>  {
> >>>> +	unsigned long gpstart;
> >>>> +	unsigned long j;
> >>>>  	unsigned long jbase = SRCU_INTERVAL;
> >>>>  	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> >>>>  		jbase = 0;
> >>>> -	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> >>>> -		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> >>>> +	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> >>>> +		j = jiffies - 1;
> >>>> +		gpstart = READ_ONCE(ssp->srcu_gp_start);
> >>>> +		if (time_after(j, gpstart))
> >>>> +			jbase += j - gpstart;
> >>>> +	}
> >>>>  	if (!jbase) {
> >>>>  		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> >>>>  		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> >>> Unfortunately, this patch does not help.
> >>> 
> >>> So I re-added the debug info.
> >>> 
> >>> During the qemu boot:
> >>> [  232.997667]  __synchronize_srcu loop=1000
> >>> 
> >>> [  361.094493]  __synchronize_srcu loop=9000
> >>> [  361.094501] Call trace:
> >>> [  361.094502]  dump_backtrace+0xe4/0xf0
> >>> [  361.094505]  show_stack+0x20/0x70
> >>> [  361.094507]  dump_stack_lvl+0x8c/0xb8
> >>> [  361.094509]  dump_stack+0x18/0x34
> >>> [  361.094511]  __synchronize_srcu+0x120/0x128
> >>> [  361.094514]  synchronize_srcu_expedited+0x2c/0x40
> >>> [  361.094515]  kvm_swap_active_memslots+0x130/0x198
> >>> [  361.094519]  kvm_activate_memslot+0x40/0x68
> >>> [  361.094520]  kvm_set_memslot+0x2f8/0x3b0
> >>> [  361.094523]  __kvm_set_memory_region+0x2e4/0x438
> >>> [  361.094524]  kvm_set_memory_region+0x78/0xb8
> >>> [  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
> >>> [  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
> >>> [  361.094530]  invoke_syscall+0x4c/0x110
> >>> [  361.094533]  el0_svc_common.constprop.0+0x68/0x128
> >>> [  361.094536]  do_el0_svc+0x34/0xc0
> >>> [  361.094538]  el0_svc+0x30/0x98
> >>> [  361.094541]  el0t_64_sync_handler+0xb8/0xc0
> >>> [  361.094544]  el0t_64_sync+0x18c/0x190
> >>> [  363.942817]  kvm_set_memory_region loop=6000
> >> Huh.
> >> 
> >> One possibility is that the "if (!jbase)" block needs to be nested
> >> within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block.
> 
> I tested this diff and it did NOT help:
> 
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 50ba70f019de..36286a4b74e6 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> 
>  #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
>  #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
> -#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
> +#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
>  #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
> 
>  /*
> @@ -522,16 +522,23 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>   */
>  static unsigned long srcu_get_delay(struct srcu_struct *ssp)
>  {
> +	unsigned long gpstart;
> +	unsigned long j;
>  	unsigned long jbase = SRCU_INTERVAL;
> 
>  	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>  		jbase = 0;
> -	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> -		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> -	if (!jbase) {
> -		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> -		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> -			jbase = 1;
> +	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> +		j = jiffies - 1;
> +		gpstart = READ_ONCE(ssp->srcu_gp_start);
> +		if (time_after(j, gpstart))
> +			jbase += j - gpstart;
> +
> +		if (!jbase) {
> +			WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> +			if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> +				jbase = 1;
> +		}
>  	}
> 
> > And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the
> > above change reduces the overhead by more than an order of magnitude.
> > Except that the overhead of the series is far less than one second,
> > not the several minutes that you are seeing.  So the per-call overhead
> > decreases from about 17 microseconds to a bit more than one microsecond.
> > 
> > I could imagine an extra order of magnitude if you are running HZ=100
> > instead of the HZ=1000 that I am running.  But that only gets up to a
> > few seconds.
> > 
> >> One additional debug is to apply the patch below on top of the one you
> apply the patch below?
> >> just now kindly tested, then use whatever debug technique you wish to
> >> work out what fraction of the time during that critical interval that
> >> srcu_get_delay() returns non-zero.
> Sorry, I am confused, there is no patch, right?
> Just measure how often srcu_get_delay() returns non-zero?
> 
> 
> By the way, the issue should only be related to QEMU ACPI, not to the
> RMR feature.

No, this also occurs if you supply the guest's EFI with an empty set
of persistent variables. EFI goes and zeroes it, which results in a
read-only memslot write access being taken to userspace, the memslot
being unmapped from the guest, QEMU doing a little dance, and
eventually restoring the memslot back to the guest. Rinse, repeat.
Do that one byte at a time over 64MB, and your boot time for EFI alone
goes from 39s to 3m50s (that's on a speed-challenged Synquacer box),
which completely kills the "deploy a new VM" use case.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.
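
------------------------------------------------------------------------

For anyone who wants to play with the heuristic being tuned above outside
the kernel, the sketch below reproduces the shape of the patched
srcu_get_delay() logic in plain user-space C. It is only an illustration:
fake_srcu, sketch_get_delay(), the bit-mask test standing in for
rcu_seq_state(), the plain comparisons standing in for ULONG_CMP_LT() and
time_after(), and the explicit SRCU_MAX_INTERVAL cap (taken from Paul's
"capped at ten jiffies" description) are all simplified stand-ins, not the
actual kernel code or API.

/* sketch.c: user-space illustration of the srcu_get_delay() heuristic. */
#include <stdio.h>

#define SRCU_INTERVAL		1	/* Base delay if no expedited GPs pending. */
#define SRCU_MAX_INTERVAL	10	/* Maximum incremental delay from slow readers. */
#define SRCU_MAX_NODELAY_PHASE	3	/* Max consecutive no-delay polls per GP phase. */

struct fake_srcu {			/* stand-in for the srcu_struct fields used here */
	unsigned long gp_seq;		/* low bits non-zero => a grace period is in flight */
	unsigned long gp_seq_needed_exp;/* most recently requested expedited GP */
	unsigned long gp_start;		/* "jiffies" value at which the current GP started */
	unsigned long n_exp_nodelay;	/* consecutive no-delay polls in this phase */
};

static unsigned long sketch_get_delay(struct fake_srcu *ssp, unsigned long jiffies_now)
{
	unsigned long jbase = SRCU_INTERVAL;

	/* An expedited GP is still wanted: start from a zero delay. */
	if (ssp->gp_seq < ssp->gp_seq_needed_exp)
		jbase = 0;

	/* A GP is in progress: scale the delay with how long it has been running. */
	if (ssp->gp_seq & 0x3) {
		unsigned long j = jiffies_now - 1;

		if (j > ssp->gp_start)	/* don't charge a GP that started "in the future" */
			jbase += j - ssp->gp_start;

		/* Allow only a few consecutive no-delay polls, then back off to 1. */
		if (!jbase && ++ssp->n_exp_nodelay > SRCU_MAX_NODELAY_PHASE)
			jbase = 1;
	}
	return jbase < SRCU_MAX_INTERVAL ? jbase : SRCU_MAX_INTERVAL;
}

int main(void)
{
	struct fake_srcu s = { .gp_seq = 1, .gp_seq_needed_exp = 4, .gp_start = 100 };

	/* Expedited GP in flight: the first few polls are free, then the delay ramps up. */
	for (unsigned long j = 100; j < 115; j++)
		printf("jiffies=%lu  delay=%lu\n", j, sketch_get_delay(&s, j));
	return 0;
}

Nothing here is a drop-in for the kernel code; it only shows why bumping
SRCU_MAX_NODELAY_PHASE changes how many free polls an expedited grace
period gets before per-jiffy delays start to accumulate.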