Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18;
From:   Vitaly Kuznetsov <vkuznets@redhat.com>
To:     Michael Kelley <mikelley@microsoft.com>
Cc:     KY Srinivasan <kys@microsoft.com>,
        Haiyang Zhang <haiyangz@microsoft.com>,
        Stephen Hemminger <sthemmin@microsoft.com>,
        "wei.liu@kernel.org" <wei.liu@kernel.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        "linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>,
        Dexuan Cui <decui@microsoft.com>
Subject: RE: ** POTENTIAL FRAUD ALERT - RED HAT ** [PATCH v2 1/1] Drivers:
 hv: vmbus: Increase wait time for VMbus unload
In-Reply-To: <MWHPR21MB15937990D10174A63E65F579D7489@MWHPR21MB1593.namprd21.prod.outlook.com>
References: <1618894089-126662-1-git-send-email-mikelley@microsoft.com>
 <87tuo1i9o5.fsf@vitty.brq.redhat.com>
 <MWHPR21MB15937990D10174A63E65F579D7489@MWHPR21MB1593.namprd21.prod.outlook.com>
Date:   Wed, 21 Apr 2021 10:24:53 +0200
Message-ID: <875z0ghwoa.fsf@vitty.brq.redhat.com>
MIME-Version: 1.0
Content-Type: text/plain
Precedence: bulk

Michael Kelley <mikelley@microsoft.com> writes:

> From: Vitaly Kuznetsov <vkuznets@redhat.com> Sent: Tuesday, April 20, 2021 2:32 AM
>> 
>> Michael Kelley <mikelley@microsoft.com> writes:
>> 
>> > When running in Azure, disks may be connected to a Linux VM with
>> > read/write caching enabled. If a VM panics and issues a VMbus
>> > UNLOAD request to Hyper-V, the response is delayed until all dirty
>> > data in the disk cache is flushed.  In extreme cases, this flushing
>> > can take 10's of seconds, depending on the disk speed and the amount
>> > of dirty data. If kdump is configured for the VM, the current 10 second
>> > timeout in vmbus_wait_for_unload() may be exceeded, and the UNLOAD
>> > complete message may arrive well after the kdump kernel is already
>> > running, causing problems.  Note that no problem occurs if kdump is
>> > not enabled because Hyper-V waits for the cache flush before doing
>> > a reboot through the BIOS/UEFI code.
>> >
>> > Fix this problem by increasing the timeout in vmbus_wait_for_unload()
>> > to 100 seconds. Also output periodic messages so that if anyone is
>> > watching the serial console, they won't think the VM is completely
>> > hung.
>> >
>> > Fixes: 911e1987efc8 ("Drivers: hv: vmbus: Add timeout to vmbus_wait_for_unload")
>> > Signed-off-by: Michael Kelley <mikelley@microsoft.com>
>> > ---
>> >
>> > Changed in v2: Fixed silly error in the argument to mdelay()
>> >
>> > ---
>> >  drivers/hv/channel_mgmt.c | 30 +++++++++++++++++++++++++-----
>> >  1 file changed, 25 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
>> > index f3cf4af..ef4685c 100644
>> > --- a/drivers/hv/channel_mgmt.c
>> > +++ b/drivers/hv/channel_mgmt.c
>> > @@ -755,6 +755,12 @@ static void init_vp_index(struct vmbus_channel *channel)
>> >  	free_cpumask_var(available_mask);
>> >  }
>> >
>> > +#define UNLOAD_DELAY_UNIT_MS	10		/* 10 milliseconds */
>> > +#define UNLOAD_WAIT_MS		(100*1000)	/* 100 seconds */
>> > +#define UNLOAD_WAIT_LOOPS	(UNLOAD_WAIT_MS/UNLOAD_DELAY_UNIT_MS)
>> > +#define UNLOAD_MSG_MS		(5*1000)	/* Every 5 seconds */
>> > +#define UNLOAD_MSG_LOOPS	(UNLOAD_MSG_MS/UNLOAD_DELAY_UNIT_MS)
>> > +
>> >  static void vmbus_wait_for_unload(void)
>> >  {
>> >  	int cpu;
>> > @@ -772,12 +778,17 @@ static void vmbus_wait_for_unload(void)
>> >  	 * vmbus_connection.unload_event. If not, the last thing we can do is
>> >  	 * read message pages for all CPUs directly.
>> >  	 *
>> > -	 * Wait no more than 10 seconds so that the panic path can't get
>> > -	 * hung forever in case the response message isn't seen.
>> > +	 * Wait up to 100 seconds since an Azure host must writeback any dirty
>> > +	 * data in its disk cache before the VMbus UNLOAD request will
>> > +	 * complete. This flushing has been empirically observed to take up
>> > +	 * to 50 seconds in cases with a lot of dirty data, so allow additional
>> > +	 * leeway and for inaccuracies in mdelay(). But eventually time out so
>> > +	 * that the panic path can't get hung forever in case the response
>> > +	 * message isn't seen.
>> 
>> I vaguely remember debugging cases when CHANNELMSG_UNLOAD_RESPONSE never
>> arrives, it was kind of pointless to proceed to kexec as attempts to
>> reconnect Vmbus devices were failing (no devices were offered after
>> CHANNELMSG_REQUESTOFFERS AFAIR). Would it maybe make sense to just do
>> emergency reboot instead of proceeding to kexec when this happens? Just
>> wondering.
>> 
>
> Yes, I think there have been (and maybe still are) situations where we don't
> ever get the UNLOAD response.  But there have been bugs fixed in Hyper-V
> that I think make that less likely.  There's also an unfixed (and maybe not fixable)
> problem when not operating in STIMER Direct Mode, where an old-style
> timer message can block the UNLOAD response message.  But as the world
> moves forward to later kernel versions that use STIMER Direct Mode, that
> also becomes less likely.   So my inclination is to let execution continue on
> the normal execution path, even if the UNLOAD response message isn't
> received.  Maybe we just didn't wait quite long enough (even at 100 seconds).
> It's a judgment call, and it's not clear to me that doing an emergency reboot
> is really any better.
>
> As background work for this patch, we also discovered another bug in Hyper-V.
> If the kdump kernel runs and does a VMbus INITIATE_CONTACT while the
> UNLOAD is still in progress, the Hyper-V code is supposed to wait for the UNLOAD
> to complete, and then commence the VMbus version negotiation.  But it
> doesn't do that -- it finally sends the UNLOAD response, but never does the
> version negotiation, so the kdump kernel hangs forever.  The Hyper-V team
> plans to fix this, and hopefully we'll get a patch deployed in Azure, which
> will eliminate one more scenario where the kdump kernel doesn't succeed.
>

Ah, ok, if bugs in Hyper-V/Azure are being fixed then it seems
reasonable to keep the current logic (proceeding to kexec even when we
didn't receive UNLOAD). Thanks for the additional info!

-- 
Vitaly