Date: Fri, 18 Aug 2017 09:57:58 -0400
From: Konrad Rzeszutek Wilk
To: Radim Krčmář
Cc: Lan Tianyu, David Hildenbrand, pbonzini@redhat.com, tglx@linutronix.de,
    mingo@redhat.com, hpa@zytor.com, x86@kernel.org, kvm@vger.kernel.org,
    linux-kernel@vger.kernel.org
Subject: Re: [PATCH] KVM/x86: Increase max vcpu number to 352
Message-ID: <20170818135758.GE11671@char.us.oracle.com>
In-Reply-To: <20170815161328.GB5975@flask>

On Tue, Aug 15, 2017 at 06:13:29PM +0200, Radim Krčmář wrote:
> (Missed this mail before my last reply.)
>
> 2017-08-15 10:10-0400, Konrad Rzeszutek Wilk:
> > On Tue, Aug 15, 2017 at 11:00:04AM +0800, Lan Tianyu wrote:
> > > On 2017-08-12 03:35, Konrad Rzeszutek Wilk wrote:
> > > > Migration with 352 CPUs all being busy dirtying memory and also poking
> > > > at various I/O ports (say all of them dirtying the VGA) is no problem?
> > >
> > > This depends on what kind of workload is running during migration. I
> > > think this may affect service downtime since there may be a lot of dirty
> > > memory data to transfer after stopping the vcpus. This also depends on
> > > how the user sets "migrate_set_downtime" for qemu. But I think increasing
> > > vcpus will break the migration function.
> >
> > OK, so let me take a step back.
> >
> > I see this nice 'supported' CPU count that is exposed in the kvm module.
> >
> > Then there is QEMU throwing out a warning if you crank up the CPU count
> > above that number.
>
> I find the range between "recommended max" and "hard max" VCPU count
> confusing at best ... IIUC, it was there because KVM internals had
> problems with scaling, and we will hit more in the future because some
> loops are still linear in VCPU count.

Is that documented somewhere? There are some folks who would be interested
in looking at that if it were known what exactly to look for.

> The exposed value doesn't say whether migration will work, because that
> is a userspace thing and we're not aware of bottlenecks on the KVM side.
>
> > Red Hat's web-pages talk about CPU count as well.
> >
> > And I am assuming all of those are around what has been tested and
> > what has been shown to work. And one of those test-cases surely must
> > be migration.
>
> Right, Red Hat will only allow/support what it has tested, even if
> upstream has a practically unlimited count. I think the upstream number
> used to be raised by Red Hat, which is why upstream isn't at the hard
> implementation limit ...

Aim for the sky! Perhaps then let's crank it up to 4096 upstream and let
each vendor/distro/cloud decide the right number based on their testing.
And also have more folks report issues as they try running these huge
vCPU guests?

> > Ergo, if the vCPU count increase will break migration, then it is
> > a regression.
>
> Raising the limit would not break existing guests, but I would rather
> avoid adding a higher VCPU count as a feature that disables migration.
>
> > Or does a fix/work need to be done to support a higher CPU count for
> > migrating?
>
> Post-copy migration should handle a higher CPU count, and it is the
> default fallback in QEMU. Asking the question on a userspace list would
> yield better answers, though.
>
> Thanks.
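For what it's worth, the downtime and post-copy knobs referenced above are
QEMU monitor (HMP) commands, not KVM ones. A sketch of the sequence on the
source side (the destination host name is made up for illustration):

```
(qemu) migrate_set_downtime 0.3
(qemu) migrate_set_capability postcopy-ram on
(qemu) migrate -d tcp:dest-host:4444
(qemu) migrate_start_postcopy
```

With postcopy-ram enabled, the guest switches to running on the destination
before all memory has been transferred, so heavy dirtying by many vCPUs no
longer prevents convergence; the trade-off is that remaining pages are
faulted over the network on demand.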