Subject: Re: [RFC]Add new mdev interface for QoS
To: Kirti Wankhede, Alex Williamson
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, "Tian, Kevin", Zhenyu Wang, Jike Song, libvir-list@redhat.com, zhi.a.wang@intel.com
From: "Gao, Ping A"
Message-ID: <9a57d996-6d0c-daec-98b8-b31ab6acb989@intel.com>
Date: Tue, 8 Aug 2017 20:48:45 +0800
X-Mailing-List: linux-kernel@vger.kernel.org

On 2017/8/8 14:42, Kirti Wankhede wrote:
>
> On 8/7/2017 1:11 PM, Gao, Ping A wrote:
>> On 2017/8/4 5:11, Alex Williamson wrote:
>>> On Thu, 3 Aug 2017 20:26:14 +0800
>>> "Gao, Ping A" wrote:
>>>
>>>> On 2017/8/3 0:58, Alex Williamson wrote:
>>>>> On Wed, 2 Aug 2017 21:16:28 +0530
>>>>> Kirti Wankhede wrote:
>>>>>
>>>>>> On 8/2/2017 6:29 PM, Gao, Ping A wrote:
>>>>>>> On 2017/8/2 18:19, Kirti Wankhede wrote:
>>>>>>>> On 8/2/2017 3:56 AM, Alex Williamson wrote:
>>>>>>>>> On Tue, 1 Aug 2017
>>>>>>>>> 13:54:27 +0800
>>>>>>>>> "Gao, Ping A" wrote:
>>>>>>>>>
>>>>>>>>>> On 2017/7/28 0:00, Gao, Ping A wrote:
>>>>>>>>>>> On 2017/7/27 0:43, Alex Williamson wrote:
>>>>>>>>>>>> [cc +libvir-list]
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 26 Jul 2017 21:16:59 +0800
>>>>>>>>>>>> "Gao, Ping A" wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> vfio-mdev provides the capability to let different guests share the
>>>>>>>>>>>>> same physical device through mediated sharing. As a result, it brings
>>>>>>>>>>>>> a requirement about how to control the device sharing: we need a
>>>>>>>>>>>>> QoS-related interface for mdev to manage virtual device resources.
>>>>>>>>>>>>>
>>>>>>>>>>>>> E.g. in practical use, vGPUs assigned to different guests almost
>>>>>>>>>>>>> always have different performance requirements: some guests may need
>>>>>>>>>>>>> higher priority for real-time usage, others may need a larger portion
>>>>>>>>>>>>> of the GPU resource to get higher 3D performance. Correspondingly, we
>>>>>>>>>>>>> can define some interfaces like weight/cap for overall budget control
>>>>>>>>>>>>> and priority for single submission control.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So I suggest adding some common, vendor-agnostic attributes to the
>>>>>>>>>>>>> mdev core sysfs for QoS purposes.
>>>>>>>>>>>> I think what you're asking for is just some standardization of a QoS
>>>>>>>>>>>> attribute_group which a vendor can optionally include within the
>>>>>>>>>>>> existing mdev_parent_ops.mdev_attr_groups. The mdev core will
>>>>>>>>>>>> transparently enable this, but it really only provides the standard;
>>>>>>>>>>>> all of the support code is left for the vendor. I'm fine with that,
>>>>>>>>>>>> but of course the trouble with any sort of standardization is arriving
>>>>>>>>>>>> at an agreed upon standard. Are there QoS knobs that are generic
>>>>>>>>>>>> across any mdev device type? Are there others that are more specific
>>>>>>>>>>>> to vGPU? Are there existing examples of this that we can steal their
>>>>>>>>>>>> specification?
>>>>>>>>>>> Yes, you are right, standardized QoS knobs are exactly what I wanted.
>>>>>>>>>>> Only when they become part of the mdev framework and libvirt can a
>>>>>>>>>>> critical feature like QoS be leveraged for cloud usage. HW vendors then
>>>>>>>>>>> only need to focus on implementing the corresponding QoS algorithm
>>>>>>>>>>> in their back-end drivers.
>>>>>>>>>>>
>>>>>>>>>>> The vfio-mdev framework provides the capability to share a device that
>>>>>>>>>>> lacks HW virtualization support among guests. No matter the device
>>>>>>>>>>> type, mediated sharing is actually a time-sharing multiplex method.
>>>>>>>>>>> From this point of view, QoS can be taken as a generic way to control
>>>>>>>>>>> how time on the HW is assigned to each virtual mdev device. As a
>>>>>>>>>>> result, we can define QoS knobs that are generic across any device
>>>>>>>>>>> type. Even if the HW has some kind of built-in QoS support, I think
>>>>>>>>>>> it's not a problem for the back-end driver to convert the standard mdev
>>>>>>>>>>> QoS definition to its own specification to reach the same performance
>>>>>>>>>>> expectation. It seems there are no examples for us to follow, so we
>>>>>>>>>>> need to define it from scratch.
>>>>>>>>>>>
>>>>>>>>>>> I propose universal QoS control interfaces like below:
>>>>>>>>>>>
>>>>>>>>>>> Cap: The cap limits the maximum percentage of time an mdev device can
>>>>>>>>>>> own the physical device. E.g. cap=60 means the mdev device cannot take
>>>>>>>>>>> more than 60% of the total physical resource.
>>>>>>>>>>>
>>>>>>>>>>> Weight: The weight defines proportional control of the mdev device
>>>>>>>>>>> resource between guests. It's orthogonal to Cap and targets load
>>>>>>>>>>> balancing. E.g. if guest 1 should take double the mdev device resource
>>>>>>>>>>> compared with guest 2, the weight ratio needs to be set to 2:1.
>>>>>>>>>>>
>>>>>>>>>>> Priority: The guest with higher priority gets execution first,
>>>>>>>>>>> targeting real-time usage and faster interactive response.
>>>>>>>>>>>
>>>>>>>>>>> The above QoS interfaces cover both overall budget control and single
>>>>>>>>>>> submission control. I will send out a detailed design later once we
>>>>>>>>>>> are aligned.
>>>>>>>>>> Hi Alex,
>>>>>>>>>> Any comments about the interface mentioned above?
>>>>>>>>> Not really.
>>>>>>>>>
>>>>>>>>> Kirti, are there any QoS knobs that would be interesting
>>>>>>>>> for NVIDIA devices?
>>>>>>>>>
>>>>>>>> We have different types of vGPU for different QoS factors.
>>>>>>>>
>>>>>>>> When an mdev device is created, its resources are allocated irrespective
>>>>>>>> of which VM/userspace app is going to use that mdev device. Any
>>>>>>>> parameter we add here should be tied to a particular mdev device and not
>>>>>>>> to the guest/app that is going to use it. 'Cap' and 'Priority' are
>>>>>>>> along that line. Not all mdev devices might need/use these parameters,
>>>>>>>> so these can be made optional interfaces.
>>>>>>> We also define some QoS parameters in Intel vGPU types, but they only
>>>>>>> provide a default, fool-style way. We still need a flexible approach
>>>>>>> that gives users the ability to change QoS parameters freely and
>>>>>>> dynamically according to their requirements, not restricted to the
>>>>>>> current limited and static vGPU types.
>>>>>>>
>>>>>>>> In the above proposal, I'm not sure how 'Weight' would work for mdev
>>>>>>>> devices on the same physical device.
>>>>>>>>
>>>>>>>> In the above example, "if guest 1 should take double mdev device
>>>>>>>> resource compare with guest 2" but what if guest 2 never booted, how
>>>>>>>> will you calculate resources?
>>>>>>> Cap tries to limit the max physical GPU resource for a vGPU; it's a
>>>>>>> vertical limitation. Weight is a horizontal limitation that defines
>>>>>>> the GPU resource consumption ratio between vGPUs. Cap is easy to
>>>>>>> understand as it's just a percentage. For weight,
>>>>>>> for example, if we define the max weight as 16, vGPU_1 with weight 8
>>>>>>> should be assigned double the GPU resources of vGPU_2 with weight 4.
>>>>>>> We can translate it to this formula: resource_of_vGPU_1 = 8 / (8+4) *
>>>>>>> total_physical_GPU_resource.
>>>>>>>
>>>>>> How will the vendor driver provide the max weight to the userspace
>>>>>> application/libvirt? Max weight will be per physical device, right?
>>>>>>
>>>>>> How would such resource allocation reflect in 'available_instances'?
>>>>>> Suppose in the above example, vGPU_1 is of 1G FB with weight 8, vGPU_2
>>>>>> with 1G FB with weight 4 and vGPU_3 with 1G FB with weight 4. Now you
>>>>>> have 1G FB free but you have reached max weight, so will you make
>>>>>> available_instances = 0 for all types on that physical GPU?
>>>>> No, per the algorithm above, the available scheduling for the remaining
>>>>> mdev device is N / (8 + 4 + 4 + N), where N is 1-16 (or maybe 0-16,
>>>>> we'd need to define or make the range discoverable, 16 seems rather
>>>>> arbitrary). We can always add new scheduling participants. AIUI,
>>>>> Intel uses round-robin scheduling now, where you could consider all
>>>>> mdev devices to have the same weight. Whether we consider that to be a
>>>>> weight of 16 or zero or 8 doesn't really matter.
>>>> QoS is meant to control the device's processing capability, like GPU
>>>> rendering/computing, which can be time multiplexed. It is not used to
>>>> control dedicated partitioned resources like FB, so there is no impact
>>>> on 'available_instances'.
>>>>
>>>> if vGPU_1 weight=8, vGPU_2 weight=4;
>>>> then vGPU_1_res = 8 / (8 + 4) * total, vGPU_2_res = 4 / (8 + 4) * total;
>>>> if vGPU_3 created with weight 2;
>>>> then vGPU_1_res = 8 / (8 + 4 + 2) * total, vGPU_2_res = 4 / (8 + 4 + 2) *
>>>> total, vGPU_3_res = 2 / (8 + 4 + 2) * total.
>>>>
>>>> The resource allocation of vGPU_1 and vGPU_2 changes dynamically after
>>>> vGPU_3 is created. That's what weight does, as it defines the
>>>> relationship between all the vGPUs, so the performance degradation
>>>> meets expectations. The end user should know about such behavior.
>>>>
>>>> However, the argument about weight led me to some self-reflection: does
>>>> the end user really need weight? Does weight have an actual application
>>>> requirement? Maybe cap and priority are enough?
>>> What sort of SLAs do you want to be able to offer? For instance if I
>>> want to be able to offer a GPU in 1/4 increments, how does that work?
>>> I might sell customers A & B 1/4 increment each and customer C a 1/2
>>> increment. If weight is removed, can we do better than capping A & B
>>> at 25% each and C at 50%? That has the downside that nobody gets to
>>> use the unused capacity of the other clients. The SLA is some sort of
>>> "up to X% (and no more)" model. With weighting it's as simple as making
>>> sure customer C's vGPU has twice the weight of that given to A or B.
>>> Then you get an "at least X%" SLA model and any customer can use up to
>>> 100% if the others are idle. Combining weight and cap, we can do "at
>>> least X%, but no more than Y%".
>>>
>>> All of this feels really similar to how cpusets must work since we're
>>> just dealing with QoS relative to scheduling and we should not try to
>>> reinvent scheduling QoS. Thanks,
>>>
>> Yeah, those were also my original thoughts.
>> Since we are aligned on the basic QoS definition, I'm going to
>> prepare the code on the kernel side. How about the corresponding part in
>> libvirt? Should it be implemented separately after the kernel interface
>> is finalized?
>>
> Ok. These interfaces should be optional since not all mdev vendor
> drivers may support such QoS.
>

Sure, all of them are optional, free to choose or even not to choose.

Thanks,
Ping
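
As a footnote to the thread, the weight and cap arithmetic discussed above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name weighted_shares and its parameters are hypothetical and not part of any mdev or libvirt interface, and handing a capped device's unused share to the remaining devices in proportion to their weights is one possible policy that the thread does not specify. Idle devices are modelled by leaving them out of the active set, which corresponds to the "any customer can use up to 100% if the others are idle" case.

```python
from fractions import Fraction


def weighted_shares(weights, caps=None, active=None):
    """Split one physical device among mdev devices by weight, honouring caps.

    weights: {name: positive integer weight}
    caps:    {name: maximum percentage, 0-100}; no entry means no cap (assumption)
    active:  names currently submitting work; defaults to every device
    Returns {name: Fraction of the physical device's time}.
    """
    caps = caps or {}
    pending = set(weights if active is None else active)
    shares = {name: Fraction(0) for name in weights}
    remaining = Fraction(1)  # fraction of the device not yet handed out

    while pending:
        total_weight = sum(weights[name] for name in pending)
        # Proportional split of what is left: weight_i / sum(weights) * remaining.
        proposal = {name: remaining * weights[name] / total_weight
                    for name in pending}
        over = {name for name in pending
                if name in caps and proposal[name] > Fraction(caps[name], 100)}
        if not over:
            shares.update(proposal)
            break
        # Pin capped devices at their cap, then re-split the rest among the others.
        for name in over:
            shares[name] = Fraction(caps[name], 100)
            remaining -= shares[name]
        pending -= over
    return shares


if __name__ == "__main__":
    # The vGPU_1/vGPU_2/vGPU_3 example from the thread (weights 8, 4, 2).
    for name, share in weighted_shares({"vGPU_1": 8, "vGPU_2": 4, "vGPU_3": 2}).items():
        print(name, share)
    # "At least X%, but no more than Y%": customer C is weighted 2x but capped at 60%.
    print(weighted_shares({"A": 1, "B": 1, "C": 2}, caps={"C": 60}, active={"C"}))
```

Exact fractions are used instead of floats so the results can be compared directly with the hand-written formulas in the thread, e.g. 8 / (8 + 4 + 2).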