Subject: Re: [RFC] Add new mdev interface for QoS
From: "Gao, Ping A"
To: Alex Williamson, Kirti Wankhede
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, "Tian, Kevin", Zhenyu Wang, Jike Song, libvir-list@redhat.com, zhi.a.wang@intel.com
Date: Thu, 3 Aug 2017 20:26:14 +0800

On 2017/8/3 0:58, Alex Williamson wrote:
> On Wed, 2 Aug 2017 21:16:28 +0530, Kirti Wankhede wrote:
>> On 8/2/2017 6:29 PM, Gao, Ping A wrote:
>>> On 2017/8/2 18:19, Kirti Wankhede wrote:
>>>> On 8/2/2017 3:56 AM, Alex Williamson wrote:
>>>>> On Tue, 1 Aug 2017 13:54:27 +0800, "Gao, Ping A" wrote:
>>>>>> On 2017/7/28 0:00, Gao, Ping A wrote:
>>>>>>> On 2017/7/27 0:43, Alex Williamson wrote:
>>>>>>>> [cc +libvir-list]
>>>>>>>>
>>>>>>>> On Wed, 26 Jul 2017 21:16:59 +0800, "Gao, Ping A" wrote:
>>>>>>>>
>>>>>>>>> vfio-mdev provides the capability to let different guests share the same physical device through mediated sharing. As a result, it brings a requirement for controlling how the device is shared: we need a QoS-related interface for mdev to manage virtual device resources.
>>>>>>>>>
>>>>>>>>> E.g. in practical use, vGPUs assigned to different guests almost always have different performance requirements: some guests may need higher priority for real-time usage, others may need a larger portion of the GPU resource to get higher 3D performance. Correspondingly, we can define interfaces like weight/cap for overall budget control and priority for per-submission control.
>>>>>>>>>
>>>>>>>>> So I suggest adding some common, vendor-agnostic attributes to the mdev core sysfs for QoS purposes.
>>>>>>>>
>>>>>>>> I think what you're asking for is just some standardization of a QoS attribute_group which a vendor can optionally include within the existing mdev_parent_ops.mdev_attr_groups. The mdev core will transparently enable this, but it really only provides the standard; all of the support code is left to the vendor. I'm fine with that, but of course the trouble with any sort of standardization is arriving at an agreed-upon standard. Are there QoS knobs that are generic across any mdev device type? Are there others that are more specific to vGPU? Are there existing examples of this whose specification we can steal?
>>>>>>> Yes, you are right, standardized QoS knobs are exactly what I wanted. Only when they become part of the mdev framework and libvirt can such a critical feature as QoS be leveraged for cloud usage. HW vendors then only need to focus on implementing the corresponding QoS algorithm in their back-end driver.
>>>>>>>
>>>>>>> The vfio-mdev framework provides the capability to share a device that lacks HW virtualization support among guests, no matter the device type. Mediated sharing is essentially a time-sharing multiplexing method; from this point of view, QoS can be taken as a generic way to control how much time a virtual mdev device may occupy the HW. As a result, we can define QoS knobs that are generic across any device type in this way. Even if the HW has some kind of built-in QoS support, I think it's not a problem for the back-end driver to convert the standard mdev QoS definition into its own specification and reach the same performance expectation. It seems there are no existing examples for us to follow, so we need to define it from scratch.
>>>>>>>
>>>>>>> I propose universal QoS control interfaces like below:
>>>>>>>
>>>>>>> Cap: The cap limits the maximum percentage of time an mdev device can own the physical device. E.g. cap=60 means the mdev device cannot take more than 60% of the total physical resource.
>>>>>>>
>>>>>>> Weight: The weight defines proportional control of the mdev device resource between guests; it's orthogonal to cap and targets load balancing. E.g. if guest 1 should take double the mdev device resource compared with guest 2, set the weight ratio to 2:1.
>>>>>>>
>>>>>>> Priority: The guest with higher priority gets execution first, targeting real-time usage and faster interactive response.
>>>>>>>
>>>>>>> The above QoS interfaces cover both overall budget control and per-submission control. I will send out a detailed design later once we are aligned.
>>>>>>
>>>>>> Hi Alex, any comments about the interface mentioned above?
>>>>>
>>>>> Not really.
>>>>>
>>>>> Kirti, are there any QoS knobs that would be interesting for NVIDIA devices?
>>>>
>>>> We have different types of vGPU for different QoS factors.
>>>>
>>>> When an mdev device is created, its resources are allocated irrespective of which VM/userspace app is going to use that mdev device. Any parameter we add here should be tied to a particular mdev device and not to the guest/app that is going to use it. 'Cap' and 'Priority' are along that line. Not all mdev devices might need/use these parameters; these can be made optional interfaces.
>>>
>>> We also define some QoS parameters in the Intel vGPU types, but they only provide fixed defaults. We still need a flexible approach that gives the user the ability to change QoS parameters freely and dynamically according to their requirements, not restricted to the current limited and static vGPU types.
>>>
>>>> In the above proposal, I'm not sure how 'Weight' would work for mdev devices on the same physical device.
>>>>
>>>> In the above example, "if guest 1 should take double the mdev device resource compared with guest 2", but what if guest 2 never boots, how will you calculate resources?
>>>
>>> Cap tries to limit the max physical GPU resource a vGPU can use; it's a vertical limitation, while weight is a horizontal limitation that defines the GPU resource consumption ratio between vGPUs.
>>> Cap is easy to understand as it's just a percentage. For weight, for example, if we define the max weight as 16, a vGPU_1 that gets weight 8 should be assigned double the GPU resources compared to a vGPU_2 whose weight is 4; we can translate it into this formula: resource_of_vGPU_1 = 8 / (8 + 4) * total_physical_GPU_resource.
>>
>> How will the vendor driver provide the max weight to the userspace application/libvirt? Max weight will be per physical device, right?
>>
>> How would such resource allocation be reflected in 'available_instances'? Suppose in the above example vGPU_1 has 1G FB with weight 8, vGPU_2 has 1G FB with weight 4 and vGPU_3 has 1G FB with weight 4. Now you have 1G FB free but you have reached the max weight, so will you make available_instances = 0 for all types on that physical GPU?
>
> No, per the algorithm above, the available scheduling for the remaining mdev device is N / (8 + 4 + 4 + N), where N is 1-16 (or maybe 0-16; we'd need to define or make the range discoverable, 16 seems rather arbitrary). We can always add new scheduling participants. AIUI, Intel uses round-robin scheduling now, where you could consider all mdev devices to have the same weight. Whether we consider that to be a weight of 16 or zero or 8 doesn't really matter.

QoS is meant to control the device's processing capability, like GPU rendering/computing, which can be time-multiplexed; it is not used to control dedicated partitioned resources like FB, so there is no impact on 'available_instances'.

If vGPU_1 has weight=8 and vGPU_2 has weight=4, then vGPU_1_res = 8 / (8 + 4) * total and vGPU_2_res = 4 / (8 + 4) * total. If vGPU_3 is then created with weight 2, vGPU_1_res = 8 / (8 + 4 + 2) * total, vGPU_2_res = 4 / (8 + 4 + 2) * total and vGPU_3_res = 2 / (8 + 4 + 2) * total.

The resource allocation of vGPU_1 and vGPU_2 changes dynamically after vGPU_3 is created; that is what weight does, since it defines the relationship among all the vGPUs, so the performance degradation meets expectation. The end-user should know about such behavior.

However, the argument on weight makes me reflect a bit: does the end-user really need weight? Is there an actual application requirement for it? Maybe cap and priority are enough?

>>> If only one guest exists, then there is no target to compare against; weight becomes meaningless and the single guest enjoys the whole physical GPU.
>>
>> If a single VM, say with vGPU_1, is running for a long time, i.e. it enjoys the whole GPU, but then another VM boots with weight 4, will you cut down the resources of vGPU_1 at runtime? Wouldn't that show performance degradation for the VM with vGPU_1 at runtime?
>
> Yes. We have this already though: vGPU_1 may enjoy the whole GPU simply because the other vGPUs are idle; that can change at any time and may reduce the resources available to vGPU_1. Do we want a QoS knob for fixed scheduling slices? With only cap, weight, and priority, how could I provide an SLA for no less than 40% of the GPU? I guess we can get that with careful use of weight, but I wonder if we could make it simpler for users.
>
>>>> If libvirt/another toolstack decides to do smart allocation based on type name without taking the physical host device as input, guest 1 and guest 2 might get mdev devices created on different physical devices. Then would weightage matter here?
>>> If you mean the case where there are two discrete GPU cards and the vGPU types can be freely allocated on them, IMO the back-end driver should handle such a case, as the number of physical devices is transparent to the toolstack, e.g. by presenting multiple physical devices as one logical device to mdev.
>>
>> No, generally the toolstack is aware of the available physical devices and could have smart logic to decide on which physical device an mdev device should be created, i.e. to load one physical device first or to distribute the load across physical devices as mdev devices are created. Libvirt doesn't have such logic now, but having it in libvirt was discussed earlier.
>>
>> Then in that case, as I said above, wouldn't that show perf degradation on running VMs at runtime?
>
> It seems that the proposed cap, weight, and priority only handle QoS within a single parent device. All the knobs are relative to other scheduling participants on that parent device. The same QoS parameters for mdev devices on separate parent devices could have wildly different performance characteristics depending on the load the other mdev devices are inflicting. If there's only one such parent device on the system, this works. libvirt has already effectively rejected the idea of automating mdev placement and perhaps this is another similar case where we simply require some higher level management tool to have a global view of the system. Thanks,

Yeah, the QoS proposal only tries to handle a single parent device. For the multi-device case we need to define the management at a higher level.

Thanks,
Ping
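P.S. To make the proposal a bit more concrete, here is a rough sketch of how a vendor driver might expose such an optional, standardized QoS attribute group through its existing mdev_parent_ops.mdev_attr_groups, as Alex suggested. This is only an illustration, not a final ABI: the attribute names (cap/weight/priority), their ranges, the "qos" group name and the mdev_qos_state/to_qos_state helpers are placeholders, and all enforcement (e.g. turning weight into a share as share_i = weight_i / sum_of_active_weights) would stay in the vendor back-end scheduler.

/*
 * Illustrative sketch only: a vendor-provided "qos" attribute group that
 * could be listed in mdev_parent_ops.mdev_attr_groups.  Names, ranges and
 * the per-device lookup are placeholders; error handling is minimal.
 */
#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/mdev.h>

/* Hypothetical per-mdev QoS state kept by the vendor driver. */
struct mdev_qos_state {
	unsigned int cap;	/* max % of device time, 0-100 */
	unsigned int weight;	/* relative share, e.g. 1-16 */
	unsigned int priority;	/* scheduling priority, e.g. 0-15 */
};

/* Placeholder: a real driver would look up its own per-mdev private data. */
static struct mdev_qos_state *to_qos_state(struct device *dev)
{
	return dev_get_drvdata(dev);
}

static ssize_t qos_store(const char *buf, size_t count,
			 unsigned int *field, unsigned int max)
{
	unsigned int val;

	if (kstrtouint(buf, 0, &val) || val > max)
		return -EINVAL;
	*field = val;
	/* A real driver would also kick its scheduler to apply the change. */
	return count;
}

static ssize_t cap_show(struct device *dev, struct device_attribute *attr,
			char *buf)
{
	return sprintf(buf, "%u\n", to_qos_state(dev)->cap);
}
static ssize_t cap_store(struct device *dev, struct device_attribute *attr,
			 const char *buf, size_t count)
{
	return qos_store(buf, count, &to_qos_state(dev)->cap, 100);
}
static DEVICE_ATTR_RW(cap);

static ssize_t weight_show(struct device *dev, struct device_attribute *attr,
			   char *buf)
{
	return sprintf(buf, "%u\n", to_qos_state(dev)->weight);
}
static ssize_t weight_store(struct device *dev, struct device_attribute *attr,
			    const char *buf, size_t count)
{
	return qos_store(buf, count, &to_qos_state(dev)->weight, 16);
}
static DEVICE_ATTR_RW(weight);

static ssize_t priority_show(struct device *dev, struct device_attribute *attr,
			     char *buf)
{
	return sprintf(buf, "%u\n", to_qos_state(dev)->priority);
}
static ssize_t priority_store(struct device *dev, struct device_attribute *attr,
			      const char *buf, size_t count)
{
	return qos_store(buf, count, &to_qos_state(dev)->priority, 15);
}
static DEVICE_ATTR_RW(priority);

static struct attribute *mdev_qos_attrs[] = {
	&dev_attr_cap.attr,
	&dev_attr_weight.attr,
	&dev_attr_priority.attr,
	NULL,
};

/* Would appear in sysfs as <mdev device>/qos/{cap,weight,priority}. */
static const struct attribute_group mdev_qos_attr_group = {
	.name = "qos",
	.attrs = mdev_qos_attrs,
};

The vendor driver would then add &mdev_qos_attr_group to the NULL-terminated array it already passes as mdev_parent_ops.mdev_attr_groups; the mdev core creates the sysfs files, and the back-end scheduler interprets the values.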