Date: Mon, 3 Aug 2015 11:22:22 -0700 (PDT)
From: Vikas Shivappa
To: Marcelo Tosatti
Cc: Martin Kletzander, Vikas Shivappa, "Auld, Will", Vikas Shivappa,
    linux-kernel@vger.kernel.org, x86@kernel.org, hpa@zytor.com,
    tglx@linutronix.de, mingo@kernel.org, tj@kernel.org,
    peterz@infradead.org, "Fleming, Matt", "Williamson, Glenn P",
    "Juvva, Kanaka D"
Subject: Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and
    cgroup usage guide
In-Reply-To: <20150803151307.GA8228@amt.cnet>
References: <1435789270-27010-1-git-send-email-vikas.shivappa@linux.intel.com>
    <1435789270-27010-4-git-send-email-vikas.shivappa@linux.intel.com>
    <20150728231516.GA16204@amt.cnet>
    <96EC5A4F3149B74492D2D9B9B1602C27461EB932@ORSMSX105.amr.corp.intel.com>
    <20150729193208.GC3201@amt.cnet>
    <20150730200812.GA10832@amt.cnet>
    <20150802154807.GA19188@wheatley>
    <20150803151307.GA8228@amt.cnet>

Hello Marcelo/Martin,

Like I mentioned, let me modify the documentation to better explain the
usage. Things like updating the bitmask on each package are already in the
patches. Let's discuss offline, come up with a well-defined proposal for
changes (if any), and then fold that into the next series. We seem to be
just looping over the same items.

Thanks,
Vikas

On Mon, 3 Aug 2015, Marcelo Tosatti wrote:

> On Sun, Aug 02, 2015 at 05:48:07PM +0200, Martin Kletzander wrote:
>> On Thu, Jul 30, 2015 at 05:08:13PM -0300, Marcelo Tosatti wrote:
>>> On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
>>>>
>>>> Marcelo,
>>>>
>>>> On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
>>>>>
>>>>> How about this:
>>>>>
>>>>> desiredclos (closid  p1  p2  p3  p4)
>>>>>                1      1   0   0   0
>>>>>                2      0   0   0   1
>>>>>                3      0   1   1   0
>>>>
>>>> #1 Currently in the rdt cgroup, the root cgroup always has all the
>>>> bits set and can't be changed (because the cgroup hierarchy would by
>>>> default force this to have all bits set, as all the children need to
>>>> have a subset of the root's bitmask). So if the user creates a cgroup
>>>> and does not put any task in it, the tasks in the root cgroup could
>>>> still be using that part of the cache. That's the reason I say we
>>>> can't have truly 'exclusive' masks.
>>>>
>>>> Or in other words - there is always a desired clos (0) which has all
>>>> parts set and which acts like a default pool.
>>>>
>>>> Also the parts can overlap. Please apply this to all the comments
>>>> below, as it changes the way they work.
>>>>
>>>>>
>>>>> p means part.
>>>>
>>>> I am assuming p = (a contiguous cache capacity bit mask)
>>>
>>> Yes.
>>>
>>>>> closid 1 is an exclusive cgroup.
>>>>> closid 2 is a "cache hog" class.
>>>>> closid 3 is "default closid".
>>>>>
>>>>> Desiredclos is what the user has specified.
>>>>>
>>>>> Transition 1: desiredclos --> effectiveclos
>>>>> Clean all bits of unused closid's
>>>>> (that must be updated whenever a
>>>>> closid1 cgroup goes from empty->nonempty
>>>>> and vice-versa).
>>>>>
>>>>> effectiveclos (closid  p1  p2  p3  p4)
>>>>>                  1      0   0   0   0
>>>>>                  2      0   0   0   1
>>>>>                  3      0   1   1   0
>>>>>
>>>>> Transition 2: effectiveclos --> expandedclos
>>>>>
>>>>> expandedclos (closid  p1  p2  p3  p4)
>>>>>                 1      0   0   0   0
>>>>>                 2      0   0   0   1
>>>>>                 3      1   1   1   0
>>>>>
>>>>> Then you have a different inplacecos for each
>>>>> CPU (see pseudo-code below), updated on the
>>>>> following events:
>>>>>
>>>>> - task migration to new pCPU:
>>>>> - task creation:
>>>>>
>>>>>     id = smp_processor_id();
>>>>>     for (part = desiredclos.p1; ...; part++)
>>>>>         /* if my cosid is set and any other
>>>>>            cosid is clear, for the part,
>>>>>            synchronize desiredclos --> inplacecos */
>>>>>         if (part[mycosid] == 1 &&
>>>>>             part[any_othercosid] == 0)
>>>>>                 wrmsr(part, desiredclos);
>>>>>
>>>>
>>>> Currently the root cgroup would have all the bits set, which will act
>>>> like a default cgroup where all the otherwise unused parts (assuming
>>>> they are a set of contiguous cache capacity bits) will be used.
>>>>
>>>> Otherwise the question is, in the expandedclos, who decides to expand
>>>> the closx parts to include some of the unused parts - could that just
>>>> always be the root?
>>>
>>> Right, so the problem is that for certain closid's you might never want
>>> to expand (because doing so would cause data to be cached in a
>>> cache way which might have a high eviction rate in the future).
>>> See the example from Will.
>>>
>>> But for the default cache (that is, "unclassified applications"), I
>>> suppose it is beneficial to expand in most cases, that is, to use the
>>> maximum amount of cache irrespective of eviction rate, which is the
>>> behaviour that exists now without CAT.
>>>
>>> So perhaps a new flag "expand=y/n" can be added to the cgroup
>>> directories... What do you say?
>>>
>>> Userspace representation of CAT
>>> -------------------------------
>>>
>>> Usage model:
>>> 1) measure application performance without L3 cache reservation.
>>> 2) measure application performance with L3 cache reservation and
>>>    X number of cache ways until the desired performance is attained.
>>>
>>> Requirements:
>>> 1) Persistence of CLOS configuration across hardware. On migration
>>>    of the operating system or application between different hardware
>>>    systems we'd like the following to be maintained:
>>>    - exclusive number of bytes (*) reserved to a certain CLOSid.
>>>    - shared number of bytes (*) reserved between a certain group
>>>      of CLOSid's.
>>>
>>>    (*) For both code and data, rounded down or up to cache way size.
>>>
>>> 2) Reasoning:
>>>    Different CBM masks on different hardware platforms might be
>>>    necessary to specify the same CLOS configuration, in terms of
>>>    exclusive number of bytes and shared number of bytes (cache-way
>>>    rounded number of bytes). For example, due to L3 allocation by
>>>    other hardware entities in certain parts of the cache, it might be
>>>    necessary to relocate the CBM mask to achieve the same CLOS
>>>    configuration.
>>>
>>> 3) Proposed format:
>>>
>>
>> A few questions from a random listener; I apologise if some of them are
>> in the wrong place due to me missing some information from past threads.
>>
>> I'm not sure whether the following proposal for the format is the
>> internal structure or what's going to be in cgroups. If this is a
>> user-visible interface, I think it could be a little less detailed.
>
> User visible interface. The idea is to have userspace code that performs
>
>   [ user visible specification ] ----> [ cbm bitmasks on present hardware
>                                          platform ]
>
> In systemd, probably (or whatever is between the user and the cgroup
> interface).
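
Just to illustrate that translation step (this is not in the patches; the
helper name, the way_size/num_ways inputs and the "hand out the next free
contiguous ways" policy are all made up for the example), a userspace-side
sketch could look roughly like:

    /*
     * Illustrative only: turn a byte-count request into a contiguous
     * capacity bitmask (CBM) for the platform at hand.  way_size and
     * num_ways would come from CPUID/sysfs on real hardware;
     * *next_free_way tracks how many ways were already handed out.
     */
    static unsigned long bytes_to_cbm(unsigned long bytes,
                                      unsigned long way_size,
                                      unsigned int num_ways,
                                      unsigned int *next_free_way,
                                      int round_down)
    {
            unsigned int ways, i;
            unsigned long cbm = 0;

            if (round_down)
                    ways = bytes / way_size;
            else
                    ways = (bytes + way_size - 1) / way_size;

            if (ways == 0 || *next_free_way + ways > num_ways)
                    return 0;       /* cannot satisfy this exclusively */

            /* CBMs must be contiguous: take the next 'ways' free bits */
            for (i = 0; i < ways; i++)
                    cbm |= 1UL << (*next_free_way + i);
            *next_free_way += ways;

            return cbm;
    }

With a 1M way size, for instance, a 2M exclusive request comes out as two
adjacent ways; on a platform with a different way size the same
specification yields a different CBM, which is the portability point made
in the requirements above.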
>
>>> sharedregionK.exclusive  - Number of exclusive cache bytes reserved for
>>>                            the shared region.
>>> sharedregionK.excl_data  - Number of exclusive cache data bytes
>>>                            reserved for the shared region.
>>> sharedregionK.excl_code  - Number of exclusive cache code bytes
>>>                            reserved for the shared region.
>>> sharedregionK.round_down - Round down to cache way bytes from the
>>>                            respective number specification (default is
>>>                            round up).
>>> sharedregionK.expand     - y/n - Expand the shared region to more cache
>>>                            ways when available (default N).
>>>
>>> cgroupN.exclusive - Number of exclusive L3 cache bytes reserved for the
>>>                     cgroup.
>>> cgroupN.excl_data - Number of exclusive L3 data cache bytes reserved
>>>                     for the cgroup.
>>> cgroupN.excl_code - Number of exclusive L3 code cache bytes reserved
>>>                     for the cgroup.
>>
>> By exclusive, you mean that it's exclusive to the tasks in this
>> cgroup?
>
> Correct.
>
>> The thing is that we must differentiate between limiting some
>> process from hogging the memory (like example 2 below) and making
>> some part of the cache exclusive for a particular application
>> (example 1 below).
>
> AFAICS there is no difference, because both require exclusive cache
> access: the hog wants exclusive access because any other user of its
> cachelines will be penalized, and the high performance application wants
> exclusive cache access because any other user of its cachelines will
> penalize it.
>
> Where do you see the need to differentiate?
>
>> I just hope we won't need to add something similar to 'isolcpus=' just
>> so we can make sure none of the tasks in the root cgroup can spoil the
>> part of the cache we need to have exclusive.
>>
>> I'm not sure creating a new subgroup and moving all the tasks there
>> would work; it certainly is not possible with other cgroups, like the
>> cpuset cgroup mentioned beforehand.
>
> Why not? You should be able to place all tasks in a given cgroup (trying
> to set up systemd to do that now...).
>
>> I also don't quite fully understand how the co-mounting with the
>> cpuset cgroup should work, but that's not design-related.
>
> Neither do I.
>
>> One more question: how does this work on systems with multiple L3
>> caches (e.g. large NUMA systems)? I'm guessing if the process is
>> running only on some CPUs, the wrmsr() will be called on those
>> particular CPU(s), right?
>
> Not in the current patchset; that has to be fixed...
>
>>> cgroupN.round_down - Round down to cache way bytes from the respective
>>>                      number specification (default is round up).
>>> cgroupN.expand     - y/n - Expand the cgroup's allocation to more cache
>>>                      ways when available (default N).
>>> cgroupN.shared     = { sharedregion1, sharedregion2, ... } (list of
>>>                      shared regions)
>>>
>>> Example 1:
>>> One application with 2M of exclusive cache, and two applications with
>>> 1M exclusive each, sharing an expandable shared region of 1M.
>>>
>>> cgroup1.exclusive = 2M
>>>
>>> sharedregion1.exclusive = 1M
>>> sharedregion1.expand = Y
>>>
>>> cgroup2.exclusive = 1M
>>> cgroup2.shared = sharedregion1
>>>
>>> cgroup3.exclusive = 1M
>>> cgroup3.shared = sharedregion1
>>>
>>> Example 2:
>>> 3 high performance applications running, one of which is a cache hog
>>> with no cache locality.
>>>
>>> cgroup1.exclusive = 8M
>>> cgroup2.exclusive = 8M
>>>
>>> cgroup3.exclusive = 512K
>>> cgroup3.round_down = Y
>>>
>>> In all cases the default cgroup (which requires no explicit
>>> specification) is expandable and uses the remaining cache ways,
>>> including the ways shared by other hardware entities.
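
Regarding Martin's multiple-L3 question and the per-package bitmask update
mentioned at the top: a rough sketch of the kind of propagation that is
needed is below. This is not code from the patch series; the helper names
and the pkg_first_cpus mask are assumptions for illustration, and the
IA32_L3_MASK_n MSR base of 0xC90 is the one documented in the SDM.

    #include <linux/cpumask.h>
    #include <linux/smp.h>
    #include <linux/types.h>
    #include <asm/msr.h>

    /* IA32_L3_MASK_n lives at 0xc90 + n (per the SDM) */
    #define IA32_L3_CBM_BASE        0xc90

    struct cbm_update {
            unsigned int closid;    /* which CLOS to reprogram */
            u64 cbm;                /* new capacity bitmask */
    };

    /* runs on one CPU in each L3 domain */
    static void cbm_update_msr(void *info)
    {
            struct cbm_update *u = info;

            wrmsrl(IA32_L3_CBM_BASE + u->closid, u->cbm);
    }

    /*
     * pkg_first_cpus holds one representative CPU per package/L3 domain;
     * how that mask gets built is a separate question.
     */
    static void cbm_update_all_packages(unsigned int closid, u64 cbm,
                                        const struct cpumask *pkg_first_cpus)
    {
            struct cbm_update u = { .closid = closid, .cbm = cbm };

            on_each_cpu_mask(pkg_first_cpus, cbm_update_msr, &u, 1);
    }

Task migration would still only need the local write to IA32_PQR_ASSOC on
the CPU the task runs on; it is the mask updates themselves that have to
reach every socket.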