From: James Morse
To: Reinette Chatre, Peter Newman
Cc: Tony Luck, "Yu, Fenghua", "Eranian, Stephane", "linux-kernel@vger.kernel.org",
    Thomas Gleixner, Babu Moger, Gaurang Upasani
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group
Date: Thu, 3 Nov 2022 17:06:25 +0000
In-Reply-To: <09029c7a-489a-7054-1ab5-01fa879fb42f@intel.com>
References: <81a7b4f6-fbb5-380e-532d-f2c1fc49b515@intel.com>
    <76bb4dc9-ab7c-4cb6-d1bf-26436c88c6e2@arm.com>
    <835d769b-3662-7be5-dcdd-804cb1f3999a@arm.com>
    <09029c7a-489a-7054-1ab5-01fa879fb42f@intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Reinette,

(I've not got to the last message in this part of the thread yet - I'm out of time this
week, back Monday!)

On 21/10/2022 21:09, Reinette Chatre wrote:
> On 10/19/2022 6:57 AM, James Morse wrote:
>> On 17/10/2022 11:15, Peter Newman wrote:
>>> On Wed, Oct 12, 2022 at 6:55 PM James Morse wrote:
>>>> You originally asked:
>>>> | Any concerns about the CLOSID-reusing behavior?
>>>>
>>>> I don't think this will work well with MPAM ... I expect it will mess up the bandwidth
>>>> counters.
>>>>
>>>> MPAM's equivalent to RMID is PMG. While on x86 CLOSID and RMID are independent numbers,
>>>> this isn't true for PARTID (MPAM's version of CLOSID) and PMG. The PMG bits effectively
>>>> extended the PARTID with bits that aren't used to look up the configuration.
>>>>
>>>> x86's monitors match only on RMID, and there are 'enough' RMID... MPAMs monitors are more
>>>> complicated. I've seen details of a system that only has 1 bit of PMG space.
>>>>
>>>> While MPAM's bandwidth monitors can match just the PMG, there aren't expected to be enough
>>>> unique PMG for every control/monitor group to have a unique value. Instead, MPAM's
>>>> monitors are expected to be used with both the PARTID and PMG.
>>>>
>>>> ('bandwidth monitors' is relevant here, MPAM's 'cache storage utilisation' monitors can't
>>>> match on just PMG at all - they have to be told the PARTID too)
>>>>
>>>>
>>>> If you're re-using CLOSID like this, I think you'll end up with noisy measurements on MPAM
>>>> systems as the caches hold PARTID/PMG values from before the re-use pattern changed, and
>>>> the monitors have to match on both.
>>
>>> Yes, that sounds like it would be an issue.
>>>
>>> Following your refactoring changes, hopefully the MPAM driver could
>>> offer alternative methods for managing PARTIDs and PMGs depending on the
>>> available hardware resources.
>>
>> Mmmm, I don't think anything other than one-partid per control group and one-pmg per
>> monitor group makes much sense.
>>
>>
>>> If there are a lot more PARTIDs than PMGs, then it would fit well with a
>>> user who never creates child MON groups. In case the number of MON
>>> groups gets ahead of the number of CTRL_MON groups and you've run out of
>>> PMGs, perhaps you would just try to allocate another PARTID and program
>>> the same partitioning configuration before giving up.
>>
>> User-space can choose to do this.
>> If the kernel tries to be clever and do this behind user-space's back, it needs to
>> allocate two monitors for this secretly-two-control-groups, and always sum the counters
>> before reporting them to user-space.
> If I understand this scenario correctly, the kernel is already doing this.
> As implemented in mon_event_count() the monitor data of a CTRL_MON group is
> the sum of the parent CTRL_MON group and all its child MON groups.

That is true.

MPAM has an additional headache here as it needs to allocate a monitor in order to read the
counters. If there are enough monitors for each CLOSID*RMID to have one, then MPAM can
export the counter files in the same way RDT does. While there are systems that have enough
monitors, I don't think this is going to be the norm.
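
To make the monitor cost of that summing concrete, here is a toy model in Python. It is
only a sketch, not kernel code: the class and function names are hypothetical, and it
assumes that each counter read as part of the sum needs one monitor of its own.

```python
from dataclasses import dataclass, field


@dataclass
class MonGroup:
    """A child MON group with its own byte count (hypothetical model)."""
    name: str
    bytes_counted: int


@dataclass
class CtrlMonGroup:
    """A parent CTRL_MON group and its child MON groups (hypothetical model)."""
    name: str
    own_bytes: int
    children: list = field(default_factory=list)


def ctrl_mon_total(group):
    """Value reported for the CTRL_MON group: parent plus all child MON groups."""
    return group.own_bytes + sum(child.bytes_counted for child in group.children)


def monitors_needed(group):
    """Assumption: one monitor per counter in the sum (parent plus each child)."""
    return 1 + len(group.children)


def can_read_all_at_once(group, free_monitors):
    """True if every counter in the sum can hold a monitor at the same time."""
    return monitors_needed(group) <= free_monitors


group = CtrlMonGroup("ctrl1", 100, [MonGroup("mon_a", 20), MonGroup("mon_b", 5)])
print(ctrl_mon_total(group), monitors_needed(group), can_read_all_at_once(group, 2))
```

With RDT-style counters the sum is always readable. In this model, the same sum on a
monitor-poor system is only readable if enough monitors are free for every term.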
To allow systems that don't have a surfeit of monitors to use the counters, I plan to export
the values from resctrl_arch_rmid_read() via perf. (but only for bandwidth counters)

The problem is that moving a group of tasks around N groups requires N monitors to be
allocated, and stay allocated until those groups pass through limbo. The perf stuff can't
allocate more monitors once it's started.

Even without perf, the only thing that limits the list of other counters that have to be
read is the number of PARTID*PMG. It doesn't look like a very sensible design.


>> If monitors are a contended resource, then you may be unable to monitor the
>> secretly-two-control-groups group once the kernel has done this.
>
> I am not viewing this as "secretly-two-control-groups" - there would still be
> only one parent CTRL_MON group that dictates all the allocations. MON groups already
> have a CLOSID (PARTID) property but at this time it is always identical to the parent
> CTRL_MON group. The difference introduced is that some of the child MON groups
> may have a different CLOSID (PARTID) from the parent.
>
>>
>> I don't think the kernel should try to be too clever here.
> That is a fair concern but it may be worth exploring as it seems to address
> a few ABI concerns and user space seems to be eyeing using a future "num_closid"
> info as a check of "RDT/PQoS" vs "MPAM".

I think the solution to all this is:
 * Add rename support to move a monitor group between two control groups.
 ** On x86, this is guaranteed to preserve the RMID, so the destination counter continues
    unaffected.
 ** On arm64, the PARTID is also relevant to the monitors, so the old counters will
    continue to count.

Whether these old counters keep counting needs exposing to user-space so that it is aware.

To solve Peter's use-case, we also need:
 * to expose how many new groups can be created at each level.
This is because MPAM doesn't have a property like num_rmid.

Combined, these should solve the cases Peter describes.
User-space can determine if the platform is control-group-rich or monitor-group-rich, and
build the corresponding structure to make best use of the resources.


Thanks,

James
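
The difference between the two architectures when a monitor group is renamed into
another control group can be illustrated with a small Python model. This is a sketch
under stated assumptions, not how either driver is implemented: the class names and
`run_move()` are hypothetical, and counters are reduced to a dictionary keyed by whatever
the monitor matches on (RMID alone for the x86-style case, PARTID and PMG together for
the MPAM-style case).

```python
from collections import defaultdict


class RmidCounters:
    """x86-style model: monitors match on RMID only (hypothetical class)."""

    def __init__(self):
        self.counts = defaultdict(int)

    def account(self, closid, rmid, nbytes):
        # The control ID plays no part in which counter is updated.
        self.counts[rmid] += nbytes

    def read(self, closid, rmid):
        return self.counts[rmid]


class PartidPmgCounters:
    """MPAM-style model: monitors match on PARTID and PMG together (hypothetical class)."""

    def __init__(self):
        self.counts = defaultdict(int)

    def account(self, partid, pmg, nbytes):
        self.counts[(partid, pmg)] += nbytes

    def read(self, partid, pmg):
        return self.counts[(partid, pmg)]


def run_move(counters):
    """Count traffic before and after moving a monitor group to another control group."""
    counters.account(1, 7, 100)  # before the move: control ID 1, monitor ID 7
    counters.account(2, 7, 50)   # after the move: control ID 2, same monitor ID
    return counters.read(2, 7)   # what the moved group's counter now reports


x86 = RmidCounters()
mpam = PartidPmgCounters()
print(run_move(x86))    # history is preserved because the RMID did not change
print(run_move(mpam))   # only traffic since the move; the rest stays under (1, 7)
print(mpam.read(1, 7))  # the old counter still holds, and can keep gaining, counts
```

In the MPAM-style model the pre-move traffic stays attributed to the old PARTID/PMG pair,
which is why user-space would need to be told that the old counters are still live.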