MIME-Version: 1.0
References: <20230113175459.14825-1-james.morse@arm.com> <20230113175459.14825-10-james.morse@arm.com>
 <CALPaoCg4T52ju5XJC-BVX-EuZUtc67LruWbgyH5s8CoiEwOUPw@mail.gmail.com>
 <c3ca6d66-e58c-8ace-e88e-45ded5de836f@arm.com> <CALPaoCik0j7ATCv-He5HWVqbL+3njpqO1fhF5FQJO7qqT1zR3w@mail.gmail.com>
 <c8d85eae-e291-99a6-509c-94c41514ac16@arm.com>
In-Reply-To: <c8d85eae-e291-99a6-509c-94c41514ac16@arm.com>
From:   Peter Newman <peternewman@google.com>
Date:   Thu, 9 Mar 2023 14:41:08 +0100
Message-ID: <CALPaoCgEaT2oax35ezRydUZwL9bMmMFFr2wRqPe4VYAnEQ-GGg@mail.gmail.com>
Subject: Re: [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
To:     James Morse <james.morse@arm.com>
Cc:     x86@kernel.org, linux-kernel@vger.kernel.org,
        Fenghua Yu <fenghua.yu@intel.com>,
        Reinette Chatre <reinette.chatre@intel.com>,
        Thomas Gleixner <tglx@linutronix.de>,
        Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
        H Peter Anvin <hpa@zytor.com>,
        Babu Moger <Babu.Moger@amd.com>,
        shameerali.kolothum.thodi@huawei.com,
        D Scott Phillips OS <scott@os.amperecomputing.com>,
        carl@os.amperecomputing.com, lcherian@marvell.com,
        bobo.shaobowang@huawei.com, tan.shaopeng@fujitsu.com,
        xingxin.hx@openanolis.org, baolin.wang@linux.alibaba.com,
        Jamie Iles <quic_jiles@quicinc.com>,
        Xin Hao <xhao@linux.alibaba.com>,
        Stephane Eranian <eranian@google.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Precedence: bulk

Hi James,

On Wed, Mar 8, 2023 at 6:45 PM James Morse <james.morse@arm.com> wrote:
> On 06/03/2023 13:14, Peter Newman wrote:
> > On Mon, Mar 6, 2023 at 12:34 PM James Morse <james.morse@arm.com> wrote:
>
> > Instead, when configuring a counter, could you use the firmware table
> > value to compute the time when the counter will next be valid and return
> > errors on read requests received before that?
>
> The monitor might get re-allocated, re-programmed and become valid for a different
> PARTID+PMG in the mean time. I don't think these things should remain allocated over a
> return to user-space. Without doing that I don't see how we can return-early and make
> progress.
>
> How long should a CSU monitor remain allocated to a PARTID+PMG? Currently its only for the
> duration of the read() syscall on the file.
>
>
> The problem with MPAM is too much of it is optional. This particular behaviour is only
> valid for CSU monitors, (llc_occupancy), and then, only if your hardware designers didn't
> have a value to hand when the monitor is programmed, and need to do a scan of the cache to
> come up with a result. The retry is only triggered if the hardware sets NRDY.
> This is also only necessary if there aren't enough monitors for every RMID/(PARTID*PMG) to
> have its own. If there were enough, the monitors could be allocated and programmed at
> startup, and the whole thing becomes cheaper to access.
>
> If a hardware platform needs time to do this, it has to come from somewhere. I don't think
> maintaining an epoch based list of which monitor secretly belongs to a PARTID+PMG in the
> hope user-space reads the file again 'quickly enough' is going to be maintainable.
>
> If returning errors early is an important use-case, I can suggest ensuring the MPAM driver
> allocates CSU monitors up-front if there are enough (today it only does this for MBWU
> monitors). We then have to hope that folk who care about this also build hardware
> platforms with enough monitors.

Thanks, this makes more sense now. Since CSU data isn't cumulative, I
see how synchronously collecting a snapshot is useful in this situation.
I was more concerned about understanding the need for the new behavior
than getting errors back quickly.

However, I do want to be sure that MBWU counters will never be silently
deallocated, because we can only trust the data if we know that the
counter has been watching the group's tasks for the entirety of the
measurement window.

Unlike on AMD, MPAM allows software to control which PARTID+PMG the
monitoring hardware is watching. Could we instead have the user
explicitly request that the mbm_{total,local}_bytes events be allocated
to monitoring groups after creating them? Even allocating the events at
monitoring group creation, only when they're available, could be
marginally usable if a single user agent is managing rdtgroups.

Thanks!
-Peter