MIME-Version: 1.0
In-Reply-To: <1481929988-31569-12-git-send-email-vikas.shivappa@linux.intel.com>
References: <1481929988-31569-1-git-send-email-vikas.shivappa@linux.intel.com> <1481929988-31569-12-git-send-email-vikas.shivappa@linux.intel.com>
From: David Carrillo-Cisneros <davidcc@google.com>
Date: Fri, 23 Dec 2016 03:58:19 -0800
Message-ID: <CALcN6mgiy__nqxvb9-tzNPt9HEbdUC0+eiwsLqer+UYqPXitHA@mail.gmail.com>
Subject: Re: [PATCH 11/14] x86/cqm: Add failure on open and read
To: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Vikas Shivappa <vikas.shivappa@intel.com>,
        linux-kernel <linux-kernel@vger.kernel.org>, x86 <x86@kernel.org>,
        Thomas Gleixner <tglx@linutronix.de>,
        Peter Zijlstra <peterz@infradead.org>,
        "Shankar, Ravi V" <ravi.v.shankar@intel.com>,
        "Luck, Tony" <tony.luck@intel.com>, Fenghua Yu <fenghua.yu@intel.com>,
        andi.kleen@intel.com, Stephane Eranian <eranian@google.com>,
        hpa@zytor.com
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 7169
Lines: 180

On Fri, Dec 16, 2016 at 3:13 PM, Vikas Shivappa
<vikas.shivappa@linux.intel.com> wrote:
> To provide reliable output to the user, cqm throws error when it does
> not have enough RMIDs to monitor depending upon the mode user choses.
> This also takes care to not overuse RMIDs. Default is LAZY mode.
>
> NOLAZY mode: This patch adds a file mon_mask in the perf_cgroup which
> indicates the packages which the user wants guaranteed monitoring. For
> such cgroup events RMIDs are assigned at event create and we fail if
> enough RMIDs are not present. This is basically a NOLAZY allocation of
> RMIDs. This mode can be used in real time scenarios where user is sure
> that tasks that are monitored are scheduled.
>
> LAZY mode: If user did not enable the NOLAZY mode, RMIDs are allocated
> only when tasks are actually scheduled. Upon failure to obtain RMIDs it
> indicates a failure in read. Typical use case for this mode could be to
> start monitoring cgroups which still donot have any tasks in them and
> such cgroups are part of large number of cgroups which are monitored -
> that way we donot overuse RMIDs.
>

The proposed interface is:
  - a global boolean cqm_cont_monitoring.
  - a per-package boolean in the bitfield cqm_mon_mask.

So, for each package there will be four states, yet one of them is
not meaningful:

cont_monitoring, cqm_mon_mask[p]: meaning
------------------------------------------
0, 0 : off
0, 1 : off but reserve a RMID that is not going to be used?
1, 0 : on with LAZY
1, 1 : on with NOLAZY

The case 0, 1 is problematic.

How can new cases be added in the future? With another file? What's wrong
with having a cont_monitoring file of the form
  pkg0_flags;pkg1_flags;...;pkgn_flags
which is more akin to the RDT Allocation format? (There is a parser
function and implementation for that format in v3 of my CMT series.)
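
To make the format concrete, here is a rough user-space sketch of a
parser for it. It is not the parser from the CMT series; the function
name parse_pkg_flags and the choice of one unsigned integer per package
are assumptions made only for this example.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: parse "f0;f1;...;fn" into flags[], one value per
 * package. Returns the number of packages parsed, or -1 if a field is
 * empty or non-numeric, or if there are more than 'max' fields.
 * A single trailing ';' is accepted. The input is not modified.
 */
static int parse_pkg_flags(const char *buf, unsigned long *flags, int max)
{
	int n = 0;

	while (*buf) {
		char *end;
		unsigned long v;

		if (n == max)
			return -1;	/* more fields than packages */
		v = strtoul(buf, &end, 0);
		if (end == buf)
			return -1;	/* empty or non-numeric field */
		flags[n++] = v;
		if (*end == ';')
			buf = end + 1;	/* move on to the next package */
		else if (*end == '\0')
			buf = end;	/* end of input */
		else
			return -1;	/* unexpected separator */
	}
	return n ? n : -1;
}
```

A new per-package state would then only need a new flag value, not a
new file.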


Below is a full discussion about how many per-package configuration states are
useful now and if/when RMID rotation is added.


There are two types of error sources introduced by not having a RMID
when one is needed:
  - E_read : Introduced into the measurement when stealing a
    RMID with non-zero occupancy.
  - E_sched: Introduced when a thread runs but no RMID is available for it.

A user may have one of two tolerance levels to error, which determine
whether an event can be read or the read should fail:
  - NoTol  : No tolerance to error at all. If there has been any type of
    E_read or E_sched in the past, read must give an error.
  - SomeTol: Tolerate _some_ error. It can be defined in terms of time,
    magnitude or both. As an example, in v3 of my CMT patches, I assumed
    a user would tolerate an error that occurred more than an arbitrarily
    chosen time in the past. The minimum criterion is that there should
    at least be a RMID at the time of read.
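
A literal reading of the two tolerance levels can be sketched as a read
check. The struct mon_state bookkeeping, the field names and the min_age
parameter are hypothetical and not taken from any posted patch; only the
time-based form of SomeTol is shown.

```c
#include <stdbool.h>
#include <stdint.h>

enum tolerance { NO_TOL, SOME_TOL };

/* Hypothetical per-event bookkeeping, for illustration only. */
struct mon_state {
	bool has_rmid;		/* a RMID is assigned right now */
	bool had_error;		/* some E_read or E_sched has occurred */
	uint64_t last_error;	/* time of the most recent error */
};

/*
 * Return true if a read at time 'now' may report a value.
 * Assumes now >= s->last_error.
 */
static bool read_allowed(const struct mon_state *s, enum tolerance tol,
			 uint64_t now, uint64_t min_age)
{
	if (tol == NO_TOL)
		return !s->had_error;	/* any past error fails the read */

	/* SOME_TOL: the minimum criterion is a RMID at the time of read. */
	if (!s->has_rmid)
		return false;
	if (!s->had_error)
		return true;
	/* Tolerate only errors older than the chosen time. */
	return now - s->last_error > min_age;
}
```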

The driver can follow two types of RMID allocation policies:
  - NoLazy: reserve a RMID as soon as the user starts monitoring (when the
    event is created or cont_monitoring is set). This policy introduces
    no error.
  - Lazy: reserve a RMID the first time a task is scheduled in. May
    introduce E_sched if no RMID is available on sched in.

and three RMID deallocation policies:
  - Fixed: RMID can never be stolen. This policy introduces no error
    into the measurement.
  - Reuse: RMID can be stolen when no scheduled thread is using it and it
    has zero occupancy. This policy may introduce E_sched when no RMID is
    available on sched_in after an incidence of reuse.
  - Steal: RMID can be stolen at any time. This policy introduces both
    E_sched and E_read errors into the measurement (this is the so-called
    RMID rotation).

Therefore there are three possible risk levels:
  - No Risk: possible with NoLazy & Fixed
  - Risk of E_sched: possible with NoLazy & Reuse, Lazy & Fixed or
    Lazy & Reuse
  - Risk of E_sched and E_read: possible with NoLazy & Steal or Lazy & Steal
Notes:
  a) E_read only is impossible.
  b) In "No Risk" a RMID must be allocated in advance and never released,
     even if unused (e.g. a task may run only in one package but we
     allocate a RMID in all of them).
  c) For the E_sched risk, Lazy & Reuse give the highest RMID flexibility.
  d) For the E_read and E_sched risk, NoLazy & Steal give the highest
     RMID flexibility.
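
The mapping from policies to risk levels can be written down as a small
function. This is only a restatement of the lists above; the enum and
macro names are made up for the example.

```c
/* Illustrative names only; none of these exist in the driver. */
enum alloc_policy   { ALLOC_NOLAZY, ALLOC_LAZY };
enum dealloc_policy { DEALLOC_FIXED, DEALLOC_REUSE, DEALLOC_STEAL };

#define RISK_E_SCHED	(1u << 0)
#define RISK_E_READ	(1u << 1)

/* Return the set of error sources a policy pair may introduce. */
static unsigned int policy_risk(enum alloc_policy a, enum dealloc_policy d)
{
	unsigned int risk = 0;

	if (a == ALLOC_LAZY)
		risk |= RISK_E_SCHED;	/* no RMID on first sched in */
	if (d == DEALLOC_REUSE)
		risk |= RISK_E_SCHED;	/* no RMID on sched in after reuse */
	if (d == DEALLOC_STEAL)
		risk |= RISK_E_SCHED | RISK_E_READ;
	return risk;
}
```

No policy pair returns RISK_E_READ alone, which is note (a).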


Combining all three criteria, the possible configuration modes that
make sense are:
  1) No monitoring.
  2) NoLazy & Fixed & NoTol. RMID is allocated when the event is created
     (or cont_monitoring is set). No possible error. May waste RMIDs.
  3) Lazy & Reusable & NoTol. RMIDs are allocated as needed and taken away
     when unused. May fail to find a RMID if there is RMID contention; once
     it fails, the event/cgroup must be in error state.
  4) Lazy & Reusable & SomeTol. Similar to (3), but the event/cgroup
     recovers from error state if a recovered RMID stays valid for long
     enough.
  5 and 6) Lazy allocation & Stealable, with and without Tol. RMID can be
     stolen even if non-empty or in use.

Q. Which modes are useful?

Stephane and I see a clear use for (2). Users of cont_monitoring look to
avoid error and may tolerate wasted RMIDs. It has the advantage of
allowing failure at event creation (or when cont_monitoring is set).
This is the same mode introduced with NOLAZY in cqm_mon_mask in this patch.

Mode (3) can be viewed as an optimistic approach to RMID allocation that
allows more concurrent users than (2) when cache occupancy drops quickly
and/or tasks/cgroups manifest strong package locality. It still
guarantees exact measurements (within hw constraints) when read succeeds.

Mode (4) is more useful than (3) _if_ it can be assumed that the system
will replace enough cache lines before the tolerance time expires
(otherwise it reads just garbage). Yet, it's not clear to me how often
this assumption is valid.

Modes (5) and (6) require RMID rotation, so they wouldn't be part of
this patch series.


> +static ssize_t cqm_mon_mask_write(struct kernfs_open_file *of,
> +                                   char *buf, size_t nbytes, loff_t off)
> +{
> +       cpumask_var_t tmp_cpus, tmp_cpus1;
> +       struct cgrp_cqm_info *cqm_info;
> +       unsigned long flags;
> +       int ret = 0;
> +
> +       buf = strstrip(buf);
> +
> +       if (!zalloc_cpumask_var(&tmp_cpus, GFP_KERNEL) ||
> +               !zalloc_cpumask_var(&tmp_cpus1, GFP_KERNEL)) {
> +               ret = -ENOMEM;
> +               goto out;
> +       }
> +
> +       ret = cpulist_parse(buf, tmp_cpus);
> +       if (ret)
> +               goto out;
> +
> +       if (cpumask_andnot(tmp_cpus1, tmp_cpus, &cqm_pkgmask)) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +
> +       raw_spin_lock_irqsave(&cache_lock, flags);
> +       cqm_info = css_to_cqm_info(of_css(of));
> +       cpumask_copy(&cqm_info->mon_mask, tmp_cpus);
> +       raw_spin_unlock_irqrestore(&cache_lock, flags);

So this only copies the mask so that it can be used for the next
cgroup event in intel_cqm_setup_event? That defeats the purpose of a
NON_LAZY cont_monitoring.

There is no need to create a new cgroup file only to provide a non-lazy
event; such a flag could be passed in perf_event_attr::pinned or a config
field.
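
From user space, that suggestion would look roughly like the sketch
below. Note the assumption: interpreting attr.pinned as "allocate the
RMID at event creation" is only what is proposed here, not what the
driver does today. The helper name init_cqm_attr is made up, and the PMU
type and config values are left to the caller.

```c
#include <linux/perf_event.h>
#include <string.h>

/*
 * Hypothetical helper: fill a perf_event_attr for a monitoring event.
 * 'non_lazy' is carried in attr->pinned; treating that bit as a request
 * for non-lazy RMID allocation is an assumption of this sketch.
 */
static void init_cqm_attr(struct perf_event_attr *attr,
			  unsigned int pmu_type, unsigned long long config,
			  int non_lazy)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = pmu_type;		/* dynamic PMU type, from the caller */
	attr->config = config;		/* event selector, from the caller */
	attr->pinned = non_lazy ? 1 : 0;
}
```

The filled attr would then be passed to perf_event_open() as usual, and
the open could fail right there if no RMID is available.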