Date: Mon, 3 Jul 2017 11:55:37 +0200 (CEST)
From: Thomas Gleixner <tglx@linutronix.de>
To: Vikas Shivappa <vikas.shivappa@linux.intel.com>
cc: x86@kernel.org, linux-kernel@vger.kernel.org, hpa@zytor.com,
        peterz@infradead.org, ravi.v.shankar@intel.com,
        vikas.shivappa@intel.com, tony.luck@intel.com, fenghua.yu@intel.com,
        andi.kleen@intel.com
Subject: Re: [PATCH 08/21] x86/intel_rdt/cqm: Add RMID(Resource monitoring
 ID) management
In-Reply-To: <alpine.DEB.2.20.1707021119300.2296@nanos>
Message-ID: <alpine.DEB.2.20.1707030954330.2188@nanos>
References: <1498503368-20173-1-git-send-email-vikas.shivappa@linux.intel.com> <1498503368-20173-9-git-send-email-vikas.shivappa@linux.intel.com> <alpine.DEB.2.20.1707021119300.2296@nanos>
User-Agent: Alpine 2.20 (DEB 67 2015-01-07)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4506
Lines: 144

On Sun, 2 Jul 2017, Thomas Gleixner wrote:
> Thinking a bit more about that limbo mechanics.
> 
> In case that a RMID was never used on a particular package, the state check
> forces an IPI on all packages unconditionally. That's suboptimal at least.
> 
> We know on which package a given RMID was used, so we could restrict the
> checks to exactly these packages, but I'm not sure it's worth the
> trouble. We might at least document that and explain why this is
> implemented in that way.

Second thoughts on that. The allocation logic is:

> +       if (list_empty(&rmid_free_lru)) {
> +               ret = try_freeing_limbo_rmid();
> +               if (list_empty(&rmid_free_lru))
> +                       return ret ? -ENOSPC : -EBUSY;
> +       }
> +
> +       entry = list_first_entry(&rmid_free_lru,
> +                                struct rmid_entry, list);
> +       list_del(&entry->list);
> +
> +       return entry->rmid;

That means, the free list is used as the primary source. One of my boxes
has 143 RMIDs. So it only takes 142 mkdir/rmdir invocations to move all
RMIDs to the limbo list. On the next mkdir invocation the allocation goes
into the limbo path and the SMP function call has to walk the list with 142
entries on ALL online domains whether they used the RMID or not!

That's bad enough already and the number of RMIDs will not become smaller;
it doubled from HSW to BDW ...

The HPC and RT folks will love you for that - NOT!

So this needs to be solved differently.

Let's have a look at the context switch path first. That's the most
sensitive part of it.

	if (static_branch_likely(&rdt_mon_enable_key)) {
		if (current->rmid)
			newstate.rmid = current->rmid;
	}

That's optimized for the !monitoring case. So we can really penalize the
per task monitoring case.

	if (static_branch_likely(&rdt_mon_enable_key)) {
		if (unlikely(current->rmid)) {
			newstate.rmid = current->rmid;
			__set_bit(newstate.rmid, this_cpu_ptr(rmid_bitmap));
		}
	}

Now in rmid_free() we can collect that information:

	cpumask_clear(&tmpmask);
	cpumask_clear(rmid_entry->mask);

	cpus_read_lock();
	/* Collect the CPUs on which this RMID was active */
	for_each_online_cpu(cpu) {
		if (test_and_clear_bit(rmid, per_cpu_ptr(rmid_bitmap, cpu)))
			cpumask_set_cpu(cpu, &tmpmask);
	}

	/* Pick one CPU per domain which has used the RMID */
	for_each_domain(d, resource) {
		cpu = cpumask_any_and(&d->cpu_mask, &tmpmask);
		if (cpu < nr_cpu_ids)
			cpumask_set_cpu(cpu, rmid_entry->mask);
	}

	list_add(&rmid_entry->list, &limbo_list);

	for_each_cpu(cpu, rmid_entry->mask)
		schedule_delayed_work_on(cpu, rmid_work);
	cpus_read_unlock();
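
To make the collection step concrete, here is a minimal user space sketch.
It is not kernel code: the toy topology (8 CPUs in 2 domains), the plain
integer bitmasks and the names mark_rmid_used() and collect_limbo_cpus()
are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS     8	/* assumed toy topology */
#define NR_DOMAINS  2	/* two packages with 4 CPUs each */
#define NR_RMIDS    64	/* one uint64_t per CPU is enough */

/* Per CPU "RMID was used here" bitmap, one bit per RMID */
static uint64_t rmid_bitmap[NR_CPUS];

/* CPUs which belong to each domain, one bit per CPU */
static const uint32_t domain_cpus[NR_DOMAINS] = { 0x0f, 0xf0 };

/* Context switch side: remember that this CPU ran with the RMID */
static void mark_rmid_used(int cpu, int rmid)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	assert(rmid >= 0 && rmid < NR_RMIDS);
	rmid_bitmap[cpu] |= UINT64_C(1) << rmid;
}

/*
 * Free side: clear the RMID bit on every CPU and return a mask which
 * holds one representative CPU for each domain that has used the RMID.
 */
static uint32_t collect_limbo_cpus(int rmid)
{
	uint64_t bit = UINT64_C(1) << rmid;
	uint32_t used = 0, mask = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (rmid_bitmap[cpu] & bit) {
			rmid_bitmap[cpu] &= ~bit;
			used |= UINT32_C(1) << cpu;
		}
	}

	for (int d = 0; d < NR_DOMAINS; d++) {
		uint32_t hit = domain_cpus[d] & used;

		/* Isolate the lowest set bit, i.e. the first CPU */
		if (hit)
			mask |= hit & (~hit + 1);
	}
	return mask;
}
```

A RMID which never ran on a domain yields no bit for that domain, so no
work is scheduled there at all.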

The work function:

	bool resched = false;

	list_for_each_entry(rme, limbo_list,...) {
		if (!cpumask_test_cpu(cpu, rme->mask))
			continue;

		if (!rmid_is_reusable(rme)) {
			resched = true;
			continue;
		}

		cpumask_clear_cpu(cpu, rme->mask);
		if (!cpumask_empty(rme->mask))
			continue;

		/* Ready for reuse */
		list_del(&rme->list);
		list_add(&rme->list, &free_list);
	}
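
The same logic as a self-contained user space sketch, under stated
assumptions: the lists are replaced by a state field in a fixed array,
rme->mask is a plain 32 bit mask, and rmid_is_reusable() is a stand-in
which compares a made-up per CPU occupancy value against a made-up
THRESHOLD. The return value says whether the work must be rescheduled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_RMIDS   8
#define MAX_CPUS   32
#define THRESHOLD  100	/* assumed occupancy limit for reuse */

enum rmid_state { RMID_IN_USE, RMID_LIMBO, RMID_FREE };

struct rmid_entry {
	enum rmid_state state;
	uint32_t mask;			/* CPUs which still have to check */
	unsigned int occupancy[MAX_CPUS];	/* stand-in for the HW counter */
};

static struct rmid_entry rmids[NR_RMIDS];

/* Stand-in: residual occupancy as seen from @cpu is below the limit */
static bool rmid_is_reusable(const struct rmid_entry *rme, int cpu)
{
	return rme->occupancy[cpu] < THRESHOLD;
}

/* Returns true when the work has to run again on @cpu */
static bool limbo_work(int cpu)
{
	bool resched = false;
	uint32_t bit;

	assert(cpu >= 0 && cpu < MAX_CPUS);
	bit = UINT32_C(1) << cpu;

	for (int i = 0; i < NR_RMIDS; i++) {
		struct rmid_entry *rme = &rmids[i];

		/* Only limbo entries which wait for this CPU matter */
		if (rme->state != RMID_LIMBO || !(rme->mask & bit))
			continue;

		if (!rmid_is_reusable(rme, cpu)) {
			resched = true;
			continue;
		}

		rme->mask &= ~bit;
		if (!rme->mask)
			rme->state = RMID_FREE;	/* Ready for reuse */
	}
	return resched;
}
```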

The alloc function then becomes:

	if (list_empty(&free_list))
		return list_empty(&limbo_list) ? -ENOSPC : -EBUSY;
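
A tiny user space model of that allocation path, reading -ENOSPC as "all
RMIDs are in use" and -EBUSY as "some RMIDs are in limbo, try again
later". The array based free stack and the names free_rmids, nr_free,
nr_limbo and push_free() are hypothetical and only serve the example.

```c
#include <assert.h>
#include <errno.h>

#define NR_RMIDS 4

static int free_rmids[NR_RMIDS];	/* simple stack of free RMIDs */
static int nr_free;			/* entries on the free stack */
static int nr_limbo;			/* entries waiting in limbo */

/* Put a RMID back on the free stack */
static void push_free(int rmid)
{
	assert(nr_free < NR_RMIDS);
	free_rmids[nr_free++] = rmid;
}

/* No IPI and no list walk: the allocation only looks at the free stack */
static int alloc_rmid(void)
{
	if (!nr_free)
		return nr_limbo ? -EBUSY : -ENOSPC;

	return free_rmids[--nr_free];
}
```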

The switch_to() covers the task rmids. The per cpu default rmids can be
marked at the point where they are installed on a CPU in the per cpu
rmid_bitmap. The free path is the same for per task and per cpu.

Another thing which needs some thought is the CPU hotplug code. We need to
make sure that pending work which is scheduled on an outgoing CPU is moved
in the offline callback to a still online CPU of the same domain and not
moved to some random CPU by the workqueue hotplug code.

There is another subtle issue. Assume a RMID is freed. The limbo stuff is
scheduled on all domains which have online CPUs.

Now the last CPU of a domain goes offline before the threshold for clearing
the domain CPU bit in the rme->mask is reached.

So we have two options here:

   1) Clear the bit unconditionally when the last CPU of a domain goes
      offline.

   2) Arm a timer which clears the bit after a grace period

#1 The RMID might become available for reuse right away because all other
   domains have not used it or have cleared their bits already.
   
   If one of the CPUs of that domain comes online again and is associated
   to that reused RMID again, then the counter content might still contain
   leftovers from the previous usage.

#2 Prevents #1 but has its own issues vs. serialization and coordination
   with CPU hotplug.

I'd say we go for #1 as the simplest solution, document it and revisit it
later if the need really arises.
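
A rough user space sketch of what the offline callback could do for #1,
combined with the handover to a sibling CPU of the same domain mentioned
above. The toy topology, the bitmask representation and the names
rmid_cpu_offline() and first_cpu() are assumptions; the entry passed in
is assumed to sit on the limbo list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS     8
#define NR_DOMAINS  2

/* Assumed toy topology: two domains with 4 CPUs each */
static const uint32_t domain_cpus[NR_DOMAINS] = { 0x0f, 0xf0 };
static uint32_t online_cpus = 0xff;

struct rmid_entry {
	uint32_t mask;	/* CPUs which still have to check this RMID */
	bool free;	/* set once the RMID is ready for reuse */
};

/* Lowest set bit of @m, i.e. the first CPU in the mask */
static uint32_t first_cpu(uint32_t m)
{
	return m & (~m + 1);
}

static void rmid_cpu_offline(struct rmid_entry *rme, int cpu)
{
	uint32_t bit, siblings = 0;

	assert(cpu >= 0 && cpu < NR_CPUS);
	bit = UINT32_C(1) << cpu;
	online_cpus &= ~bit;

	/* Nothing pending on the outgoing CPU for this RMID */
	if (!(rme->mask & bit))
		return;
	rme->mask &= ~bit;

	/* Find the still online CPUs of the outgoing CPU's domain */
	for (int d = 0; d < NR_DOMAINS; d++) {
		if (domain_cpus[d] & bit)
			siblings = domain_cpus[d] & online_cpus;
	}

	if (siblings) {
		/* Hand the pending check over to a CPU of the same domain */
		rme->mask |= first_cpu(siblings);
	} else if (!rme->mask) {
		/* Last CPU of the domain is gone: bit dropped unconditionally */
		rme->free = true;
	}
}
```

The known downside stays as described: if a CPU of that domain comes back
and gets the reused RMID, the counter might still contain leftovers.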

Thanks,

	tglx