Date: Wed, 5 Jul 2017 19:25:07 +0200 (CEST)
From: Thomas Gleixner <tglx@linutronix.de>
To: Peter Zijlstra <peterz@infradead.org>
cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>, x86@kernel.org,
        linux-kernel@vger.kernel.org, hpa@zytor.com, ravi.v.shankar@intel.com,
        vikas.shivappa@intel.com, tony.luck@intel.com, fenghua.yu@intel.com,
        andi.kleen@intel.com
Subject: Re: [PATCH 08/21] x86/intel_rdt/cqm: Add RMID(Resource monitoring
 ID) management
In-Reply-To: <20170705153439.xudhew5wpq3liivf@hirez.programming.kicks-ass.net>
Message-ID: <alpine.DEB.2.20.1707051846020.2019@nanos>
References: <1498503368-20173-1-git-send-email-vikas.shivappa@linux.intel.com> <1498503368-20173-9-git-send-email-vikas.shivappa@linux.intel.com> <alpine.DEB.2.20.1707021119300.2296@nanos> <alpine.DEB.2.20.1707030954330.2188@nanos>
 <20170705153439.xudhew5wpq3liivf@hirez.programming.kicks-ass.net>
User-Agent: Alpine 2.20 (DEB 67 2015-01-07)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2483
Lines: 76

On Wed, 5 Jul 2017, Peter Zijlstra wrote:

> On Mon, Jul 03, 2017 at 11:55:37AM +0200, Thomas Gleixner wrote:
> 
> > 
> > 	if (static_branch_likely(&rdt_mon_enable_key)) {
> > 		if (unlikely(current->rmid)) {
> > 			newstate.rmid = current->rmid;
> > 			__set_bit(newstate.rmid, this_cpu_ptr(rmid_bitmap));
> 
> Non atomic op
> 
> > 		}
> > 	}
> > 
> > Now in rmid_free() we can collect that information:
> > 
> > 	cpumask_clear(&tmpmask);
> > 	cpumask_clear(rmid_entry->mask);
> > 
> > 	cpus_read_lock();
> > 	for_each_online_cpu(cpu) {
> > 		if (test_and_clear_bit(rmid, per_cpu_ptr(rmid_bitmap, cpu)))
> 
> atomic op

Indeed. We need atomic on both sides unfortunately.
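
The reason is that the remote test_and_clear_bit() does a read-modify-write
on the same word the local CPU updates, so a plain __set_bit() racing with it
can lose the set or undo a clear of a neighbouring bit. As a rough userspace
illustration only (C11 atomics standing in for set_bit() and
test_and_clear_bit(); the array sizes and the names NR_CPUS_SKETCH,
rmid_mark_used() and rmid_collect_cpus() are made up for this sketch and are
not from the patch):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH	4	/* hypothetical CPU count for the sketch */
#define RMID_WORDS	4	/* hypothetical bitmap size in words */
#define BITS_PER_WORD	(8 * sizeof(unsigned long))

/* Stand-in for the per-CPU rmid_bitmap; one row per CPU. */
static _Atomic unsigned long rmid_bitmap[NR_CPUS_SKETCH][RMID_WORDS];

/* Context switch side: atomic equivalent of set_bit(). */
static void rmid_mark_used(int cpu, unsigned int rmid)
{
	assert(rmid < RMID_WORDS * BITS_PER_WORD);
	atomic_fetch_or(&rmid_bitmap[cpu][rmid / BITS_PER_WORD],
			1UL << (rmid % BITS_PER_WORD));
}

/* rmid_free() side: atomic equivalent of test_and_clear_bit(). */
static bool rmid_test_and_clear(int cpu, unsigned int rmid)
{
	unsigned long mask = 1UL << (rmid % BITS_PER_WORD);
	unsigned long old;

	assert(rmid < RMID_WORDS * BITS_PER_WORD);
	old = atomic_fetch_and(&rmid_bitmap[cpu][rmid / BITS_PER_WORD], ~mask);
	return (old & mask) != 0;
}

/* Collect the CPUs which used @rmid since the last collection. */
static unsigned long rmid_collect_cpus(unsigned int rmid)
{
	unsigned long cpus = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		if (rmid_test_and_clear(cpu, rmid))
			cpus |= 1UL << cpu;
	}
	return cpus;
}
```

The returned word plays the role of tmpmask in the quoted loop. The cost is
an atomic operation in the context switch path whenever current->rmid is set,
which is the unfortunate part.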

> > 			cpumask_set_cpu(cpu, &tmpmask);
> > 	}
> > 
> > Another thing which needs some thought is the CPU hotplug code. We need to
> > make sure that pending work which is scheduled on an outgoing CPU is moved
> > in the offline callback to a still online CPU of the same domain and not
> > moved to some random CPU by the workqueue hotplug code.
> 
> just flush the workqueue for that CPU? That's what the workqueue core
> _should_ do in any case. And that also covers the case where @cpu is the
> last in the set of CPUs we could run on.

Indeed.

> > There is another subtle issue. Assume an RMID is freed. The limbo stuff is
> > scheduled on all domains which have online CPUs.
> > 
> > Now the last CPU of a domain goes offline before the threshold for clearing
> > the domain CPU bit in the rme->mask is reached.
> > 
> > So we have two options here:
> > 
> >    1) Clear the bit unconditionally when the last CPU of a domain goes
> >       offline.
> 
> Arguably this. This is cache level stuff, that means this is the last
> CPU of a cache, so just explicitly kill the _entire_ cache and insta
> mark everything good again; WBINVD ftw.

Right.
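
So the offline callback would boil down to something like the sketch below.
It is a userspace model only: struct rmid_entry_sketch, struct domain_sketch,
the limbo[] array and domain_cpu_offline() are names invented for the
illustration, the busy_domains word stands in for rme->mask, and the cache
flush is just a comment.

```c
#define NR_RMIDS_SKETCH	8	/* hypothetical number of RMIDs */

/* Stand-in for the rmid entry; bit d set means domain d is still in limbo. */
struct rmid_entry_sketch {
	unsigned long busy_domains;
};

/* Stand-in for a cache domain. */
struct domain_sketch {
	int id;
	unsigned int online_cpus;
};

static struct rmid_entry_sketch limbo[NR_RMIDS_SKETCH];

/*
 * Called when a CPU of @d goes offline. If it was the last one, clear the
 * domain bit of every RMID unconditionally (option #1). Returns the number
 * of RMIDs which became reusable as a result.
 */
static int domain_cpu_offline(struct domain_sketch *d)
{
	unsigned long bit = 1UL << d->id;
	int i, freed = 0;

	if (d->online_cpus == 0 || --d->online_cpus > 0)
		return 0;

	/* Last CPU of the cache: the real code would flush the cache here. */
	for (i = 0; i < NR_RMIDS_SKETCH; i++) {
		if (!(limbo[i].busy_domains & bit))
			continue;
		limbo[i].busy_domains &= ~bit;
		if (!limbo[i].busy_domains)
			freed++;
	}
	return freed;
}
```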

> >    2) Arm a timer which clears the bit after a grace period
> > 
> > #1 The RMID might become available for reuse right away because all other
> >    domains have not used it or have cleared their bits already.
> >    
> >    If one of the CPUs of that domain comes online again and is associated
> >    to that reused RMID again, then the counter content might still contain
> >    leftovers from the previous usage.
> 
> Not if we kill the cache on offline -- also, if all CPUs have been
> offline, it's not too weird to expect something like a package idle state
> to have happened and shot down the caches anyway.

Yes, didn't think about that.

Thanks,

	tglx