Date: Mon, 3 Jul 2017 11:55:37 +0200 (CEST)
From: Thomas Gleixner
To: Vikas Shivappa
cc: x86@kernel.org, linux-kernel@vger.kernel.org, hpa@zytor.com, peterz@infradead.org, ravi.v.shankar@intel.com, vikas.shivappa@intel.com, tony.luck@intel.com, fenghua.yu@intel.com, andi.kleen@intel.com
Subject: Re: [PATCH 08/21] x86/intel_rdt/cqm: Add RMID(Resource monitoring ID) management
References: <1498503368-20173-1-git-send-email-vikas.shivappa@linux.intel.com> <1498503368-20173-9-git-send-email-vikas.shivappa@linux.intel.com>

On Sun, 2 Jul 2017, Thomas Gleixner wrote:
> Thinking a bit more about that limbo mechanics.
>
> In case that a RMID was never used on a particular package, the state check
> forces an IPI on all packages unconditionally. That's suboptimal at least.
>
> We know on which package a given RMID was used, so we could restrict the
> checks to exactly these packages, but I'm not sure it's worth the
> trouble. We might at least document that and explain why this is
> implemented in that way.

Second thoughts on that. The allocation logic is:

> +	if (list_empty(&rmid_free_lru)) {
> +		ret = try_freeing_limbo_rmid();
> +		if (list_empty(&rmid_free_lru))
> +			return ret ? -ENOSPC : -EBUSY;
> +	}
> +
> +	entry = list_first_entry(&rmid_free_lru,
> +				 struct rmid_entry, list);
> +	list_del(&entry->list);
> +
> +	return entry->rmid;

That means the free list is used as the primary source. One of my boxes has
143 RMIDs. So it only takes 142 mkdir/rmdir invocations to move all RMIDs
to the limbo list. On the next mkdir invocation the allocation goes into
the limbo path and the SMP function call has to walk the list with 142
entries on ALL online domains whether they used the RMID or not!

That's bad enough already and the number of RMIDs will not become smaller;
it doubled from HSW to BDW ...

The HPC and RT folks will love you for that - NOT!

So this needs to be solved differently.

Let's have a look at the context switch path first. That's the most
sensitive part of it.

	if (static_branch_likely(&rdt_mon_enable_key)) {
		if (current->rmid)
			newstate.rmid = current->rmid;
	}

That's optimized for the !monitoring case. So we can really penalize the
per task monitoring case.

	if (static_branch_likely(&rdt_mon_enable_key)) {
		if (unlikely(current->rmid)) {
			newstate.rmid = current->rmid;
			/* Record that this CPU has run with the RMID */
			__set_bit(newstate.rmid, this_cpu_ptr(rmid_bitmap));
		}
	}

Now in rmid_free() we can collect that information:

	cpumask_clear(&tmpmask);
	cpumask_clear(rmid_entry->mask);

	cpus_read_lock();
	/* Collect all CPUs on which the RMID was ever installed */
	for_each_online_cpu(cpu) {
		if (test_and_clear_bit(rmid, per_cpu_ptr(rmid_bitmap, cpu)))
			cpumask_set_cpu(cpu, &tmpmask);
	}

	/* Pick one of those CPUs per domain to do the limbo check */
	for_each_domain(d, resource) {
		cpu = cpumask_any_and(d->cpu_mask, &tmpmask);
		if (cpu < nr_cpu_ids)
			cpumask_set_cpu(cpu, rmid_entry->mask);
	}

	list_add(&rmid_entry->list, &limbo_list);

	for_each_cpu(cpu, rmid_entry->mask)
		schedule_delayed_work_on(cpu, rmid_work);
	cpus_read_unlock();

The work function:

	bool resched = false;

	/* _safe variant because entries are moved off the list while walking */
	list_for_each_entry_safe(rme, tmp, &limbo_list, ...) {
		if (!cpumask_test_cpu(cpu, rme->mask))
			continue;

		if (!rmid_is_reusable(rme)) {
			resched = true;
			continue;
		}

		cpumask_clear_cpu(cpu, rme->mask);
		if (!cpumask_empty(rme->mask))
			continue;

		/* Ready for reuse */
		list_del(&rme->list);
		list_add(&rme->list, &free_list);
	}

The alloc function then becomes:

	if (list_empty(&free_list))
		return list_empty(&limbo_list) ? -ENOSPC : -EBUSY;

The switch_to() covers the task rmids. The per cpu default rmids can be
marked at the point where they are installed on a CPU in the per cpu
rmid_bitmap. The free path is the same for per task and per cpu.

Another thing which needs some thought is the CPU hotplug code. We need to
make sure that pending work which is scheduled on an outgoing CPU is moved
in the offline callback to a still online CPU of the same domain and not
moved to some random CPU by the workqueue hotplug code.

There is another subtle issue. Assume a RMID is freed. The limbo stuff is
scheduled on all domains which have online CPUs.

Now the last CPU of a domain goes offline before the threshold for clearing
the domain CPU bit in the rme->mask is reached.

So we have two options here:

 1) Clear the bit unconditionally when the last CPU of a domain goes
    offline.

 2) Arm a timer which clears the bit after a grace period

#1 The RMID might become available for reuse right away because all other
   domains have not used it or have cleared their bits already.

   If one of the CPUs of that domain comes online again and is associated
   to that reused RMID again, then the counter content might still contain
   leftovers from the previous usage.

#2 Prevents #1 but has its own issues vs. serialization and coordination
   with CPU hotplug.

I'd say we go for #1 as the simplest solution, document it, and if the need
really arises, revisit it later.

Thanks,

	tglx
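
For illustration, the per-domain mask bookkeeping proposed above can be
modeled in plain userspace C. This is only a sketch under assumptions, not
kernel code: the fixed topology (4 CPUs, 2 domains), the still_occupied[]
stand-in for the hardware occupancy check, and the helper names
mark_rmid_used() and rmid_limbo_work() are all hypothetical, and plain
integer bitmasks replace the kernel cpumask and bitmap APIs.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS    4	/* hypothetical: two domains with two CPUs each */
#define NR_DOMAINS 2
#define NR_RMIDS   8

/* Per-CPU bitmap of RMIDs which were installed on that CPU. */
static uint32_t rmid_bitmap[NR_CPUS];
/* Per-RMID mask of CPUs (at most one per domain) which still must check. */
static uint32_t limbo_mask[NR_RMIDS];
/* Stand-in for the hardware occupancy check, per domain and RMID. */
static bool still_occupied[NR_DOMAINS][NR_RMIDS];

static int cpu_to_domain(int cpu)
{
	return cpu / (NR_CPUS / NR_DOMAINS);
}

/* switch_to() side: remember that this CPU ran with the RMID. */
static void mark_rmid_used(int cpu, int rmid)
{
	rmid_bitmap[cpu] |= 1u << rmid;
}

/* rmid_free() side: pick one CPU per domain which actually used the RMID. */
static uint32_t rmid_free(int rmid)
{
	bool domain_done[NR_DOMAINS] = { false };
	uint32_t mask = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		bool used = (rmid_bitmap[cpu] & (1u << rmid)) != 0;

		/* Models test_and_clear_bit() on the per-CPU bitmap */
		rmid_bitmap[cpu] &= ~(1u << rmid);
		if (used && !domain_done[cpu_to_domain(cpu)]) {
			domain_done[cpu_to_domain(cpu)] = true;
			mask |= 1u << cpu;
		}
	}
	limbo_mask[rmid] = mask;
	return mask;
}

/* Work function on @cpu: returns true once the RMID is ready for reuse. */
static bool rmid_limbo_work(int cpu, int rmid)
{
	if (!(limbo_mask[rmid] & (1u << cpu)))
		return limbo_mask[rmid] == 0;
	if (still_occupied[cpu_to_domain(cpu)][rmid])
		return false;	/* the real work would reschedule itself */
	limbo_mask[rmid] &= ~(1u << cpu);
	return limbo_mask[rmid] == 0;
}
```

With this model, an RMID which only ever ran on CPU 1 ends up with a limbo
mask containing just CPU 1, so the second domain is never bothered, and an
RMID which was never used anywhere gets an empty mask and is reusable
immediately.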