Subject: Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
From: Alexey Budankov
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin, Andi Kleen,
    Kan Liang, Dmitri Prokhorov, Valery Cherepennikov, Mark Rutland,
    Stephane Eranian, David Carrillo-Cisneros, linux-kernel
Organization: Intel Corp.
Date: Thu, 31 Aug 2017 13:12:44 +0300
References: <96c7776f-1f17-a39e-23e9-658596216d6b@linux.intel.com>
 <20170803130002.oatczvnaalplrsep@hirez.programming.kicks-ass.net>
 <86cbe0b0-a1ec-4d5f-addc-87bccf2e97d7@linux.intel.com>
 <20170804143628.34c2xqxl2e6k2arj@hirez.programming.kicks-ass.net>
 <9d2e25c9-209c-f28a-d601-d3f1a71f032f@linux.intel.com>

On 15.08.2017 20:28, Alexey Budankov wrote:
> Hi Peter,
>
> On 07.08.2017 10:17, Alexey Budankov wrote:
>> On 04.08.2017 17:36, Peter Zijlstra wrote:
>>> On Thu, Aug 03, 2017 at 11:30:09PM +0300, Alexey Budankov wrote:
>>>> On 03.08.2017 16:00, Peter Zijlstra wrote:
>>>>> On Wed, Aug 02, 2017 at 11:13:54AM +0300, Alexey Budankov wrote:
>>>
>>>>>> +/*
>>>>>> + * Find group list by a cpu key and rotate it.
>>>>>> + */
>>>>>> +static void
>>>>>> +perf_event_groups_rotate(struct rb_root *groups, int cpu)
>>>>>> +{
>>>>>> +	struct rb_node *node;
>>>>>> +	struct perf_event *node_event;
>>>>>> +
>>>>>> +	node = groups->rb_node;
>>>>>> +
>>>>>> +	while (node) {
>>>>>> +		node_event = container_of(node,
>>>>>> +				struct perf_event, group_node);
>>>>>> +
>>>>>> +		if (cpu < node_event->cpu) {
>>>>>> +			node = node->rb_left;
>>>>>> +		} else if (cpu > node_event->cpu) {
>>>>>> +			node = node->rb_right;
>>>>>> +		} else {
>>>>>> +			list_rotate_left(&node_event->group_list);
>>>>>> +			break;
>>>>>> +		}
>>>>>> +	}
>>>>>> +}
>>>>>
>>>>> Ah, you worry about how to rotate inside a tree?
>>>>
>>>> Exactly.
>>>>
>>>>> You can do that by adding (run)time based ordering, and you'll end up
>>>>> with a runtime based scheduler.
>>>>
>>>> Do you mean replacing a CPU indexed rb_tree of lists with
>>>> a CPU indexed rb_tree of counter indexed rb_trees?
>>>
>>> No, single tree, just more complicated ordering rules.
>>>
>>>>> A trivial variant keeps a simple counter per tree that is incremented
>>>>> for each rotation. That should end up with the events ordered exactly
>>>>> like the list.
>>>>> And if you have that comparator like above, expressing
>>>>> that additional ordering becomes simple ;-)
>>>>>
>>>>> Something like:
>>>>>
>>>>> struct group {
>>>>> 	u64 vtime;
>>>>> 	rb_tree tree;
>>>>> };
>>>>>
>>>>> bool event_less(left, right)
>>>>> {
>>>>> 	if (left->cpu < right->cpu)
>>>>> 		return true;
>>>>>
>>>>> 	if (left->cpu > right->cpu)
>>>>> 		return false;
>>>>>
>>>>> 	if (left->vtime < right->vtime)
>>>>> 		return true;
>>>>>
>>>>> 	return false;
>>>>> }
>>>>>
>>>>> insert_group(group, event, tail)
>>>>> {
>>>>> 	if (tail)
>>>>> 		event->vtime = ++group->vtime;
>>>>>
>>>>> 	tree_insert(&group->tree, event);
>>>>> }
>>>>>
>>>>> Then every time you use insert_group(.tail=1) it goes to the end of that
>>>>> CPU's 'list'.
>>>>
>>>> Could you elaborate more on how to implement rotation?
>>>
>>> It's almost all there, but let me write a complete replacement for your
>>> perf_event_group_rotate() above.
>>>
>>> /* find the leftmost event matching @cpu */
>>> /* XXX not sure how to best parametrise a subtree search, */
>>> /* again, C sucks... */
>>> struct perf_event *__group_find_cpu(group, cpu)
>>> {
>>> 	struct rb_node *node = group->tree.rb_node;
>>> 	struct perf_event *event, *match = NULL;
>>>
>>> 	while (node) {
>>> 		event = container_of(node, struct perf_event, group_node);
>>>
>>> 		if (cpu > event->cpu) {
>>> 			node = node->rb_right;
>>> 		} else if (cpu < event->cpu) {
>>> 			node = node->rb_left;
>>> 		} else {
>>> 			/*
>>> 			 * subtree match, try left subtree for a
>>> 			 * 'smaller' match.
>>> 			 */
>>> 			match = event;
>>> 			node = node->rb_left;
>>> 		}
>>> 	}
>>>
>>> 	return match;
>>> }
>>>
>>> void perf_event_group_rotate(group, int cpu)
>>> {
>>> 	struct perf_event *event = __group_find_cpu(group, cpu);
>>>
>>> 	if (!event)
>>> 		return;
>>>
>>> 	tree_delete(&group->tree, event);
>>> 	insert_group(group, event, 1);
>>> }
>>>
>>> So we have a tree ordered by {cpu,vtime} and what we do is find the
>>> leftmost {cpu} entry, that is the smallest vtime entry for that cpu. We
>>> then take it out and re-insert it with a vtime number larger than any
>>> other, which places it as the rightmost entry for that cpu.
>>>
>>> So given:
>>>
>>>            {1,1}
>>>           /     \
>>>       {0,5}     {1,2}
>>>      /     \        \
>>>   {0,1}   {0,6}    {1,4}
>>>
>>> __group_find_cpu(.cpu=1) will return {1,1} as being the leftmost entry
>>> with cpu=1. We'll then remove it, update its vtime to 7 and re-insert,
>>> resulting in something like:
>>>
>>>            {1,2}
>>>           /     \
>>>       {0,5}     {1,4}
>>>      /     \        \
>>>   {0,1}   {0,6}    {1,7}
>>>
>>
>> Makes sense. The implementation becomes a bit simpler. The drawback
>> may be several rotations of a potentially big tree on the critical path,
>> instead of updating four pointers as in the case of the tree of lists.
>
> I implemented the approach you had suggested (as I understood it),
> tested it and got results that are drastically different from what
> I am getting for the tree of lists. Specifically I did the following:
>
> 1. kept all groups in the same single tree by employing a 64-bit index
>    in addition to the CPU key;
>
> 2. implemented a special _less() function and rotation by re-inserting
>    the group with an incremented index;
>
> 3. replaced the API that takes a callback in the signature with a
>    perf_event_groups_for_each() macro.
>
> Employing all that shrunk the total patch size, however I am still
> struggling with correctness issues.
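
To make the scheme above concrete, here is a minimal sketch (my own
illustration, not the code from the patches) of the {cpu,index} ordering
and rotate-by-reinsert idea on top of the generic <linux/rbtree.h> API;
all of the example_* names and fields are made up for the illustration:

#include <linux/rbtree.h>
#include <linux/types.h>

/* Illustrative stand-in; the real struct perf_event has many more fields. */
struct example_event {
	int			cpu;
	u64			group_index;
	struct rb_node		group_node;
};

struct example_groups {
	struct rb_root		tree;
	u64			index;	/* grows on every insertion */
};

static bool
example_groups_less(struct example_event *left, struct example_event *right)
{
	if (left->cpu != right->cpu)
		return left->cpu < right->cpu;

	return left->group_index < right->group_index;
}

/*
 * Insert @event into the {cpu,index} ordered tree. Giving it the largest
 * index so far places it rightmost among the events of the same cpu,
 * i.e. at the tail of that cpu's logical 'list'.
 */
static void
example_groups_insert(struct example_groups *groups,
		      struct example_event *event)
{
	struct rb_node **node = &groups->tree.rb_node;
	struct rb_node *parent = NULL;
	struct example_event *pos;

	event->group_index = ++groups->index;

	while (*node) {
		parent = *node;
		pos = rb_entry(*node, struct example_event, group_node);

		if (example_groups_less(event, pos))
			node = &parent->rb_left;
		else
			node = &parent->rb_right;
	}

	rb_link_node(&event->group_node, parent, node);
	rb_insert_color(&event->group_node, &groups->tree);
}

/*
 * Rotation: @event is the leftmost event of its cpu (found e.g. as in
 * __group_find_cpu() above); erasing and re-inserting it moves it to the
 * tail, just like list_rotate_left() did for the per-cpu list variant.
 */
static void
example_groups_rotate(struct example_groups *groups,
		      struct example_event *event)
{
	rb_erase(&event->group_node, &groups->tree);
	example_groups_insert(groups, event);
}

The re-inserted event always compares greater than every other event of
the same cpu, so the erase/insert pair is the tree equivalent of rotating
the per-cpu list by one position.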
>
> Now I figured that not all indexed events are always located under
> the root with the same cpu, and that depends on the order of insertion,
> e.g. with insertion order 01, 02, 03, 14, 15, 16 we get this:
>
>         02
>        /  \
>      01    14
>           /  \
>         03    15
>                 \
>                  16
>
> and it is unclear how to iterate the cpu==0 part of the tree in this case.
>
> Iterating the cpu specific subtree like this:
>
> #define for_each_group_event(event, group, cpu, pmu, field)	 \
> 	for (event = rb_entry_safe(group_first(group, cpu, pmu), \
> 				   typeof(*event), field);	 \
> 	     event && event->cpu == cpu && event->pmu == pmu;	 \
> 	     event = rb_entry_safe(rb_next(&event->field),	 \
> 				   typeof(*event), field))
>
> misses event==03 for the case above and I guess this is where I lose
> samples in my testing.

I eventually managed to overcome the difficulties with the implementation
of the rb_tree indexed by {cpu,index} for event groups, so please see the
v9 patches.

> Please advise how to proceed.
>
> Thanks,
> Alexey
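
For completeness, a sketch of how the cpu-specific part of such a
{cpu,index} tree can be iterated without missing entries like 03 in the
example above: the same-cpu events need not share one subtree, but they
are adjacent in the in-order traversal, so starting from the leftmost
match and following rb_next() visits them all. Again, the example_* names
are made up for illustration and are not the code from the v9 patches:

#include <linux/rbtree.h>
#include <linux/types.h>

/* Same illustrative event type as in the earlier sketch. */
struct example_event {
	int			cpu;
	u64			group_index;
	struct rb_node		group_node;
};

/*
 * Leftmost event with the given @cpu, or NULL. On a key match keep
 * descending to the left looking for an even 'smaller' match.
 */
static struct example_event *
example_groups_first(struct rb_root *tree, int cpu)
{
	struct rb_node *node = tree->rb_node;
	struct example_event *event, *match = NULL;

	while (node) {
		event = rb_entry(node, struct example_event, group_node);

		if (cpu < event->cpu) {
			node = node->rb_left;
		} else if (cpu > event->cpu) {
			node = node->rb_right;
		} else {
			match = event;
			node = node->rb_left;
		}
	}

	return match;
}

/*
 * In-order traversal yields events sorted by {cpu, group_index}, so all
 * events of one cpu are visited consecutively: in the 01..16 example the
 * walk from 01 goes 01, 02, 03 and stops at 14.
 */
#define example_for_each_cpu_event(event, tree, cpu)			\
	for (event = example_groups_first(tree, cpu);			\
	     event && event->cpu == (cpu);				\
	     event = rb_entry_safe(rb_next(&event->group_node),	\
				   struct example_event, group_node))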