Subject: Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups
From: Alexey Budankov
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin, Andi Kleen,
    Kan Liang, Dmitri Prokhorov, Valery Cherepennikov, Mark Rutland,
    Stephane Eranian, David Carrillo-Cisneros, linux-kernel
Organization: Intel Corp.
Date: Thu, 31 Aug 2017 13:12:44 +0300
References: <96c7776f-1f17-a39e-23e9-658596216d6b@linux.intel.com>
 <20170803130002.oatczvnaalplrsep@hirez.programming.kicks-ass.net>
 <86cbe0b0-a1ec-4d5f-addc-87bccf2e97d7@linux.intel.com>
 <20170804143628.34c2xqxl2e6k2arj@hirez.programming.kicks-ass.net>
 <9d2e25c9-209c-f28a-d601-d3f1a71f032f@linux.intel.com>

On 15.08.2017 20:28, Alexey Budankov wrote:
> Hi Peter,
>
> On 07.08.2017 10:17, Alexey Budankov wrote:
>> On 04.08.2017 17:36, Peter Zijlstra wrote:
>>> On Thu, Aug 03, 2017 at 11:30:09PM +0300, Alexey Budankov wrote:
>>>> On 03.08.2017 16:00, Peter Zijlstra wrote:
>>>>> On Wed, Aug 02, 2017 at 11:13:54AM +0300, Alexey Budankov wrote:
>>>
>>>>>> +/*
>>>>>> + * Find group list by a cpu key and rotate it.
>>>>>> + */
>>>>>> +static void
>>>>>> +perf_event_groups_rotate(struct rb_root *groups, int cpu)
>>>>>> +{
>>>>>> +	struct rb_node *node;
>>>>>> +	struct perf_event *node_event;
>>>>>> +
>>>>>> +	node = groups->rb_node;
>>>>>> +
>>>>>> +	while (node) {
>>>>>> +		node_event = container_of(node,
>>>>>> +				struct perf_event, group_node);
>>>>>> +
>>>>>> +		if (cpu < node_event->cpu) {
>>>>>> +			node = node->rb_left;
>>>>>> +		} else if (cpu > node_event->cpu) {
>>>>>> +			node = node->rb_right;
>>>>>> +		} else {
>>>>>> +			list_rotate_left(&node_event->group_list);
>>>>>> +			break;
>>>>>> +		}
>>>>>> +	}
>>>>>> +}
>>>>>
>>>>> Ah, you worry about how to rotate inside a tree?
>>>>
>>>> Exactly.
>>>>
>>>>> You can do that by adding (run)time based ordering, and you'll end up
>>>>> with a runtime based scheduler.
>>>>
>>>> Do you mean replacing a CPU indexed rb_tree of lists with
>>>> a CPU indexed rb_tree of counter indexed rb_trees?
>>>
>>> No, single tree, just more complicated ordering rules.
>>>
>>>>> A trivial variant keeps a simple counter per tree that is incremented
>>>>> for each rotation. That should end up with the events ordered exactly
>>>>> like the list.
>>>>> And if you have that comparator like above, expressing
>>>>> that additional ordering becomes simple ;-)
>>>>>
>>>>> Something like:
>>>>>
>>>>> struct group {
>>>>> 	u64 vtime;
>>>>> 	rb_tree tree;
>>>>> };
>>>>>
>>>>> bool event_less(left, right)
>>>>> {
>>>>> 	if (left->cpu < right->cpu)
>>>>> 		return true;
>>>>>
>>>>> 	if (left->cpu > right->cpu)
>>>>> 		return false;
>>>>>
>>>>> 	if (left->vtime < right->vtime)
>>>>> 		return true;
>>>>>
>>>>> 	return false;
>>>>> }
>>>>>
>>>>> insert_group(group, event, tail)
>>>>> {
>>>>> 	if (tail)
>>>>> 		event->vtime = ++group->vtime;
>>>>>
>>>>> 	tree_insert(&group->tree, event);
>>>>> }
>>>>>
>>>>> Then every time you use insert_group(.tail=1) it goes to the end of that
>>>>> CPU's 'list'.
>>>>
>>>> Could you elaborate more on how to implement rotation?
>>>
>>> It's almost all there, but let me write a complete replacement for your
>>> perf_event_group_rotate() above.
>>>
>>> /* find the leftmost event matching @cpu */
>>> /* XXX not sure how to best parametrise a subtree search, */
>>> /* again, C sucks... */
>>> struct perf_event *__group_find_cpu(group, cpu)
>>> {
>>> 	struct rb_node *node = group->tree.rb_node;
>>> 	struct perf_event *event, *match = NULL;
>>>
>>> 	while (node) {
>>> 		event = container_of(node, struct perf_event, group_node);
>>>
>>> 		if (cpu > event->cpu) {
>>> 			node = node->rb_right;
>>> 		} else if (cpu < event->cpu) {
>>> 			node = node->rb_left;
>>> 		} else {
>>> 			/*
>>> 			 * subtree match, try left subtree for a
>>> 			 * 'smaller' match.
>>> 			 */
>>> 			match = event;
>>> 			node = node->rb_left;
>>> 		}
>>> 	}
>>>
>>> 	return match;
>>> }
>>>
>>> void perf_event_group_rotate(group, int cpu)
>>> {
>>> 	struct perf_event *event = __group_find_cpu(group, cpu);
>>>
>>> 	if (!event)
>>> 		return;
>>>
>>> 	tree_delete(&group->tree, event);
>>> 	insert_group(group, event, 1);
>>> }
>>>
>>> So we have a tree ordered by {cpu,vtime} and what we do is find the
>>> leftmost {cpu} entry, that is the smallest vtime entry for that cpu. We
>>> then take it out and re-insert it with a vtime number larger than any
>>> other, which places it as the rightmost entry for that cpu.
>>>
>>> So given:
>>>
>>>            {1,1}
>>>           /     \
>>>       {0,5}     {1,2}
>>>      /     \        \
>>>   {0,1}   {0,6}    {1,4}
>>>
>>> __group_find_cpu(.cpu=1) will return {1,1} as being the leftmost entry
>>> with cpu=1. We'll then remove it, update its vtime to 7 and re-insert,
>>> resulting in something like:
>>>
>>>            {1,2}
>>>           /     \
>>>       {0,5}     {1,4}
>>>      /     \        \
>>>   {0,1}   {0,6}    {1,7}
>>>
>>
>> Makes sense. The implementation becomes a bit simpler. The drawback
>> may be several rotations of a potentially big tree on the critical path,
>> instead of updating four pointers as in the case of the tree of lists.
>
> I implemented the approach you had suggested (as I understood it),
> tested it and got results that are drastically different from what
> I am getting for the tree of lists. Specifically I did the following:
>
> 1. kept all groups in the same single tree by employing a 64-bit index
>    in addition to the CPU key;
>
> 2. implemented a special _less() function and rotation by re-inserting
>    the group with an incremented index;
>
> 3. replaced the API that takes a callback in the signature with a
>    perf_event_groups_for_each() macro.
>
> Employing all that shrunk the total patch size, however I am still
> struggling with correctness issues.
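
To make the scheme above concrete, here is a minimal sketch (my own
illustration, not the code from the patches) of the {cpu,index} ordering
and rotate-by-reinsert idea on top of the generic <linux/rbtree.h> API;
all of the example_* names and fields are made up for the illustration:

#include <linux/rbtree.h>
#include <linux/types.h>

/* Illustrative stand-in; the real struct perf_event has many more fields. */
struct example_event {
	int			cpu;
	u64			group_index;
	struct rb_node		group_node;
};

struct example_groups {
	struct rb_root		tree;
	u64			index;	/* grows on every insertion */
};

static bool
example_groups_less(struct example_event *left, struct example_event *right)
{
	if (left->cpu != right->cpu)
		return left->cpu < right->cpu;

	return left->group_index < right->group_index;
}

/*
 * Insert @event into the {cpu,index} ordered tree. Giving it the largest
 * index so far places it rightmost among the events of the same cpu,
 * i.e. at the tail of that cpu's logical 'list'.
 */
static void
example_groups_insert(struct example_groups *groups,
		      struct example_event *event)
{
	struct rb_node **node = &groups->tree.rb_node;
	struct rb_node *parent = NULL;
	struct example_event *pos;

	event->group_index = ++groups->index;

	while (*node) {
		parent = *node;
		pos = rb_entry(*node, struct example_event, group_node);

		if (example_groups_less(event, pos))
			node = &parent->rb_left;
		else
			node = &parent->rb_right;
	}

	rb_link_node(&event->group_node, parent, node);
	rb_insert_color(&event->group_node, &groups->tree);
}

/*
 * Rotation: @event is the leftmost event of its cpu (found e.g. as in
 * __group_find_cpu() above); erasing and re-inserting it moves it to the
 * tail, just like list_rotate_left() did for the per-cpu list variant.
 */
static void
example_groups_rotate(struct example_groups *groups,
		      struct example_event *event)
{
	rb_erase(&event->group_node, &groups->tree);
	example_groups_insert(groups, event);
}

The re-inserted event always compares greater than every other event of
the same cpu, so the erase/insert pair is the tree equivalent of rotating
the per-cpu list by one position.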
>
> Now I figured that not all indexed events are always located under
> the root with the same cpu, and that depends on the order of insertion,
> e.g. with insertion order 01, 02, 03, 14, 15, 16 we get this:
>
>         02
>        /  \
>      01    14
>           /  \
>         03    15
>                 \
>                  16
>
> and it is unclear how to iterate the cpu==0 part of the tree in this case.
>
> Iterating the cpu specific subtree like this:
>
> #define for_each_group_event(event, group, cpu, pmu, field)	 \
> 	for (event = rb_entry_safe(group_first(group, cpu, pmu), \
> 				   typeof(*event), field);	 \
> 	     event && event->cpu == cpu && event->pmu == pmu;	 \
> 	     event = rb_entry_safe(rb_next(&event->field),	 \
> 				   typeof(*event), field))
>
> misses event==03 for the case above and I guess this is where I lose
> samples in my testing.

I eventually managed to overcome the difficulties with the implementation
of the rb_tree indexed by {cpu,index} for event groups, so please see the
v9 patches.

> Please advise how to proceed.
>
> Thanks,
> Alexey
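
For completeness, a sketch of how the cpu-specific part of such a
{cpu,index} tree can be iterated without missing entries like 03 in the
example above: the same-cpu events need not share one subtree, but they
are adjacent in the in-order traversal, so starting from the leftmost
match and following rb_next() visits them all. Again, the example_* names
are made up for illustration and are not the code from the v9 patches:

#include <linux/rbtree.h>
#include <linux/types.h>

/* Same illustrative event type as in the earlier sketch. */
struct example_event {
	int			cpu;
	u64			group_index;
	struct rb_node		group_node;
};

/*
 * Leftmost event with the given @cpu, or NULL. On a key match keep
 * descending to the left looking for an even 'smaller' match.
 */
static struct example_event *
example_groups_first(struct rb_root *tree, int cpu)
{
	struct rb_node *node = tree->rb_node;
	struct example_event *event, *match = NULL;

	while (node) {
		event = rb_entry(node, struct example_event, group_node);

		if (cpu < event->cpu) {
			node = node->rb_left;
		} else if (cpu > event->cpu) {
			node = node->rb_right;
		} else {
			match = event;
			node = node->rb_left;
		}
	}

	return match;
}

/*
 * In-order traversal yields events sorted by {cpu, group_index}, so all
 * events of one cpu are visited consecutively: in the 01..16 example the
 * walk from 01 goes 01, 02, 03 and stops at 14.
 */
#define example_for_each_cpu_event(event, tree, cpu)			\
	for (event = example_groups_first(tree, cpu);			\
	     event && event->cpu == (cpu);				\
	     event = rb_entry_safe(rb_next(&event->group_node),	\
				   struct example_event, group_node))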