Date: Thu, 7 Apr 2011 14:06:11 +0200
From: Frederic Weisbecker
To: Vaibhav Nagarnaik
Cc: Paul Menage, Li Zefan, Stephane Eranian, Andrew Morton, Steven Rostedt,
    David Sharp, Michael Rubin, Ken Chen, linux-kernel@vger.kernel.org,
    containers@lists.linux-foundation.org
Subject: Re: [RFC] tracing: Adding cgroup aware tracing functionality
Message-ID: <20110407120608.GB1798@nowhere>
References: <20110407013349.GH1867@nowhere>

On Wed, Apr 06, 2011 at 08:17:33PM -0700, Vaibhav Nagarnaik wrote:
> On Wed, Apr 6, 2011 at 6:33 PM, Frederic Weisbecker wrote:
> > On Wed, Apr 06, 2011 at 11:50:21AM -0700, Vaibhav Nagarnaik wrote:
> >> All
> >> The cgroup functionality is being used widely in different scenarios. It also
> >> is being integrated with other parts of kernel to take advantage of its
> >> features. One of the areas that is not yet aware of cgroup functionality is
> >> the ftrace framework.
> >>
> >> Although ftrace provides a way to filter based on PIDs of tasks to be traced,
> >> it is restricted to specific tracers, like function tracer. Also it becomes
> >> difficult to keep track of all PIDs in a dynamic environment with processes
> >> being created and destroyed in a short amount of time.
> >>
> >> An application that creates many processes/tasks is convenient to track and
> >> control with cgroups, but it is difficult to track these processes for the
> >> purposes of tracing. And if child processes are moved to another cgroup, it
> >> makes sense to trace only the original cgroup.
> >>
> >> This proposal is to create a file in the tracing directory called
> >> set_trace_cgroup to which a user can write the path of an active cgroup, one
> >> at a time. If no cgroups are specified, no filtering is done and all tasks are
> >> traced. When a cgroup path is added in, it sets a boolean tracing_enabled for
> >> the enabled cgroup in all the hierarchies, which enables tracing for all the
> >> assigned tasks under the specified cgroup.
> >>
> >> Though creating a new file in the directory is not desirable, but this
> >> interface seems the most appropriate change required to implement the new
> >> feature.
> >>
> >> This tracing_enabled flag is also exported in the cgroupfs directory structure
> >> which can be turned on/off for a specific hierarchy/cgroup combination. This
> >> gives control to enable/disable tracing over a cgroup in a specific hierarchy
> >> only.
> >>
> >> This gives more fine-grained control over the tasks being traced. I would like
> >> to know your thoughts on this interface and the approach to make tracing
> >> cgroup aware.
> >
> > So I have to ask, why can't you use perf events to do tracing limited on cgroups?
> > It has this cgroup context awareness.
> >
>
> The perf event cgroup awareness comes from creating a different hierarchy for
> perf events. When the events and the current task's cgroup match, the events
> are logged.
> So the changes are pretty specific to the perf events.
>
> Even in the case where changes are made to handle trace events, the interface
> files are still needed. The interface used to specify perf events uses the
> perf_event syscall which isn't available to specify trace events.
>
> This is based on my limited understanding of the perf_events cgroup awareness
> patch. Please correct me if I am missing anything.

Ah but perf events can do much more than counting and sampling hardware events.
Trace events can be used as perf events too.

List the events:

    perf list -e tracepoints

    List of pre-defined events (to be used in -e):

      skb:kfree_skb                              [Tracepoint event]
      skb:consume_skb                            [Tracepoint event]
      skb:skb_copy_datagram_iovec                [Tracepoint event]
      net:net_dev_xmit                           [Tracepoint event]
      net:net_dev_queue                          [Tracepoint event]
      net:netif_receive_skb                      [Tracepoint event]
      net:netif_rx                               [Tracepoint event]
      napi:napi_poll                             [Tracepoint event]
      scsi:scsi_dispatch_cmd_start               [Tracepoint event]
      scsi:scsi_dispatch_cmd_error               [Tracepoint event]
      scsi:scsi_dispatch_cmd_done                [Tracepoint event]
      scsi:scsi_dispatch_cmd_timeout             [Tracepoint event]
      scsi:scsi_eh_wakeup                        [Tracepoint event]
      drm:drm_vblank_event                       [Tracepoint event]
      drm:drm_vblank_event_queued                [Tracepoint event]
      drm:drm_vblank_event_delivered             [Tracepoint event]
      block:block_rq_abort                       [Tracepoint event]
      block:block_rq_requeue                     [Tracepoint event]
      block:block_rq_complete                    [Tracepoint event]
      block:block_rq_insert                      [Tracepoint event]
    etc...

Trace sched switch events:

    perf record -e sched:sched_switch -a
    ^C

Print them:

    perf script

        swapper     0 [000]  1132.964598: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm
    kworker/0:1  4358 [000]  1132.964641: sched_switch: prev_comm=kworker/0:1 prev_pid=4358 prev_prio=120 prev_state=S ==> ne
        syslogd  2703 [000]  1132.964720: sched_switch: prev_comm=syslogd prev_pid=2703 prev_prio=120 prev_state=D ==> next_c
        swapper     0 [000]  1132.965100: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm
           perf  4725 [001]  1132.965178: sched_switch: prev_comm=perf prev_pid=4725 prev_prio=120 prev_state=D ==> next_comm
        swapper     0 [001]  1132.965227: sched_switch: prev_comm=kworker/0:0 prev_pid=0 prev_prio=120 prev_state=R ==> next_
           perf  4725 [001]  1132.965246: sched_switch: prev_comm=perf prev_pid=4725 prev_prio=120 prev_state=D ==> next_comm
    etc...
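
To limit such a recording to the tasks of one cgroup, perf record has a
-G/--cgroup option that works in system-wide (-a) mode. What follows is only a
minimal sketch, assuming perf was built with cgroup support and the perf_event
cgroup hierarchy is mounted. The cgroup name "mygroup" and the helper
perf_cgroup_cmd are hypothetical and were not part of this thread. The script
only prints the command unless RUN=1 is set:

```shell
#!/bin/sh
# Sketch: record one tracepoint on all CPUs, restricted to a single cgroup.
# Assumptions (not from the thread): perf has cgroup support, the perf_event
# cgroup hierarchy is mounted, and "mygroup" is a hypothetical cgroup name.

# Print the perf command line for a given tracepoint and cgroup.
perf_cgroup_cmd() {
    event=$1
    cgroup=$2
    # -a: all CPUs (cgroup filtering is per-cpu only); -G: limit to this cgroup
    printf 'perf record -e %s -a -G %s\n' "$event" "$cgroup"
}

cmd=$(perf_cgroup_cmd sched:sched_switch mygroup)

# Dry run by default; set RUN=1 to really execute it (usually needs root).
if [ "${RUN:-0}" = 1 ]; then
    $cmd
else
    echo "$cmd"
fi
```

The recorded data can then be printed with perf script as above.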