2013-08-19 09:42:06

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: [RFC PATCH 00/11] trace-cmd: Support the feature recording trace data of guests on the host

Hi Steven,

I'm considering "Integrated trace" which is a trace merging system for a
virtualization environment. Why do we need this system? Because we want to
analyze latency problems for a virtualization environment. For example, a host
OS runs two guest OSs and those OSs are sharing HW devices like CPUs, NIC, disk,
etc. In the situation, it will be difficult to directly tackle I/O delay
problems. This is because we don't have any methods which all trace data are
shown. The integrated trace will solve this problem by merging trace data in
chronological order. Moreover, the integrated trace will support the feature
collecting trace data of guests and a host on the host.

I want to support above two big features in trace-cmd. However, trace-cmd does
not have following features yet:

a) server and client for a virtualization environment
b) structured message platform between guests and a host
c) agent feature of a client
d) merge feature of trace data of multiple guests and a host in chronological
order

This patch set supports above a) and b) features.

<overall view>

+------------+ +------------+
Guest | a), c) | | a), c) | client/agent
^ +------------+ +------------+
| ^ ^ ^ ^
============|===|=================|===|===========
| v b)v v b)v
v +----------------------------------+
Host | a) | server
+----------------------------------+
||output || ||
\/ \/ \/
/--------+ /--------+ /--------+
| 010101 | | 101010 | | 100101 | binary data
| 010100 | | 010100 | | 110011 |
+--------+ +--------+ +--------+
\ /
\-----------------------------------/
|| d)
\/
/-----------------------------------+
| (guest1) 123456: sched_switch... | text data
| (guest2) 123458: kmem_free... |
| (host) 123500: kvm_exit (guest1)|
| (host) 123510: kvm_entry(guest1)|
| (guest1) 123550: sched_switch... |
+-----------------------------------+

a) server and client for a virtualization environment
trace-cmd has listen mode for network, but using network will be a high cost
operation for inducing a lot of memory copying. From kernel-3.6, the
virtio-console driver supports splice_write and ftrace supports "steal" for
fops. So, guest clients of trace-cmd can send trace data without copying memory
by using splice(2). If guest clients use virtio-serial, the server also needs to
support virtio-serial I/F.

b) structured message platform between guests and a host
Currently, a server(clients) sends unstructured character string to
clients(server), so clients(server) must parse the unstructured messages.
Since it is hard to add complex contents in the protocol, structured binary
message trace-msg is introduced as the communication protocol.

c) agent feature of a client
Current trace-cmd client can operate only as "record" mode, so the client
will send trace data to the server immediately. However, when an user tries to
collect trace data of multiple guests on a host, the user must log in to
each guest. This is hard to use, I think. So, trace-cmd client had better
support agent mode which receives a message from the server.

d) merge feature of trace data of multiple guests and a host in chronological
order
Current trace-cmd cannot merge trace data of multiple guests and a host in
chronological order. If an user wants to analyze an I/O delay problem of a
guest, the user will want to check trace data of all guests and the host in a
file. However, trace-cmd does not support a merge feature yet, the user must
make a merging script. So, trace-cmd had better support a merge feature for
multiple files for virtualization.

For a), this patch introduces "virt-server" and "record --virt" modes for
achieving low-overhead communication of trace data of guests. "virt-server" is a
server mode for collecting trace data of guests. On the other hand,
"record --virt" mode is a guest client for sending trace data of the guest.
Although these functions are similar to "listen" and "record -N" modes each,
these do not use network but use virtio-serial for low-overhead communication.

For b), this patch series introduce specific message protocol in order to handle
communication messages with 8 commands. When we extend any messages, using
structured message will be easier than using unstructured message.

<How to use>
1. Run virt-server on a host
# trace-cmd virt-server

2. Make guest domain directory
# mkdir -p /tmp/trace-cmd/virt/<domain>
# chmod 710 /tmp/trace-cmd/virt/<domain>
# chgrp qemu /tmp/trace-cmd/virt/<domain>

3. Make FIFO on the host
# mkfifo /tmp/trace-cmd/virt/<domain>/trace-path-cpu{0,1,...,X}.{in,out}

4. Set up of virtio-serial pipe of a guest on the host
Add the following tags to domain XML files.
# virsh edit <domain>
<channel type='unix'>
<source mode='connect' path='/tmp/trace-cmd/virt/agent-ctl-path'/>
<target type='virtio' name='agent-ctl-path'/>
</channel>
<channel type='pipe'>
<source path='/tmp/trace-cmd/virt/<domain>/trace-path-cpu0'/>
<target type='virtio' name='trace-path-cpu0'/>
</channel>
... (cpu1, cpu2, ...)

5. Boot the guest
# virsh start <domain>

6. Execute "record --virt" on the guest
# trace-cmd record --virt -e sched*

<Result>
I measured CPU usage outputted by top command on a guest when client sends
trace data. Client means "record -N"(NW) or "record --virt"(virtio-serial).

NW virtio-serial(splice)
client(fedora19) ~2.9[%] ~1.7[%]

<Future work>
- Add an agent mode based on "record --virt"
- Add a merging feature of trace data of guests and host to "report"

I need your comments!
Thank you,

---

Yoshihiro YUNOMAE (11):
[TRIVIAL] trace-cmd: Delete the variable iface in trace-listen
[BUGFIX] trace-cmd: Add waitpid() when recorders are destoried
[BUGFIX]trace-cmd: Quit from splice(read) if there are no data
[CLEANUP] trace-cmd: Split out the communication with listener from setup_network()
[CLEANUP] trace-cmd: Split out the connect waiting loop from do_listen()
[CLEANUP] trace-cmd: Split out the communication with client from process_client()
[CLEANUP] trace-cmd: Split out binding a port and fork reader from open_udp()
trace-cmd: Apply the trace-msg protocol for communication between a server and clients
trace-cmd: Use poll(2) to wait for a message
trace-cmd: Add virt-server mode for a virtualization environment
trace-cmd: Add --virt option for record mode


Documentation/trace-cmd-record.1.txt | 11
Documentation/trace-cmd-virt-server.1.txt | 89 +++
Makefile | 2
trace-cmd.c | 3
trace-cmd.h | 15 +
trace-listen.c | 690 ++++++++++++++++-------
trace-msg.c | 864 +++++++++++++++++++++++++++++
trace-msg.h | 25 +
trace-output.c | 4
trace-record.c | 138 ++---
trace-recorder.c | 59 +-
trace-usage.c | 10
12 files changed, 1602 insertions(+), 308 deletions(-)
create mode 100644 Documentation/trace-cmd-virt-server.1.txt
create mode 100644 trace-msg.c
create mode 100644 trace-msg.h

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]


2013-08-19 09:42:10

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: [RFC PATCH 01/11] [TRIVIAL] trace-cmd: Delete the variable iface in trace-listen

iface in trace-listen is not used any more, so it is deleted.

Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
---
trace-listen.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/trace-listen.c b/trace-listen.c
index 8503b50..dec1c00 100644
--- a/trace-listen.c
+++ b/trace-listen.c
@@ -653,7 +653,6 @@ void trace_listen(int argc, char **argv)
{
char *logfile = NULL;
char *port = NULL;
- char *iface;
int daemon = 0;
int c;

@@ -672,7 +671,7 @@ void trace_listen(int argc, char **argv)
{NULL, 0, NULL, 0}
};

- c = getopt_long (argc-1, argv+1, "+hp:o:d:i:l:D",
+ c = getopt_long (argc-1, argv+1, "+hp:o:d:l:D",
long_options, &option_index);
if (c == -1)
break;
@@ -683,9 +682,6 @@ void trace_listen(int argc, char **argv)
case 'p':
port = optarg;
break;
- case 'i':
- iface = optarg;
- break;
case 'd':
output_dir = optarg;
break;

2013-08-19 09:42:15

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: [RFC PATCH 03/11] [BUGFIX]trace-cmd: Quit from splice(read) if there are no data

Quit from splice(read) for avoiding to do splice(write) if there are no data.
When splice(read) returns negative or 0, that means no data. So, the recorder
does not need to do splice(write).

Note:
This patch is related to https://lkml.org/lkml/2013/7/21/200. If the patch of
the link is not applied, splice(write) for virtio-serial induces kernel oops
when there are no data. This patch always avoids to do splice(write) in the
case, so this patch is needed.

Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
---
trace-recorder.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/trace-recorder.c b/trace-recorder.c
index 7db5f3e..520d486 100644
--- a/trace-recorder.c
+++ b/trace-recorder.c
@@ -337,7 +337,10 @@ static long splice_data(struct tracecmd_recorder *recorder)
warning("recorder error in splice input");
return -1;
}
- }
+ if (errno == EINTR)
+ return 0;
+ } else if (ret == 0)
+ return 0;

ret = splice(recorder->brass[0], NULL, recorder->fd, NULL,
recorder->page_size, 3 /* and NON_BLOCK */);

2013-08-19 09:42:26

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: [RFC PATCH 07/11] [CLEANUP] trace-cmd: Split out binding a port and fork reader from open_udp()

Split out binding a port and fork reader from open_udp() for avoiding duplicate
codes between listen mode and virt-server mode.

Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
---
trace-listen.c | 34 ++++++++++++++++++++++++++--------
1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/trace-listen.c b/trace-listen.c
index f29dd35..bf9ef9d 100644
--- a/trace-listen.c
+++ b/trace-listen.c
@@ -228,13 +228,12 @@ static void process_udp_child(int sfd, const char *host, const char *port,
#define START_PORT_SEARCH 1500
#define MAX_PORT_SEARCH 6000

-static int open_udp(const char *node, const char *port, int *pid,
- int cpu, int pagesize, int start_port)
+static int udp_bind_a_port(int start_port, int *sfd)
{
struct addrinfo hints;
struct addrinfo *result, *rp;
- int sfd, s;
char buf[BUFSIZ];
+ int s;
int num_port = start_port;

again:
@@ -250,15 +249,15 @@ static int open_udp(const char *node, const char *port, int *pid,
pdie("getaddrinfo: error opening udp socket");

for (rp = result; rp != NULL; rp = rp->ai_next) {
- sfd = socket(rp->ai_family, rp->ai_socktype,
- rp->ai_protocol);
- if (sfd < 0)
+ *sfd = socket(rp->ai_family, rp->ai_socktype,
+ rp->ai_protocol);
+ if (*sfd < 0)
continue;

- if (bind(sfd, rp->ai_addr, rp->ai_addrlen) == 0)
+ if (bind(*sfd, rp->ai_addr, rp->ai_addrlen) == 0)
break;

- close(sfd);
+ close(*sfd);
}

if (rp == NULL) {
@@ -270,6 +269,12 @@ static int open_udp(const char *node, const char *port, int *pid,

freeaddrinfo(result);

+ return num_port;
+}
+
+static void fork_udp_reader(int sfd, const char *node, const char *port,
+ int *pid, int cpu, int pagesize)
+{
*pid = fork();

if (*pid < 0)
@@ -279,6 +284,19 @@ static int open_udp(const char *node, const char *port, int *pid,
process_udp_child(sfd, node, port, cpu, pagesize);

close(sfd);
+}
+
+static int open_udp(const char *node, const char *port, int *pid,
+ int cpu, int pagesize, int start_port)
+{
+ int sfd;
+ int num_port;
+
+ num_port = udp_bind_a_port(start_port, &sfd);
+ if (num_port < 0)
+ return num_port;
+
+ fork_udp_reader(sfd, node, port, pid, cpu, pagesize);

return num_port;
}

2013-08-19 09:42:32

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: [RFC PATCH 09/11] trace-cmd: Use poll(2) to wait for a message

Use poll(2) to wait for a message. If a client/server cannot send a message for
any reasons, the current server/client will wait in a blocking read operation.
So, we use poll(2) for avoiding remaining in a blocking state.

This modification avoids to try to connect to an old-version client. An
old-version network server will send a message "tracecmd" first, and an
old-version client will start to send cpu number, pagesize, and option.
By introducing trace-msg, the new client sends a message "MSG_TCONNECT" first,
so the start point of the connection is different. If the old client tries
to connect to the new network server, the server should make the connection
close. A server applied this patch will die when the predetermined time will
have elapsed.

Note that if a new client tries to connect to an old server, the new client
will automatically die. This is because the new client which received the
inappropriate message "tracecmd" will die.

Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
---
trace-msg.c | 50 ++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 42 insertions(+), 8 deletions(-)

diff --git a/trace-msg.c b/trace-msg.c
index 777ea1f..36117cd 100644
--- a/trace-msg.c
+++ b/trace-msg.c
@@ -435,6 +435,27 @@ error:
return -ENOMSG;
}

+#define MSG_WAIT_MSEC 5000
+
+/*
+ * A return value of 0 indicates time-out
+ */
+static int tracecmd_msg_recv_wait(int fd, char *buf, struct tracecmd_msg **msg)
+{
+ struct pollfd pfd;
+ int ret;
+
+ pfd.fd = fd;
+ pfd.events = POLLIN;
+ ret = poll(&pfd, 1, MSG_WAIT_MSEC);
+ if (ret < 0) {
+ return -errno;
+ } else if (ret == 0)
+ return -ETIMEDOUT;
+
+ return tracecmd_msg_recv(fd, buf);
+}
+
static void *tracecmd_msg_buf_access(struct tracecmd_msg *msg, int offset)
{
return (void *)msg + offset;
@@ -448,9 +469,12 @@ static int tracecmd_msg_wait_for_msg(int fd, struct tracecmd_msg **msg)
u32 cmd;
int ret;

- ret = tracecmd_msg_recv(fd, msg_tmp);
- if (ret < 0)
+ ret = tracecmd_msg_recv_wait(fd, msg_tmp, msg);
+ if (ret < 0) {
+ if (ret == -ETIMEDOUT)
+ warning("Connection timed out\n");
return ret;
+ }

*msg = (struct tracecmd_msg *)msg_tmp;
cmd = ntohl((*msg)->cmd);
@@ -549,9 +573,13 @@ int tracecmd_msg_set_connection(int fd)
int ret;

/* wait for connection msg by a client first */
- ret = tracecmd_msg_recv(fd, buf);
+ ret = tracecmd_msg_recv_wait(fd, buf, &msg);
if (ret < 0) {
- warning("Disconnect");
+ if (ret == -ETIMEDOUT)
+ /* network connection will be started soon */
+ warning("No connection message");
+ else
+ warning("Disconnect");
return ret;
}

@@ -586,9 +614,12 @@ int tracecmd_msg_initial_setting(int fd, int *cpus, int *pagesize)
u32 size = 0;
u32 cmd;

- ret = tracecmd_msg_recv(fd, buf);
- if (ret < 0)
+ ret = tracecmd_msg_recv_wait(fd, buf, &msg);
+ if (ret < 0) {
+ if (ret == -ETIMEDOUT)
+ warning("Connection timed out\n");
return ret;
+ }

msg = (struct tracecmd_msg *)buf;
cmd = ntohl(msg->cmd);
@@ -724,9 +755,12 @@ int tracecmd_msg_collect_metadata(int ifd, int ofd)
int ret;

do {
- ret = tracecmd_msg_recv(ifd, buf);
+ ret = tracecmd_msg_recv_wait(ifd, buf, &msg);
if (ret < 0) {
- warning("reading client");
+ if (ret == -ETIMEDOUT)
+ warning("Connection timed out\n");
+ else
+ warning("reading client");
return ret;
}

2013-08-19 09:42:34

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: [RFC PATCH 08/11] trace-cmd: Apply the trace-msg protocol for communication between a server and clients

Apply trace-msg protocol for communication between a server and clients.

Currently, trace-listen(server) and trace-record -N(client) operate as follows:

<server> <client>
listen to socket fd
connect to socket fd
accept the client
send "tracecmd"
+------------> receive "tracecmd"
check "tracecmd"
send cpus
receive cpus <------------+
print "cpus=XXX"
+------------> send pagesize
|
receive pagesize <--------+
print "pagesize=XXX"
+------------> send option
|
receive option <----------+
understand option
send port_array
+------------> receive port_array
understand port_array
send meta data
receive meta data <-------+
record meta data
(snip)
read block
--- start sending trace data on child processes ---

--- When client finishes sending trace data ---
close(socket fd)
read size = 0
close(socket fd)

Currently, all messages are unstructured character strings, so server(client)
using the protocol must parse the unstructured messages. Since it is hard to
add complex contents in the protocol, structured binary message trace-msg
is introduced as the communication protocol.

By applying this patch, server and client operate as follows:

<server> <client>
listen to socket fd
connect to socket fd
accept the client
send a connection message(MSG_TCONNECT)
receive MSG_TCONNECT <----+
send "tracecmd"
+-MSG_RCONNECT-> receive MSG_RCONNECT
check "tracecmd"
send cpus,pagesize,option(MSG_TINIT)
receive MSG_TINIT <-------+
print "cpus=XXX"
print "pagesize=XXX"
understand option
send port_array
+--MSG_RINIT-> receive MSG_RINIT
understand port_array
send meta data(MSG_SENDMETA)
receive MSG_SENDMETA <----+
record meta data
(snip)
send a message to finish sending meta data
| (MSG_FINMETA)
receive MSG_FINMETA <-----+
read block
--- start sending trace data on child processes ---

--- When client finishes sending trace data ---
send MSG_CLOSE
receive MSG_CLOSE <-------+
close(socket fd) close(socket fd)

By introducing trace-msg protocol, when a client tries to connect to the server,
the client sends a message(MSG_TCONNECT) first. When client finishes sending
trace data, the client sends a message(MSG_CLOSE).

This message protocol is incompatible with the previous unstructured message
protocol. So, if an old(new)-version client tries to connect to an
new(old)-version server, the operation should be stopped.

Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
---
Makefile | 2
trace-cmd.h | 12 +
trace-listen.c | 144 +---------
trace-msg.c | 782 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
trace-msg.h | 21 ++
trace-output.c | 4
trace-record.c | 77 ------
7 files changed, 836 insertions(+), 206 deletions(-)
create mode 100644 trace-msg.c
create mode 100644 trace-msg.h

diff --git a/Makefile b/Makefile
index 51c6f39..cebe553 100644
--- a/Makefile
+++ b/Makefile
@@ -314,7 +314,7 @@ KERNEL_SHARK_OBJS = $(TRACE_VIEW_OBJS) $(TRACE_GRAPH_OBJS) $(TRACE_GUI_OBJS) \
PEVENT_LIB_OBJS = event-parse.o trace-seq.o parse-filter.o parse-utils.o
TCMD_LIB_OBJS = $(PEVENT_LIB_OBJS) trace-util.o trace-input.o trace-ftrace.o \
trace-output.o trace-recorder.o trace-restore.o trace-usage.o \
- trace-blk-hack.o kbuffer-parse.o
+ trace-blk-hack.o kbuffer-parse.o trace-msg.o

PLUGIN_OBJS = plugin_hrtimer.o plugin_kmem.o plugin_sched_switch.o \
plugin_mac80211.o plugin_jbd2.o plugin_function.o plugin_kvm.o \
diff --git a/trace-cmd.h b/trace-cmd.h
index cbbc6ed..5ae5313 100644
--- a/trace-cmd.h
+++ b/trace-cmd.h
@@ -248,6 +248,18 @@ void tracecmd_stop_recording(struct tracecmd_recorder *recorder);
void tracecmd_stat_cpu(struct trace_seq *s, int cpu);
long tracecmd_flush_recording(struct tracecmd_recorder *recorder);

+/* for clients */
+int tracecmd_msg_connect_to_server(int fd);
+int tracecmd_msg_metadata_send(int fd, char *buf, int size);
+int tracecmd_msg_finish_sending_metadata(int fd);
+void tracecmd_msg_send_close_msg();
+
+/* for server */
+int tracecmd_msg_set_connection(int fd);
+int tracecmd_msg_initial_setting(int fd, int *cpus, int *pagesize);
+int tracecmd_msg_send_port_array(int fd, int total_cpus, int *ports);
+int tracecmd_msg_collect_metadata(int ifd, int ofd);
+
/* --- Plugin handling --- */
extern struct plugin_option trace_ftrace_options[];

diff --git a/trace-listen.c b/trace-listen.c
index bf9ef9d..3cec10c 100644
--- a/trace-listen.c
+++ b/trace-listen.c
@@ -33,8 +33,7 @@
#include <errno.h>

#include "trace-local.h"
-
-#define MAX_OPTION_SIZE 4096
+#include "trace-msg.h"

static char *default_output_dir = ".";
static char *output_dir;
@@ -45,8 +44,6 @@ static FILE *logfp;

static int debug;

-static int use_tcp;
-
static int backlog = 5;

#define TEMP_FILE_STR "%s.%s:%s.cpu%d", output_file, host, port, cpu
@@ -88,34 +85,9 @@ static void delete_temp_file(const char *host, const char *port, int cpu)
unlink(file);
}

-static int read_string(int fd, char *buf, size_t size)
-{
- size_t i;
- int n;
-
- for (i = 0; i < size; i++) {
- n = read(fd, buf+i, 1);
- if (!buf[i] || n <= 0)
- break;
- }
-
- return i;
-}
-
-static int process_option(char *option)
-{
- /* currently the only option we have is to us TCP */
- if (strcmp(option, "TCP") == 0) {
- use_tcp = 1;
- return 1;
- }
- return 0;
-}
-
-static int done;
static void finish(int sig)
{
- done = 1;
+ done = true;
}

#define LOG_BUF_SIZE 1024
@@ -144,7 +116,7 @@ static void __plog(const char *prefix, const char *fmt, va_list ap,
fprintf(fp, "%.*s", r, buf);
}

-static void plog(const char *fmt, ...)
+void plog(const char *fmt, ...)
{
va_list ap;

@@ -153,7 +125,7 @@ static void plog(const char *fmt, ...)
va_end(ap);
}

-static void pdie(const char *fmt, ...)
+void pdie(const char *fmt, ...)
{
va_list ap;
char *str = "";
@@ -303,75 +275,14 @@ static int open_udp(const char *node, const char *port, int *pid,

static int communicate_with_client(int fd, int *cpus, int *pagesize)
{
- char buf[BUFSIZ];
- char *option;
- int options;
- int size;
- int n, s, t, i;
-
/* Let the client know what we are */
- write(fd, "tracecmd", 8);
-
- /* read back the CPU count */
- n = read_string(fd, buf, BUFSIZ);
- if (n == BUFSIZ)
- /** ERROR **/
- return -1;
-
- *cpus = atoi(buf);
-
- plog("cpus=%d\n", *cpus);
- if (*cpus < 0)
+ if (tracecmd_msg_set_connection(fd) < 0)
return -1;

- /* next read the page size */
- n = read_string(fd, buf, BUFSIZ);
- if (n == BUFSIZ)
- /** ERROR **/
+ /* read the CPU count, the page size, and options */
+ if (tracecmd_msg_initial_setting(fd, cpus, pagesize) < 0)
return -1;

- *pagesize = atoi(buf);
-
- plog("pagesize=%d\n", *pagesize);
- if (*pagesize <= 0)
- return -1;
-
- /* Now the number of options */
- n = read_string(fd, buf, BUFSIZ);
- if (n == BUFSIZ)
- /** ERROR **/
- return -1;
-
- options = atoi(buf);
-
- for (i = 0; i < options; i++) {
- /* next is the size of the options */
- n = read_string(fd, buf, BUFSIZ);
- if (n == BUFSIZ)
- /** ERROR **/
- return -1;
- size = atoi(buf);
- /* prevent a client from killing us */
- if (size > MAX_OPTION_SIZE)
- return -1;
- option = malloc_or_die(size);
- do {
- t = size;
- s = 0;
- s = read(fd, option+s, t);
- if (s <= 0)
- return -1;
- t -= s;
- s = size - t;
- } while (t);
-
- s = process_option(option);
- free(option);
- /* do we understand this option? */
- if (!s)
- return -1;
- }
-
if (use_tcp)
plog("Using TCP for live connection\n");

@@ -409,7 +320,6 @@ static void destroy_all_readers(int cpus, int *pid_array, const char *node,
static int *create_all_readers(int cpus, const char *node, const char *port,
int pagesize, int fd)
{
- char buf[BUFSIZ];
int *port_array;
int *pid_array;
int start_port;
@@ -438,14 +348,9 @@ static int *create_all_readers(int cpus, const char *node, const char *port,
start_port = udp_port + 1;
}

- /* send the client a comma deliminated set of port numbers */
- for (cpu = 0; cpu < cpus; cpu++) {
- snprintf(buf, BUFSIZ, "%s%d",
- cpu ? "," : "", port_array[cpu]);
- write(fd, buf, strlen(buf));
- }
- /* end with null terminator */
- write(fd, "\0", 1);
+ /* send set of port numbers to the client */
+ if (tracecmd_msg_send_port_array(fd, cpus, port_array) < 0)
+ goto out_free;

return pid_array;

@@ -454,33 +359,6 @@ static int *create_all_readers(int cpus, const char *node, const char *port,
return NULL;
}

-static void collect_metadata_from_client(int ifd, int ofd)
-{
- char buf[BUFSIZ];
- int n, s, t;
-
- do {
- n = read(ifd, buf, BUFSIZ);
- if (n < 0) {
- if (errno == EINTR)
- continue;
- pdie("reading client");
- }
- t = n;
- s = 0;
- do {
- s = write(ofd, buf + s, t);
- if (s < 0) {
- if (errno == EINTR)
- break;
- pdie("writing to file");
- }
- t -= s;
- s = n - t;
- } while (t);
- } while (n > 0 && !done);
-}
-
static void stop_all_readers(int cpus, int *pid_array)
{
int cpu;
@@ -524,7 +402,7 @@ static void process_client(const char *node, const char *port, int fd)
return;

/* Now we are ready to start reading data from the client */
- collect_metadata_from_client(fd, ofd);
+ tracecmd_msg_collect_metadata(fd, ofd);

/* wait a little to let our readers finish reading */
sleep(1);
diff --git a/trace-msg.c b/trace-msg.c
new file mode 100644
index 0000000..777ea1f
--- /dev/null
+++ b/trace-msg.c
@@ -0,0 +1,782 @@
+/*
+ * trace-msg.c : define message protocol for communication between clients and
+ * a server
+ *
+ * Copyright (C) 2013 Hitachi, Ltd.
+ * Created by Yoshihiro YUNOMAE <[email protected]>
+ *
+ * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License (not later!)
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses>
+ *
+ * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ */
+
+#include <errno.h>
+#include <poll.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <stddef.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <arpa/inet.h>
+#include <sys/types.h>
+#include <linux/types.h>
+
+#include "trace-cmd-local.h"
+#include "trace-msg.h"
+
+typedef __u32 u32;
+typedef __be32 be32;
+
+#define TRACECMD_MSG_MAX_LEN BUFSIZ
+
+ /* size + cmd */
+#define TRACECMD_MSG_HDR_LEN ((sizeof(be32)) + (sizeof(be32)))
+
+ /* + size of the metadata */
+#define TRACECMD_MSG_META_MIN_LEN \
+ ((TRACECMD_MSG_HDR_LEN) + (sizeof(be32)))
+
+ /* - header size for error msg */
+#define TRACECMD_MSG_META_MAX_LEN \
+((TRACECMD_MSG_MAX_LEN) - (TRACECMD_MSG_META_MIN_LEN) - TRACECMD_MSG_HDR_LEN)
+
+ /* size + opt_cmd + size of str */
+#define TRACECMD_OPT_MIN_LEN \
+ ((sizeof(be32)) + (sizeof(be32)) +(sizeof(be32)))
+
+
+#define UDP_MAX_PACKET (65536 - 20)
+#define CPU_MAX 256
+
+/* use CONNECTION_MSG as a protocol version of trace-msg */
+#define MSG_VERSION "v2"
+#define CONNECTION_MSG "tracecmd-msg-" MSG_VERSION
+#define CONNECTION_MSGSIZE sizeof(CONNECTION_MSG)
+
+/* for client */
+static int psfd;
+
+/* for server */
+static int *port_array;
+
+struct tracecmd_msg_str {
+ be32 size;
+ char *buf;
+} __attribute__((packed));
+
+struct tracecmd_msg_rconnect {
+ struct tracecmd_msg_str str;
+};
+
+struct tracecmd_msg_opt {
+ be32 size;
+ be32 opt_cmd;
+ struct tracecmd_msg_str str;
+};
+
+struct tracecmd_msg_tinit {
+ be32 cpus;
+ be32 page_size;
+ be32 opt_num;
+ struct tracecmd_msg_opt *opt;
+} __attribute__((packed));
+
+struct tracecmd_msg_rinit {
+ be32 cpus;
+ be32 port_array[CPU_MAX];
+} __attribute__((packed));
+
+struct tracecmd_msg_meta {
+ struct tracecmd_msg_str str;
+};
+
+struct tracecmd_msg_error {
+ be32 size;
+ be32 cmd;
+ union {
+ struct tracecmd_msg_rconnect rconnect;
+ struct tracecmd_msg_tinit tinit;
+ struct tracecmd_msg_rinit rinit;
+ struct tracecmd_msg_meta meta;
+ } data;
+} __attribute__((packed));
+
+enum tracecmd_msg_cmd {
+ MSG_ERROR = 0,
+ MSG_CLOSE = 1,
+ MSG_TCONNECT = 2,
+ MSG_RCONNECT = 3,
+ MSG_TINIT = 4,
+ MSG_RINIT = 5,
+ MSG_SENDMETA = 6,
+ MSG_FINMETA = 7,
+};
+
+struct tracecmd_msg {
+ be32 size;
+ be32 cmd;
+ union {
+ struct tracecmd_msg_rconnect rconnect;
+ struct tracecmd_msg_tinit tinit;
+ struct tracecmd_msg_rinit rinit;
+ struct tracecmd_msg_meta meta;
+ struct tracecmd_msg_error err;
+ } data;
+} __attribute__((packed));
+
+struct tracecmd_msg *errmsg;
+
+static ssize_t msg_do_write_check(int fd, struct tracecmd_msg *msg)
+{
+ return __do_write_check(fd, msg, ntohl(msg->size));
+}
+
+static struct tracecmd_msg *tracecmd_msg_alloc(u32 size)
+{
+ size += TRACECMD_MSG_HDR_LEN;
+ return malloc(size);
+}
+
+static void tracecmd_msg_init(u32 cmd, u32 size, struct tracecmd_msg *msg)
+{
+ size += TRACECMD_MSG_HDR_LEN;
+ memset(msg, 0, size);
+ msg->size = htonl(size);
+ msg->cmd = htonl(cmd);
+}
+
+static void bufcpy(void *dest, u32 offset, const void *buf, u32 buflen)
+{
+ memcpy(dest + offset, buf, buflen);
+}
+
+static int make_rconnect(const char *buf, int buflen, struct tracecmd_msg *msg)
+{
+ u32 offset = offsetof(struct tracecmd_msg, data.rconnect.str.buf);
+
+ msg->data.rconnect.str.size = htonl(buflen);
+ bufcpy(msg, offset, buf, buflen);
+
+ return 0;
+}
+
+enum msg_opt_command {
+ MSGOPT_USETCP = 1,
+};
+
+static struct tracecmd_msg_opt *tracecmd_msg_opt_alloc(u32 len)
+{
+ len += TRACECMD_OPT_MIN_LEN;
+ return malloc(len);
+}
+
+static void make_option(int opt_cmd, const char *buf,
+ struct tracecmd_msg_opt *opt)
+{
+ u32 buflen = 0;
+ u32 size = TRACECMD_OPT_MIN_LEN;
+
+ if (buf) {
+ buflen = strlen(buf);
+ size += buflen;
+ }
+
+ opt->size = htonl(size);
+ opt->opt_cmd = htonl(opt_cmd);
+ opt->str.size = htonl(buflen);
+
+ if (buf)
+ bufcpy(opt, TRACECMD_OPT_MIN_LEN, buf, buflen);
+}
+
+static int add_options_to_tinit(u32 len, struct tracecmd_msg *msg)
+{
+ struct tracecmd_msg_opt *opt;
+ int offset = offsetof(struct tracecmd_msg, data.tinit.opt);
+
+ if (use_tcp) {
+ opt = tracecmd_msg_opt_alloc(0);
+ if (!opt)
+ return -ENOMEM;
+
+ make_option(MSGOPT_USETCP, NULL, opt);
+ /* add option */
+ bufcpy(msg, offset, opt, ntohl(opt->size));
+ free(opt);
+ }
+
+ return 0;
+}
+
+static int make_tinit(u32 len, struct tracecmd_msg *msg)
+{
+ int opt_num = 0;
+ int ret = 0;
+
+ if (use_tcp)
+ opt_num++;
+
+ if (opt_num) {
+ ret = add_options_to_tinit(len, msg);
+ if (ret < 0)
+ return ret;
+ }
+
+ msg->data.tinit.cpus = htonl(cpu_count);
+ msg->data.tinit.page_size = htonl(page_size);
+ msg->data.tinit.opt_num = htonl(opt_num);
+
+ return 0;
+}
+
+static int make_rinit(struct tracecmd_msg *msg)
+{
+ int i;
+ u32 offset = TRACECMD_MSG_HDR_LEN;
+ be32 port;
+
+ msg->data.rinit.cpus = htonl(cpu_count);
+
+ for (i = 0; i < cpu_count; i++) {
+ /* + rrqports->cpus or rrqports->port_array[i] */
+ offset += sizeof(be32);
+ port = htonl(port_array[i]);
+ bufcpy(msg, offset, &port, sizeof(be32) * cpu_count);
+ }
+
+ return 0;
+}
+
+static int make_error_msg(u32 len, struct tracecmd_msg *msg)
+{
+ bufcpy(msg, TRACECMD_MSG_HDR_LEN, errmsg, len);
+ return 0;
+}
+
+static u32 tracecmd_msg_get_body_length(u32 cmd)
+{
+ struct tracecmd_msg *msg;
+ u32 len = 0;
+
+ switch (cmd) {
+ case MSG_ERROR:
+ return ntohl(errmsg->size);
+ case MSG_RCONNECT:
+ return sizeof(msg->data.rconnect.str.size) + CONNECTION_MSGSIZE;
+ case MSG_TINIT:
+ len = sizeof(msg->data.tinit.cpus)
+ + sizeof(msg->data.tinit.page_size)
+ + sizeof(msg->data.tinit.opt_num);
+
+ /*
+ * If we are using IPV4 and our page size is greater than
+ * or equal to 64K, we need to punt and use TCP. :-(
+ */
+
+ /* TODO, test for ipv4 */
+ if (page_size >= UDP_MAX_PACKET) {
+ warning("page size too big for UDP using TCP "
+ "in live read");
+ use_tcp = true;
+ }
+
+ if (use_tcp)
+ len += TRACECMD_OPT_MIN_LEN;
+
+ return len;
+ case MSG_RINIT:
+ return sizeof(msg->data.rinit.cpus)
+ + sizeof(msg->data.rinit.port_array);
+ case MSG_SENDMETA:
+ return TRACECMD_MSG_MAX_LEN - TRACECMD_MSG_HDR_LEN;
+ case MSG_CLOSE:
+ case MSG_TCONNECT:
+ case MSG_FINMETA:
+ break;
+ }
+
+ return 0;
+}
+
+static int tracecmd_msg_make_body(u32 cmd, u32 len, struct tracecmd_msg *msg)
+{
+ switch (cmd) {
+ case MSG_ERROR:
+ return make_error_msg(len, msg);
+ case MSG_RCONNECT:
+ return make_rconnect(CONNECTION_MSG, CONNECTION_MSGSIZE, msg);
+ case MSG_TINIT:
+ return make_tinit(len, msg);
+ case MSG_RINIT:
+ return make_rinit(msg);
+ case MSG_CLOSE:
+ case MSG_TCONNECT:
+ case MSG_SENDMETA: /* meta data is not stored here. */
+ case MSG_FINMETA:
+ break;
+ }
+
+ return 0;
+}
+
+static int tracecmd_msg_create(u32 cmd, struct tracecmd_msg **msg)
+{
+ u32 len = 0;
+ int ret = 0;
+
+ len = tracecmd_msg_get_body_length(cmd);
+ if (len > (TRACECMD_MSG_MAX_LEN - TRACECMD_MSG_HDR_LEN)) {
+ plog("Exceed maximum message size cmd=%d\n", cmd);
+ return -EINVAL;
+ }
+
+ *msg = tracecmd_msg_alloc(len);
+ if (!*msg)
+ return -ENOMEM;
+ tracecmd_msg_init(cmd, len, *msg);
+
+ ret = tracecmd_msg_make_body(cmd, len, *msg);
+ if (ret < 0)
+ free(*msg);
+
+ return ret;
+}
+
+static int tracecmd_msg_send(int fd, u32 cmd)
+{
+ struct tracecmd_msg *msg = NULL;
+ int ret = 0;
+
+ if (cmd > MSG_FINMETA) {
+ plog("Unsupported command: %d\n", cmd);
+ return -EINVAL;
+ }
+
+ ret = tracecmd_msg_create(cmd, &msg);
+ if (ret < 0)
+ return ret;
+
+ ret = msg_do_write_check(fd, msg);
+ if (ret < 0) {
+ free(msg);
+ return -ECOMM;
+ }
+
+ return 0;
+}
+
+static void tracecmd_msg_send_error(int fd, struct tracecmd_msg *msg)
+{
+ errmsg = msg;
+ tracecmd_msg_send(fd, MSG_ERROR);
+}
+
+static int tracecmd_msg_read_extra(int fd, char *buf, u32 size, int *n)
+{
+ int r = 0;
+
+ do {
+ r = read(fd, buf + *n, size);
+ if (r < 0) {
+ if (errno == EINTR)
+ continue;
+ return -errno;
+ } else if (!r)
+ return -ENOTCONN;
+ size -= r;
+ *n += r;
+ } while (size);
+
+ return 0;
+}
+
+/*
+ * Read header information of msg first, then read all data
+ */
+static int tracecmd_msg_recv(int fd, char *buf)
+{
+ struct tracecmd_msg *msg;
+ u32 size = 0;
+ int n = 0;
+ int ret;
+
+ ret = tracecmd_msg_read_extra(fd, buf, TRACECMD_MSG_HDR_LEN, &n);
+ if (ret < 0)
+ return ret;
+
+ msg = (struct tracecmd_msg *)buf;
+ size = ntohl(msg->size);
+ if (size > TRACECMD_MSG_MAX_LEN)
+ /* too big */
+ goto error;
+ else if (size < TRACECMD_MSG_HDR_LEN)
+ /* too small */
+ goto error;
+ else if (size > TRACECMD_MSG_HDR_LEN) {
+ size -= TRACECMD_MSG_HDR_LEN;
+ return tracecmd_msg_read_extra(fd, buf, size, &n);
+ }
+
+ return 0;
+error:
+ plog("Receive an invalid message(size=%d)\n", size);
+ return -ENOMSG;
+}
+
+static void *tracecmd_msg_buf_access(struct tracecmd_msg *msg, int offset)
+{
+ return (void *)msg + offset;
+}
+
+static int tracecmd_msg_wait_for_msg(int fd, struct tracecmd_msg **msg)
+{
+ char msg_tmp[TRACECMD_MSG_MAX_LEN];
+ char *buf;
+ int offset = TRACECMD_MSG_HDR_LEN;
+ u32 cmd;
+ int ret;
+
+ ret = tracecmd_msg_recv(fd, msg_tmp);
+ if (ret < 0)
+ return ret;
+
+ *msg = (struct tracecmd_msg *)msg_tmp;
+ cmd = ntohl((*msg)->cmd);
+ switch (cmd) {
+ case MSG_RCONNECT:
+ offset += sizeof((*msg)->data.rconnect.str.size);
+ buf = tracecmd_msg_buf_access(*msg, offset);
+ /* Make sure the server is the tracecmd server */
+ if (memcmp(buf, CONNECTION_MSG,
+ ntohl((*msg)->data.rconnect.str.size) - 1) != 0) {
+ warning("server not tracecmd server");
+ return -EPROTONOSUPPORT;
+ }
+ break;
+ case MSG_CLOSE:
+ return -ECONNABORTED;
+ case MSG_ERROR:
+ plog("Receive error message: cmd=%d size=%d\n",
+ ntohl((*msg)->data.err.cmd),
+ ntohl((*msg)->data.err.size));
+ return -EBADMSG;
+ };
+
+ return 0;
+}
+
+int tracecmd_msg_connect_to_server(int fd)
+{
+ struct tracecmd_msg *msg;
+ u32 reply_cmd, cmd;
+ int i, ret, cpus;
+
+ /* connect to a server */
+ cmd = MSG_TCONNECT;
+
+ do {
+ ret = tracecmd_msg_send(fd, cmd);
+ if (ret < 0)
+ goto error;
+
+ ret = tracecmd_msg_wait_for_msg(fd, &msg);
+ if (ret < 0) {
+ if (ret == -EPROTONOSUPPORT)
+ goto error;
+ else
+ return ret;
+ }
+ reply_cmd = ntohl(msg->cmd);
+ cmd = reply_cmd + 1;
+ } while (reply_cmd != MSG_RINIT);
+
+ cpus = ntohl(msg->data.rinit.cpus);
+ client_ports = malloc_or_die(sizeof(int) * cpus);
+ for (i = 0; i < cpus; i++)
+ client_ports[i] = ntohl(msg->data.rinit.port_array[i]);
+
+ /* Next, send meta data */
+ send_metadata = true;
+
+ return 0;
+
+error:
+ tracecmd_msg_send_error(fd, msg);
+ return ret;
+}
+
+static bool process_option(struct tracecmd_msg_opt *opt)
+{
+ /* currently the only option we have is to us TCP */
+ if (ntohl(opt->opt_cmd) == MSGOPT_USETCP) {
+ use_tcp = true;
+ return true;
+ }
+ return false;
+}
+
+static void error_operation_for_server(struct tracecmd_msg *msg)
+{
+ u32 cmd;
+
+ cmd = ntohl(msg->cmd);
+
+ if (cmd == MSG_ERROR)
+ plog("Receive error message: cmd=%d size=%d\n",
+ ntohl(msg->data.err.cmd),
+ ntohl(msg->data.err.size));
+ else
+ warning("Message: cmd=%d size=%d\n", cmd, ntohl(msg->size));
+}
+
+int tracecmd_msg_set_connection(int fd)
+{
+ struct tracecmd_msg *msg;
+ char buf[TRACECMD_MSG_MAX_LEN];
+ u32 cmd;
+ int ret;
+
+ /* wait for connection msg by a client first */
+ ret = tracecmd_msg_recv(fd, buf);
+ if (ret < 0) {
+ warning("Disconnect");
+ return ret;
+ }
+
+ msg = (struct tracecmd_msg *)buf;
+ cmd = ntohl(msg->cmd);
+ if (cmd == MSG_CLOSE)
+ return -ECONNABORTED;
+ else if (cmd != MSG_TCONNECT)
+ return -EINVAL;
+
+ ret = tracecmd_msg_send(fd, MSG_RCONNECT);
+ if (ret < 0)
+ goto error;
+
+ return 0;
+
+error:
+ error_operation_for_server(msg);
+ return ret;
+}
+
+#define MAX_OPTION_SIZE 4096
+
+int tracecmd_msg_initial_setting(int fd, int *cpus, int *pagesize)
+{
+ struct tracecmd_msg *msg;
+ struct tracecmd_msg_opt *opt;
+ char buf[TRACECMD_MSG_MAX_LEN];
+ int offset = offsetof(struct tracecmd_msg, data.tinit.opt);
+ int options, i, s;
+ int ret;
+ u32 size = 0;
+ u32 cmd;
+
+ ret = tracecmd_msg_recv(fd, buf);
+ if (ret < 0)
+ return ret;
+
+ msg = (struct tracecmd_msg *)buf;
+ cmd = ntohl(msg->cmd);
+ if (cmd != MSG_TINIT) {
+ ret = -EINVAL;
+ goto error;
+ }
+
+ *cpus = ntohl(msg->data.tinit.cpus);
+ plog("cpus=%d\n", *cpus);
+ if (*cpus < 0) {
+ ret = -EINVAL;
+ goto error;
+ }
+
+ *pagesize = ntohl(msg->data.tinit.page_size);
+ plog("pagesize=%d\n", *pagesize);
+ if (*pagesize <= 0) {
+ ret = -EINVAL;
+ goto error;
+ }
+
+ options = ntohl(msg->data.tinit.opt_num);
+ for (i = 0; i < options; i++) {
+ offset += size;
+ opt = tracecmd_msg_buf_access(msg, offset);
+ size = ntohl(opt->size);
+ /* prevent a client from killing us */
+ if (size > MAX_OPTION_SIZE) {
+ plog("Exceed MAX_OPTION_SIZE\n");
+ ret = -EINVAL;
+ goto error;
+ }
+ s = process_option(opt);
+ /* do we understand this option? */
+ if (!s) {
+ plog("Cannot understand(%d:%d:%d)\n",
+ i, ntohl(opt->size), ntohl(opt->opt_cmd));
+ ret = -EINVAL;
+ goto error;
+ }
+ }
+
+ return 0;
+
+error:
+ error_operation_for_server(msg);
+ return ret;
+}
+
+int tracecmd_msg_send_port_array(int fd, int total_cpus, int *ports)
+{
+ int ret;
+
+ cpu_count = total_cpus;
+ port_array = ports;
+
+ ret = tracecmd_msg_send(fd, MSG_RINIT);
+ if (ret < 0)
+ return ret;
+
+ return 0;
+}
+
+void tracecmd_msg_send_close_msg()
+{
+ tracecmd_msg_send(psfd, MSG_CLOSE);
+}
+
+static void make_meta(const char *buf, int buflen, struct tracecmd_msg *msg)
+{
+ int offset = offsetof(struct tracecmd_msg, data.meta.str.buf);
+
+ msg->data.meta.str.size = htonl(buflen);
+ bufcpy(msg, offset, buf, buflen);
+}
+
+int tracecmd_msg_metadata_send(int fd, char *buf, int size)
+{
+ struct tracecmd_msg *msg;
+ int n, len, ret;
+ int count = 0;
+
+ ret = tracecmd_msg_create(MSG_SENDMETA, &msg);
+ if (ret < 0)
+ return ret;
+
+ n = size;
+ do {
+ if (n > TRACECMD_MSG_META_MAX_LEN) {
+ make_meta(buf + count, TRACECMD_MSG_META_MAX_LEN,
+ msg);
+ n -= TRACECMD_MSG_META_MAX_LEN;
+ count += TRACECMD_MSG_META_MAX_LEN;
+ } else {
+ make_meta(buf + count, n, msg);
+ /*
+ * TRACECMD_MSG_META_MAX_LEN is stored in msg->size,
+ * so update the size to the correct value.
+ */
+ len = TRACECMD_MSG_META_MIN_LEN + n;
+ msg->size = htonl(len);
+ n = 0;
+ }
+
+ ret = msg_do_write_check(fd, msg);
+ if (ret < 0)
+ return ret;
+ } while (n);
+
+ return 0;
+}
+
+int tracecmd_msg_finish_sending_metadata(int fd)
+{
+ int ret;
+
+ ret = tracecmd_msg_send(fd, MSG_FINMETA);
+ if (ret < 0)
+ return ret;
+
+ /* psfd will be used for closing */
+ psfd = fd;
+ return 0;
+}
+
+int tracecmd_msg_collect_metadata(int ifd, int ofd)
+{
+ struct tracecmd_msg *msg;
+ char buf[TRACECMD_MSG_MAX_LEN];
+ u32 s, t, n, cmd;
+ int offset = TRACECMD_MSG_META_MIN_LEN;
+ int ret;
+
+ do {
+ ret = tracecmd_msg_recv(ifd, buf);
+ if (ret < 0) {
+ warning("reading client");
+ return ret;
+ }
+
+ msg = (struct tracecmd_msg *)buf;
+ cmd = ntohl(msg->cmd);
+ if (cmd == MSG_FINMETA) {
+ /* Finish receiving meta data */
+ break;
+ } else if (cmd != MSG_SENDMETA)
+ goto error;
+
+ n = ntohl(msg->data.meta.str.size);
+ t = n;
+ s = 0;
+ do {
+ s = write(ofd, buf + s + offset, t);
+ if (s < 0) {
+ if (errno == EINTR)
+ continue;
+ warning("writing to file");
+ return -errno;
+ }
+ t -= s;
+ s = n - t;
+ } while (t);
+ } while (cmd == MSG_SENDMETA);
+
+ /* check the finish message of the client */
+ while(!done) {
+ ret = tracecmd_msg_recv(ifd, buf);
+ if (ret < 0) {
+ warning("reading client");
+ return ret;
+ }
+
+ msg = (struct tracecmd_msg *)buf;
+ cmd = ntohl(msg->cmd);
+ if (cmd == MSG_CLOSE)
+ /* Finish this connection */
+ break;
+ else {
+ warning("Not accept the message %d", ntohl(msg->cmd));
+ ret = -EINVAL;
+ goto error;
+ }
+ }
+
+ return 0;
+
+error:
+ error_operation_for_server(msg);
+ return ret;
+}
diff --git a/trace-msg.h b/trace-msg.h
new file mode 100644
index 0000000..9ec95e5
--- /dev/null
+++ b/trace-msg.h
@@ -0,0 +1,21 @@
+#ifndef _TRACE_MSG_H_
+#define _TRACE_MSG_H_
+
+#include <stdbool.h>
+
+/* for both clients and server */
+bool use_tcp;
+int cpu_count;
+
+/* for clients */
+unsigned int page_size;
+int *client_ports;
+bool send_metadata;
+
+/* for server */
+bool done;
+
+void plog(const char *fmt, ...);
+void pdie(const char *fmt, ...);
+
+#endif /* _TRACE_MSG_H_ */
diff --git a/trace-output.c b/trace-output.c
index bdb478d..6e1298b 100644
--- a/trace-output.c
+++ b/trace-output.c
@@ -36,6 +36,7 @@
#include <glob.h>

#include "trace-cmd-local.h"
+#include "trace-msg.h"
#include "version.h"

/* We can't depend on the host size for size_t, all must be 64 bit */
@@ -80,6 +81,9 @@ struct list_event_system {
static stsize_t
do_write_check(struct tracecmd_output *handle, void *data, tsize_t size)
{
+ if (send_metadata)
+ return tracecmd_msg_metadata_send(handle->fd, data, size);
+
return __do_write_check(handle->fd, data, size);
}

diff --git a/trace-record.c b/trace-record.c
index 6a096bd..ef30e31 100644
--- a/trace-record.c
+++ b/trace-record.c
@@ -45,6 +45,7 @@
#include <errno.h>

#include "trace-local.h"
+#include "trace-msg.h"

#define _STR(x) #x
#define STR(x) _STR(x)
@@ -59,29 +60,21 @@
#define STAMP "stamp"
#define FUNC_STACK_TRACE "func_stack_trace"

-#define UDP_MAX_PACKET (65536 - 20)
-
static int tracing_on_init_val;

static int rt_prio;

-static int use_tcp;
-
-static unsigned int page_size;
-
static int buffer_size;

static const char *output_file = "trace.dat";

static int latency;
static int sleep_time = 1000;
-static int cpu_count;
static int recorder_threads;
static int *pids;
static int buffers;

static char *host;
-static int *client_ports;
static int sfd;

/* Max size to let a per cpu file get */
@@ -1609,70 +1602,8 @@ static int create_recorder(struct buffer_instance *instance, int cpu, int extrac

static void communicate_with_listener(int fd)
{
- char buf[BUFSIZ];
- ssize_t n;
- int cpu, i;
-
- n = read(fd, buf, 8);
-
- /* Make sure the server is the tracecmd server */
- if (memcmp(buf, "tracecmd", 8) != 0)
- die("server not tracecmd server");
-
- /* write the number of CPUs we have (in ASCII) */
-
- sprintf(buf, "%d", cpu_count);
-
- /* include \0 */
- write(fd, buf, strlen(buf)+1);
-
- /* write the pagesize (in ASCII) */
- sprintf(buf, "%d", page_size);
-
- /* include \0 */
- write(fd, buf, strlen(buf)+1);
-
- /*
- * If we are using IPV4 and our page size is greater than
- * or equal to 64K, we need to punt and use TCP. :-(
- */
-
- /* TODO, test for ipv4 */
- if (page_size >= UDP_MAX_PACKET) {
- warning("page size too big for UDP using TCP in live read");
- use_tcp = 1;
- }
-
- if (use_tcp) {
- /* Send one option */
- write(fd, "1", 2);
- /* Size 4 */
- write(fd, "4", 2);
- /* use TCP */
- write(fd, "TCP", 4);
- } else
- /* No options */
- write(fd, "0", 2);
-
- client_ports = malloc_or_die(sizeof(int) * cpu_count);
-
- /*
- * Now we will receive back a comma deliminated list
- * of client ports to connect to.
- */
- for (cpu = 0; cpu < cpu_count; cpu++) {
- for (i = 0; i < BUFSIZ; i++) {
- n = read(fd, buf+i, 1);
- if (n != 1)
- die("Error, reading server ports");
- if (!buf[i] || buf[i] == ',')
- break;
- }
- if (i == BUFSIZ)
- die("read bad port number");
- buf[i] = 0;
- client_ports[cpu] = atoi(buf);
- }
+ if (tracecmd_msg_connect_to_server(fd) < 0)
+ die("Cannot communicate with server");
}

static void setup_network(void)
@@ -1727,12 +1658,14 @@ static void setup_network(void)

/* Now create the handle through this socket */
handle = tracecmd_create_init_fd_glob(sfd, listed_events);
+ tracecmd_msg_finish_sending_metadata(sfd);

/* OK, we are all set, let'r rip! */
}

static void finish_network(void)
{
+ tracecmd_msg_send_close_msg();
close(sfd);
free(host);
}

2013-08-19 09:42:49

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: [RFC PATCH 11/11] trace-cmd: Add --virt option for record mode

Add --virt option for record mode for a virtualization environment.
If we use this option on a guest, we can send trace data in low overhead.
This is because guests can send trace data to a host without copying the data
by using splice(2).

The format is:

trace-cmd record --virt -e sched*

<Restriction>
This feature can use from kernel-3.6 which supports splice_read for ftrace
and splice_write for virtio-serial.

Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
---
Documentation/trace-cmd-record.1.txt | 11 +++++-
trace-cmd.h | 3 +-
trace-msg.c | 35 ++++++++++++++++--
trace-msg.h | 4 ++
trace-record.c | 66 ++++++++++++++++++++++++++++++++--
5 files changed, 110 insertions(+), 9 deletions(-)

diff --git a/Documentation/trace-cmd-record.1.txt b/Documentation/trace-cmd-record.1.txt
index 832a257..7eb8ac9 100644
--- a/Documentation/trace-cmd-record.1.txt
+++ b/Documentation/trace-cmd-record.1.txt
@@ -240,6 +240,15 @@ OPTIONS
timestamp to gettimeofday which will allow wall time output from the
timestamps reading the created 'trace.dat' file.

+*--virt*::
+ This option is usded on a guest in a virtualization environment. If a host
+ is running "trace-cmd virt-server", this option is used to have the data
+ sent to the host with virtio-serial like *-N* option. (see also
+ trace-cmd-virt-server(1))
+
+ Note: This option is not supported with latency tracer plugins:
+ wakeup, wakeup_rt, irqsoff, preemptoff and preemptirqsoff
+
EXAMPLES
--------

@@ -302,7 +311,7 @@ SEE ALSO
--------
trace-cmd(1), trace-cmd-report(1), trace-cmd-start(1), trace-cmd-stop(1),
trace-cmd-extract(1), trace-cmd-reset(1), trace-cmd-split(1),
-trace-cmd-list(1), trace-cmd-listen(1)
+trace-cmd-list(1), trace-cmd-listen(1), trace-cmd-virt-server(1)

AUTHOR
------
diff --git a/trace-cmd.h b/trace-cmd.h
index b4c2267..5d895ff 100644
--- a/trace-cmd.h
+++ b/trace-cmd.h
@@ -251,7 +251,8 @@ void tracecmd_stat_cpu(struct trace_seq *s, int cpu);
long tracecmd_flush_recording(struct tracecmd_recorder *recorder);

/* for clients */
-int tracecmd_msg_connect_to_server(int fd);
+int tracecmd_msg_connect_to_server_nw(int fd);
+int tracecmd_msg_connect_to_server_virt(int fd);
int tracecmd_msg_metadata_send(int fd, char *buf, int size);
int tracecmd_msg_finish_sending_metadata(int fd);
void tracecmd_msg_send_close_msg();
diff --git a/trace-msg.c b/trace-msg.c
index 251e99c..cdeeed3 100644
--- a/trace-msg.c
+++ b/trace-msg.c
@@ -30,6 +30,7 @@
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
+#include <sys/stat.h>
#include <sys/types.h>
#include <linux/types.h>

@@ -503,11 +504,12 @@ static int tracecmd_msg_wait_for_msg(int fd, struct tracecmd_msg **msg)
return 0;
}

-int tracecmd_msg_connect_to_server(int fd)
+static int tracecmd_msg_connect_to_server(int fd, bool nw)
{
struct tracecmd_msg *msg;
u32 reply_cmd, cmd;
int i, ret, cpus;
+ char buf[PATH_MAX];

/* connect to a server */
cmd = MSG_TCONNECT;
@@ -529,9 +531,24 @@ int tracecmd_msg_connect_to_server(int fd)
} while (reply_cmd != MSG_RINIT);

cpus = ntohl(msg->data.rinit.cpus);
- client_ports = malloc_or_die(sizeof(int) * cpus);
- for (i = 0; i < cpus; i++)
- client_ports[i] = ntohl(msg->data.rinit.port_array[i]);
+ if (nw) {
+ client_ports = malloc_or_die(sizeof(int) * cpus);
+ for (i = 0; i < cpus; i++)
+ client_ports[i] =
+ ntohl(msg->data.rinit.port_array[i]);
+ } else {
+ virt_sfds = malloc_or_die(sizeof(int) * cpus);
+
+ /* Open data paths of virtio-serial */
+ for (i = 0; i < cpus; i++) {
+ snprintf(buf, PATH_MAX, TRACE_PATH_CPU, i);
+ virt_sfds[i] = open(buf, O_WRONLY);
+ if (virt_sfds[i] < 0) {
+ warning("Cannot open %s", TRACE_PATH_CPU, i);
+ return -errno;
+ }
+ }
+ }

/* Next, send meta data */
send_metadata = true;
@@ -543,6 +560,16 @@ error:
return ret;
}

+int tracecmd_msg_connect_to_server_nw(int fd)
+{
+ return tracecmd_msg_connect_to_server(fd, true);
+}
+
+int tracecmd_msg_connect_to_server_virt(int fd)
+{
+ return tracecmd_msg_connect_to_server(fd, false);
+}
+
static bool process_option(struct tracecmd_msg_opt *opt)
{
/* currently the only option we have is to us TCP */
diff --git a/trace-msg.h b/trace-msg.h
index 9ec95e5..d1f7321 100644
--- a/trace-msg.h
+++ b/trace-msg.h
@@ -2,6 +2,9 @@
#define _TRACE_MSG_H_

#include <stdbool.h>
+#define VIRTIO_PORTS "/dev/virtio-ports/"
+#define AGENT_CTL_PATH VIRTIO_PORTS "agent-ctl-path"
+#define TRACE_PATH_CPU VIRTIO_PORTS "trace-path-cpu%d"

/* for both clients and server */
bool use_tcp;
@@ -11,6 +14,7 @@ int cpu_count;
unsigned int page_size;
int *client_ports;
bool send_metadata;
+int *virt_sfds;

/* for server */
bool done;
diff --git a/trace-record.c b/trace-record.c
index ef30e31..df9e786 100644
--- a/trace-record.c
+++ b/trace-record.c
@@ -80,6 +80,9 @@ static int sfd;
/* Max size to let a per cpu file get */
static int max_kb;

+struct tracecmd_output *virt_handle;
+static bool virt;
+
static int do_ptrace;

static int filter_task;
@@ -1576,6 +1579,9 @@ static int create_recorder(struct buffer_instance *instance, int cpu, int extrac
if (client_ports) {
connect_port(cpu);
recorder = tracecmd_create_recorder_fd(client_ports[cpu], cpu, recorder_flags);
+ } else if (virt_sfds) {
+ recorder = tracecmd_create_recorder_fd(virt_sfds[cpu], cpu,
+ recorder_flags);
} else {
file = get_temp_file(instance, cpu);
recorder = create_recorder_instance(instance, file, cpu);
@@ -1600,9 +1606,15 @@ static int create_recorder(struct buffer_instance *instance, int cpu, int extrac
exit(0);
}

-static void communicate_with_listener(int fd)
+static void communicate_with_listener_nw(int fd)
+{
+ if (tracecmd_msg_connect_to_server_nw(fd) < 0)
+ die("Cannot communicate with server");
+}
+
+static void communicate_with_listener_virt(int fd)
{
- if (tracecmd_msg_connect_to_server(fd) < 0)
+ if (tracecmd_msg_connect_to_server_virt(fd) < 0)
die("Cannot communicate with server");
}

@@ -1654,7 +1666,7 @@ static void setup_network(void)

freeaddrinfo(result);

- communicate_with_listener(sfd);
+ communicate_with_listener_nw(sfd);

/* Now create the handle through this socket */
handle = tracecmd_create_init_fd_glob(sfd, listed_events);
@@ -1663,6 +1675,21 @@ static void setup_network(void)
/* OK, we are all set, let'r rip! */
}

+static void setup_virtio(void)
+{
+ int fd;
+
+ fd = open(AGENT_CTL_PATH, O_RDWR);
+ if (fd < 0)
+ die("Cannot open %s", AGENT_CTL_PATH);
+
+ communicate_with_listener_virt(fd);
+
+ /* Now create the handle through this socket */
+ virt_handle = tracecmd_create_init_fd_glob(fd, listed_events);
+ tracecmd_msg_finish_sending_metadata(fd);
+}
+
static void finish_network(void)
{
tracecmd_msg_send_close_msg();
@@ -1670,6 +1697,13 @@ static void finish_network(void)
free(host);
}

+static void finish_virt(void)
+{
+ tracecmd_msg_send_close_msg();
+ free(virt_handle);
+ free(virt_sfds);
+}
+
static void start_threads(void)
{
struct buffer_instance *instance;
@@ -1677,6 +1711,8 @@ static void start_threads(void)

if (host)
setup_network();
+ else if (virt)
+ setup_virtio();

/* make a thread for every CPU we have */
pids = malloc_or_die(sizeof(*pids) * cpu_count * (buffers + 1));
@@ -1721,6 +1757,9 @@ static void record_data(char *date2ts, struct trace_seq *s)
if (host) {
finish_network();
return;
+ } else if (virt) {
+ finish_virt();
+ return;
}

if (latency)
@@ -2207,6 +2246,7 @@ static void record_all_events(void)
}

enum {
+ OPT_virt = 252,
OPT_nosplice = 253,
OPT_funcstack = 254,
OPT_date = 255,
@@ -2278,6 +2318,7 @@ void trace_record (int argc, char **argv)
{"date", no_argument, NULL, OPT_date},
{"func-stack", no_argument, NULL, OPT_funcstack},
{"nosplice", no_argument, NULL, OPT_nosplice},
+ {"virt", no_argument, NULL, OPT_virt},
{"help", no_argument, NULL, '?'},
{NULL, 0, NULL, 0}
};
@@ -2389,6 +2430,8 @@ void trace_record (int argc, char **argv)
case 'o':
if (host)
die("-o incompatible with -N");
+ if (virt)
+ die("-o incompatible with --virt");
if (!record && !extract)
die("start does not take output\n"
"Did you mean 'record'?");
@@ -2420,6 +2463,8 @@ void trace_record (int argc, char **argv)
case 'N':
if (!record)
die("-N only available with record");
+ if (virt)
+ die("-N incompatible with --virt");
if (output)
die("-N incompatible with -o");
host = optarg;
@@ -2432,6 +2477,8 @@ void trace_record (int argc, char **argv)
max_kb = atoi(optarg);
break;
case 't':
+ if (virt)
+ die("-t incompatible with --virt");
use_tcp = 1;
break;
case 'b':
@@ -2458,6 +2505,17 @@ void trace_record (int argc, char **argv)
case OPT_nosplice:
recorder_flags |= TRACECMD_RECORD_NOSPLICE;
break;
+ case OPT_virt:
+ if (!record)
+ die("--virt only available with record");
+ if (host)
+ die("--virt incompatible with -N");
+ if (output)
+ die("--virt incompatible with -o");
+ if (use_tcp)
+ die("--virt incompatible with -t");
+ virt = true;
+ break;
default:
usage(argv);
}
@@ -2533,6 +2591,8 @@ void trace_record (int argc, char **argv)
latency = 1;
if (host)
die("Network tracing not available with latency tracer plugins");
+ if (virt)
+ die("Virtio-trace not available with latency tracer plugins");
}
if (fset < 0 && (strcmp(plugin, "function") == 0 ||
strcmp(plugin, "function_graph") == 0))

2013-08-19 09:43:18

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: [RFC PATCH 10/11] trace-cmd: Add virt-server mode for a virtualization environment

Add the virt-server mode for a virtualization environment based on the listen
mode for networking. This mode works like client/server mode over TCP/UDP,
but it uses virtio-serial channel instead of IP network. Using networking for
collecting trace data of guests is generally high overhead caused by processing
of the network stack.

We use virtio-serial for collecting trace data of guests. virtio-serial is a
simple communication path between the guest and the host. Moreover,
since virtio-serial and ftrace can use splice(2), memory copying is not
occurred on the guests. Therefore, total overhead for collecting trace data
of the guests will be reduced. The implementation of guests will be shown
in another patch.

virt-server uses two kinds of virtio-serial I/Fs:
(1) agent-ctl-path(UNIX domain socket)
=> control path of an agent trace-cmd each guest
(2) trace-path-cpuX(named pipe)
=> trace data path each vcpu

Those I/Fs must be stored as follows:
(1) /tmp/trace-cmd/virt/agent-ctl-path
(2) /tmp/trace-cmd/virt/<guest domain>/trace-path-cpuX

If we run virt-server, agent-ctl-path I/F is automatically created because
virt-server operates as a server mode of UNIX domain socket. However,
trace-path-cpuX is not automatically created because we need to separate
trace data for each guests.

<How to set up>
1. Run virt-server on a host before booting guests
# trace-cmd virt-server

2. Make guest domain directory
# mkdir -p /tmp/trace-cmd/virt/<domain>
# chmod 710 /tmp/trace-cmd/virt/<domain>
# chgrp qemu /tmp/trace-cmd/virt/<domain>

3. Make FIFO on the host
# mkfifo /tmp/trace-cmd/virt/<domain>/trace-path-cpu{0,1,...,X}.{in,out}

4. Set up of virtio-serial pipe of a guest on the host
Add the following tags to domain XML files.
# virsh edit <domain>
<channel type='unix'>
<source mode='connect' path='/tmp/trace-cmd/virt/agent-ctl-path'/>
<target type='virtio' name='agent-ctl-path'/>
</channel>
<channel type='pipe'>
<source path='/tmp/trace-cmd/virt/<domain>/trace-path-cpu0'/>
<target type='virtio' name='trace-path-cpu0'/>
</channel>
... (cpu1, cpu2, ...)

5. Boot the guest
# virsh start <domain>

6. Check I/F of virtio-serial on the guest
# ls /dev/virtio-ports
...
agent-ctl-path
...
trace-path-cpu0
...

Next, the user will run trace-cmd with record --virt options or other options
for virtualization on the guest.

This patch adds only minimum features of virt-server as follows:
<Features>
- Add virt-server subcommand
- Create I/F directory(/tmp/trace-cmd/virt/)
- Use named pipe I/Fs of virtio-serial for trace data paths
- Use UNIX domain socket for connecting agents on guests
- Use splice(2) for collecting trace data of guests

<Restrictions>
- Use libvirt when we boot guests

Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
---
Documentation/trace-cmd-virt-server.1.txt | 89 ++++++
trace-cmd.c | 3
trace-cmd.h | 4
trace-listen.c | 439 ++++++++++++++++++++++++-----
trace-msg.c | 51 ++-
trace-recorder.c | 54 +++-
trace-usage.c | 10 +
7 files changed, 540 insertions(+), 110 deletions(-)
create mode 100644 Documentation/trace-cmd-virt-server.1.txt

diff --git a/Documentation/trace-cmd-virt-server.1.txt b/Documentation/trace-cmd-virt-server.1.txt
new file mode 100644
index 0000000..4168a04
--- /dev/null
+++ b/Documentation/trace-cmd-virt-server.1.txt
@@ -0,0 +1,89 @@
+TRACE-CMD-VIRT-SERVER(1)
+========================
+
+NAME
+----
+trace-cmd-virt-server - listen for incoming connection to record tracing of
+ guests' clients
+
+SYNOPSIS
+--------
+*trace-cmd virt-server ['OPTIONS']
+
+DESCRIPTION
+-----------
+The trace-cmd(1) virt-server sets up UNIX domain socket I/F for communicating
+with guests' clients that run 'trace-cmd-record(1)' with the *--virt* option.
+When a connection is made, and the guest's client sends data, it will create a
+file called 'trace.DOMAIN.dat'. Where DOMAIN is the name of the guest named
+by libvirt.
+
+OPTIONS
+-------
+*-D*::
+ This options causes trace-cmd listen to go into a daemon mode and run in
+ the background.
+
+*-d* 'dir'::
+ This option specifies a directory to write the data files into.
+
+*-o* 'filename'::
+ This option overrides the default 'trace' in the 'trace.DOMAIN.dat' that
+ is created when guest's client connects.
+
+*-l* 'filename'::
+ This option writes the output messages to a log file instead of standard output.
+
+SET UP
+------
+Here, an example is written as follows:
+
+1. Run virt-server on a host
+ # trace-cmd virt-server
+
+2. Make guest domain directory
+ # mkdir -p /tmp/trace-cmd/virt/<DOMAIN>
+ # chmod 710 /tmp/trace-cmd/virt/<DOMAIN>
+ # chgrp qemu /tmp/trace-cmd/virt/<DOMAIN>
+
+3. Make FIFO on the host
+ # mkfifo /tmp/trace-cmd/virt/<DOMAIN>/trace-path-cpu{0,1,...,X}.{in,out}
+
+4. Set up of virtio-serial pipe of a guest on the host
+ Add the following tags to domain XML files.
+ # virsh edit <guest domain>
+ <channel type='unix'>
+ <source mode='connect' path='/tmp/trace-cmd/virt/agent-ctl-path'/>
+ <target type='virtio' name='agent-ctl-path'/>
+ </channel>
+ <channel type='pipe'>
+ <source path='/tmp/trace-cmd/virt/<DOMAIN>/trace-path-cpu0'/>
+ <target type='virtio' name='trace-path-cpu0'/>
+ </channel>
+ ... (cpu1, cpu2, ...)
+
+5. Boot the guest
+ # virsh start <DOMAIN>
+
+6. Run the guest's client(see trace-cmd-record(1) with the *--virt* option)
+ # trace-cmd record -e sched* --virt
+
+SEE ALSO
+--------
+trace-cmd(1), trace-cmd-record(1), trace-cmd-report(1), trace-cmd-start(1),
+trace-cmd-stop(1), trace-cmd-extract(1), trace-cmd-reset(1),
+trace-cmd-split(1), trace-cmd-list(1)
+
+AUTHOR
+------
+Written by Yoshihiro YUNOMAE, <[email protected]>
+
+RESOURCES
+---------
+git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git
+
+COPYING
+-------
+Copyright \(C) 2013 Hitachi, Ltd. Free use of this software is granted under
+the terms of the GNU Public License (GPL).
+
diff --git a/trace-cmd.c b/trace-cmd.c
index e6f5918..45a5bb4 100644
--- a/trace-cmd.c
+++ b/trace-cmd.c
@@ -219,7 +219,8 @@ int main (int argc, char **argv)
} else if (strcmp(argv[1], "mem") == 0) {
trace_mem(argc, argv);
exit(0);
- } else if (strcmp(argv[1], "listen") == 0) {
+ } else if (strcmp(argv[1], "listen") == 0 ||
+ strcmp(argv[1], "virt-server") == 0) {
trace_listen(argc, argv);
exit(0);
} else if (strcmp(argv[1], "split") == 0) {
diff --git a/trace-cmd.h b/trace-cmd.h
index 5ae5313..b4c2267 100644
--- a/trace-cmd.h
+++ b/trace-cmd.h
@@ -242,6 +242,8 @@ struct tracecmd_recorder *tracecmd_create_recorder_maxkb(const char *file, int c
struct tracecmd_recorder *tracecmd_create_buffer_recorder_fd(int fd, int cpu, unsigned flags, const char *buffer);
struct tracecmd_recorder *tracecmd_create_buffer_recorder(const char *file, int cpu, unsigned flags, const char *buffer);
struct tracecmd_recorder *tracecmd_create_buffer_recorder_maxkb(const char *file, int cpu, unsigned flags, const char *buffer, int maxkb);
+struct tracecmd_recorder *tracecmd_create_recorder_virt(const char *file,
+ int cpu, int trace_fd);

int tracecmd_start_recording(struct tracecmd_recorder *recorder, unsigned long sleep);
void tracecmd_stop_recording(struct tracecmd_recorder *recorder);
@@ -255,7 +257,7 @@ int tracecmd_msg_finish_sending_metadata(int fd);
void tracecmd_msg_send_close_msg();

/* for server */
-int tracecmd_msg_set_connection(int fd);
+int tracecmd_msg_set_connection(int fd, bool nw);
int tracecmd_msg_initial_setting(int fd, int *cpus, int *pagesize);
int tracecmd_msg_send_port_array(int fd, int total_cpus, int *ports);
int tracecmd_msg_collect_metadata(int ifd, int ofd);
diff --git a/trace-listen.c b/trace-listen.c
index 3cec10c..3105903 100644
--- a/trace-listen.c
+++ b/trace-listen.c
@@ -23,9 +23,13 @@
#include <stdlib.h>
#include <string.h>
#include <getopt.h>
+#include <grp.h>
+#include <sys/stat.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/wait.h>
+#include <sys/epoll.h>
+#include <sys/un.h>
#include <netdb.h>
#include <unistd.h>
#include <fcntl.h>
@@ -46,15 +50,23 @@ static int debug;

static int backlog = 5;

-#define TEMP_FILE_STR "%s.%s:%s.cpu%d", output_file, host, port, cpu
-static char *get_temp_file(const char *host, const char *port, int cpu)
+#define TEMP_FILE_STR_NW "%s.%s:%s.cpu%d", output_file, host, port, cpu
+#define TEMP_FILE_STR_VIRT "%s.%s:%d.cpu%d", output_file, domain, virtpid, cpu
+static char *get_temp_file(const char *host, const char *port,
+ const char *domain, int virtpid, int cpu)
{
char *file = NULL;
int size;

- size = snprintf(file, 0, TEMP_FILE_STR);
- file = malloc_or_die(size + 1);
- sprintf(file, TEMP_FILE_STR);
+ if (host) {
+ size = snprintf(file, 0, TEMP_FILE_STR_NW);
+ file = malloc_or_die(size + 1);
+ sprintf(file, TEMP_FILE_STR_NW);
+ } else {
+ size = snprintf(file, 0, TEMP_FILE_STR_VIRT);
+ file = malloc_or_die(size + 1);
+ sprintf(file, TEMP_FILE_STR_VIRT);
+ }

return file;
}
@@ -77,16 +89,23 @@ static void signal_setup(int sig, sighandler_t handle)
sigaction(sig, &action, NULL);
}

-static void delete_temp_file(const char *host, const char *port, int cpu)
+static void delete_temp_file(const char *host, const char *port,
+ const char *domain, int virtpid, int cpu)
{
char file[MAX_PATH];

- snprintf(file, MAX_PATH, TEMP_FILE_STR);
+ if (host)
+ snprintf(file, MAX_PATH, TEMP_FILE_STR_NW);
+ else
+ snprintf(file, MAX_PATH, TEMP_FILE_STR_VIRT);
unlink(file);
}

+static struct tracecmd_recorder *recorder;
static void finish(int sig)
{
+ if (recorder)
+ tracecmd_stop_recording(recorder);
done = true;
}

@@ -156,7 +175,7 @@ static void process_udp_child(int sfd, const char *host, const char *port,

signal_setup(SIGUSR1, finish);

- tempfile = get_temp_file(host, port, cpu);
+ tempfile = get_temp_file(host, port, NULL, 0, cpu);
fd = open(tempfile, O_WRONLY | O_TRUNC | O_CREAT, 0644);
if (fd < 0)
pdie("creating %s", tempfile);
@@ -197,6 +216,28 @@ static void process_udp_child(int sfd, const char *host, const char *port,
exit(0);
}

+#define SLEEP_DEFAULT 1000
+
+static void process_virt_child(int fd, int cpu, int pagesize,
+ const char *domain, int virtpid)
+{
+ char *tempfile;
+
+ signal_setup(SIGUSR1, finish);
+ tempfile = get_temp_file(NULL, NULL, domain, virtpid, cpu);
+
+ recorder = tracecmd_create_recorder_virt(tempfile, cpu, fd);
+
+ do {
+ if (tracecmd_start_recording(recorder, SLEEP_DEFAULT) < 0)
+ break;
+ } while (!done);
+
+ tracecmd_free_recorder(recorder);
+ put_temp_file(tempfile);
+ exit(0);
+}
+
#define START_PORT_SEARCH 1500
#define MAX_PORT_SEARCH 6000

@@ -244,20 +285,37 @@ static int udp_bind_a_port(int start_port, int *sfd)
return num_port;
}

-static void fork_udp_reader(int sfd, const char *node, const char *port,
- int *pid, int cpu, int pagesize)
+static void fork_reader(int sfd, const char *node, const char *port,
+ int *pid, int cpu, int pagesize, const char *domain,
+ int virtpid)
{
*pid = fork();

if (*pid < 0)
- pdie("creating udp reader");
+ pdie("creating reader");

- if (!*pid)
- process_udp_child(sfd, node, port, cpu, pagesize);
+ if (!*pid) {
+ if (node)
+ process_udp_child(sfd, node, port, cpu, pagesize);
+ else
+ process_virt_child(sfd, cpu, pagesize, domain, virtpid);
+ }

close(sfd);
}

+static void fork_udp_reader(int sfd, const char *node, const char *port,
+ int *pid, int cpu, int pagesize)
+{
+ fork_reader(sfd, node, port, pid, cpu, pagesize, NULL, 0);
+}
+
+static void fork_virt_reader(int sfd, int *pid, int cpu, int pagesize,
+ const char *domain, int virtpid)
+{
+ fork_reader(sfd, NULL, NULL, pid, cpu, pagesize, domain, virtpid);
+}
+
static int open_udp(const char *node, const char *port, int *pid,
int cpu, int pagesize, int start_port)
{
@@ -273,10 +331,33 @@ static int open_udp(const char *node, const char *port, int *pid,
return num_port;
}

-static int communicate_with_client(int fd, int *cpus, int *pagesize)
+#define TRACE_CMD_DIR "/tmp/trace-cmd/"
+#define VIRT_DIR TRACE_CMD_DIR "virt/"
+#define VIRT_TRACE_CTL_SOCK VIRT_DIR "agent-ctl-path"
+#define TRACE_PATH_DOMAIN_CPU VIRT_DIR "%s/trace-path-cpu%d.out"
+
+static int open_virtio_serial_pipe(int *pid, int cpu, int pagesize,
+ const char *domain, int virtpid)
+{
+ char buf[PATH_MAX];
+ int fd;
+
+ snprintf(buf, PATH_MAX, TRACE_PATH_DOMAIN_CPU, domain, cpu);
+ fd = open(buf, O_RDONLY | O_NONBLOCK);
+ if (fd < 0) {
+ warning("open %s", buf);
+ return fd;
+ }
+
+ fork_virt_reader(fd, pid, cpu, pagesize, domain, virtpid);
+
+ return fd;
+}
+
+static int communicate_with_client(int fd, int *cpus, int *pagesize, bool nw)
{
/* Let the client know what we are */
- if (tracecmd_msg_set_connection(fd) < 0)
+ if (tracecmd_msg_set_connection(fd, nw) < 0)
return -1;

/* read the CPU count, the page size, and options */
@@ -289,12 +370,26 @@ static int communicate_with_client(int fd, int *cpus, int *pagesize)
return 0;
}

-static int create_client_file(const char *node, const char *port)
+static int communicate_with_client_nw(int fd, int *cpus, int *pagesize)
+{
+ return communicate_with_client(fd, cpus, pagesize, true);
+}
+
+static int communicate_with_client_virt(int fd, int *cpus, int *pagesize)
+{
+ return communicate_with_client(fd, cpus, pagesize, false);
+}
+
+static int create_client_file(const char *node, const char *port,
+ const char *domain, int pid)
{
char buf[BUFSIZ];
int ofd;

- snprintf(buf, BUFSIZ, "%s.%s:%s.dat", output_file, node, port);
+ if (node)
+ snprintf(buf, BUFSIZ, "%s.%s:%s.dat", output_file, node, port);
+ else
+ snprintf(buf, BUFSIZ, "%s.%s:%d.dat", output_file, domain, pid);

ofd = open(buf, O_RDWR | O_CREAT | O_TRUNC, 0644);
if (ofd < 0)
@@ -303,7 +398,8 @@ static int create_client_file(const char *node, const char *port)
}

static void destroy_all_readers(int cpus, int *pid_array, const char *node,
- const char *port)
+ const char *port, const char *domain,
+ int virtpid)
{
int cpu;

@@ -311,41 +407,49 @@ static void destroy_all_readers(int cpus, int *pid_array, const char *node,
if (pid_array[cpu] > 0) {
kill(pid_array[cpu], SIGKILL);
waitpid(pid_array[cpu], NULL, 0);
- delete_temp_file(node, port, cpu);
+ delete_temp_file(node, port, domain, virtpid, cpu);
pid_array[cpu] = 0;
}
}
}

static int *create_all_readers(int cpus, const char *node, const char *port,
- int pagesize, int fd)
+ const char *domain, int virtpid, int pagesize,
+ int fd)
{
- int *port_array;
+ int *port_array = NULL;
int *pid_array;
int start_port;
int udp_port;
int cpu;
int pid;

- port_array = malloc_or_die(sizeof(int) * cpus);
+ if (node) {
+ port_array = malloc_or_die(sizeof(int) * cpus);
+ start_port = START_PORT_SEARCH;
+ }
pid_array = malloc_or_die(sizeof(int) * cpus);
memset(pid_array, 0, sizeof(int) * cpus);

- start_port = START_PORT_SEARCH;
-
- /* Now create a UDP port for each CPU */
+ /* Now create a reader for each CPU */
for (cpu = 0; cpu < cpus; cpu++) {
- udp_port = open_udp(node, port, &pid, cpu,
- pagesize, start_port);
- if (udp_port < 0)
- goto out_free;
- port_array[cpu] = udp_port;
+ if (node) {
+ udp_port = open_udp(node, port, &pid, cpu,
+ pagesize, start_port);
+ if (udp_port < 0)
+ goto out_free;
+ port_array[cpu] = udp_port;
+ /*
+ * due to some bugging finding ports,
+ * force search after last port
+ */
+ start_port = udp_port + 1;
+ } else {
+ if (open_virtio_serial_pipe(&pid, cpu, pagesize,
+ domain, virtpid) < 0)
+ goto out_free;
+ }
pid_array[cpu] = pid;
- /*
- * due to some bugging finding ports,
- * force search after last port
- */
- start_port = udp_port + 1;
}

/* send set of port numbers to the client */
@@ -355,7 +459,7 @@ static int *create_all_readers(int cpus, const char *node, const char *port,
return pid_array;

out_free:
- destroy_all_readers(cpus, pid_array, node, port);
+ destroy_all_readers(cpus, pid_array, node, port, domain, virtpid);
return NULL;
}

@@ -370,7 +474,7 @@ static void stop_all_readers(int cpus, int *pid_array)
}

static void put_together_file(int cpus, int ofd, const char *node,
- const char *port)
+ const char *port, const char *domain, int virtpid)
{
char **temp_files;
int cpu;
@@ -379,25 +483,32 @@ static void put_together_file(int cpus, int ofd, const char *node,
temp_files = malloc_or_die(sizeof(*temp_files) * cpus);

for (cpu = 0; cpu < cpus; cpu++)
- temp_files[cpu] = get_temp_file(node, port, cpu);
+ temp_files[cpu] = get_temp_file(node, port, domain,
+ virtpid, cpu);

tracecmd_attach_cpu_data_fd(ofd, cpus, temp_files);
free(temp_files);
}

-static void process_client(const char *node, const char *port, int fd)
+static void process_client(const char *node, const char *port,
+ const char *domain, int virtpid, int fd)
{
int *pid_array;
int pagesize;
int cpus;
int ofd;

- if (communicate_with_client(fd, &cpus, &pagesize) < 0)
- return;
-
- ofd = create_client_file(node, port);
+ if (node) {
+ if (communicate_with_client_nw(fd, &cpus, &pagesize) < 0)
+ return;
+ } else {
+ if (communicate_with_client_virt(fd, &cpus, &pagesize) < 0)
+ return;
+ }

- pid_array = create_all_readers(cpus, node, port, pagesize, fd);
+ ofd = create_client_file(node, port, domain, virtpid);
+ pid_array = create_all_readers(cpus, node, port, domain, virtpid,
+ pagesize, fd);
if (!pid_array)
return;

@@ -413,9 +524,22 @@ static void process_client(const char *node, const char *port, int fd)
/* wait a little to have the readers clean up */
sleep(1);

- put_together_file(cpus, ofd, node, port);
+ put_together_file(cpus, ofd, node, port, domain, virtpid);
+
+ destroy_all_readers(cpus, pid_array, node, port, domain, virtpid);
+}
+
+static void process_client_nw(const char *node, const char *port, int fd)
+{
+ process_client(node, port, NULL, 0, fd);
+}

- destroy_all_readers(cpus, pid_array, node, port);
+static void process_client_virt(const char *domain, int virtpid, int fd)
+{
+ /* keep connection to qemu if clients on guests finish operation */
+ do {
+ process_client(NULL, NULL, domain, virtpid, fd);
+ } while (!done);
}

static int do_fork(int cfd)
@@ -442,8 +566,9 @@ static int do_fork(int cfd)
return 0;
}

-static int do_connection(int cfd, struct sockaddr_storage *peer_addr,
- socklen_t peer_addr_len)
+static int do_connection(int cfd, struct sockaddr *peer_addr,
+ socklen_t *peer_addr_len, const char *domain,
+ int virtpid)
{
char host[NI_MAXHOST], service[NI_MAXSERV];
int s;
@@ -453,21 +578,22 @@ static int do_connection(int cfd, struct sockaddr_storage *peer_addr,
if (ret)
return ret;

- s = getnameinfo((struct sockaddr *)peer_addr, peer_addr_len,
- host, NI_MAXHOST,
- service, NI_MAXSERV, NI_NUMERICSERV);
-
- if (s == 0)
- plog("Connected with %s:%s\n",
- host, service);
- else {
- plog("Error with getnameinfo: %s\n",
- gai_strerror(s));
- close(cfd);
- return -1;
- }
-
- process_client(host, service, cfd);
+ if (peer_addr) {
+ s = getnameinfo(peer_addr, *peer_addr_len, host, NI_MAXHOST,
+ service, NI_MAXSERV, NI_NUMERICSERV);
+
+ if (s == 0)
+ plog("Connected with %s:%s\n",
+ host, service);
+ else {
+ plog("Error with getnameinfo: %s\n",
+ gai_strerror(s));
+ close(cfd);
+ return -1;
+ }
+ process_client_nw(host, service, cfd);
+ } else
+ process_client_virt(domain, virtpid, cfd);

close(cfd);

@@ -477,6 +603,77 @@ static int do_connection(int cfd, struct sockaddr_storage *peer_addr,
return 0;
}

+static int do_connection_nw(int cfd, struct sockaddr *addr, socklen_t *addrlen)
+{
+ return do_connection(cfd, addr, addrlen, NULL, 0);
+}
+
+#define LIBVIRT_DOMAIN_PATH "/var/run/libvirt/qemu/"
+
+/* We can convert pid to domain name of a guest when we use libvirt. */
+static char *get_guest_domain_from_pid(int pid)
+{
+ struct dirent *dirent;
+ char file_name[NAME_MAX];
+ char *file_name_ret, *domain;
+ char buf[BUFSIZ];
+ DIR *dir;
+ size_t doml;
+ int fd;
+
+ dir = opendir(LIBVIRT_DOMAIN_PATH);
+ if (!dir) {
+ if (errno == ENOENT)
+ warning("Only support for using libvirt");
+ return NULL;
+ }
+
+ for (dirent = readdir(dir); dirent != NULL; dirent = readdir(dir)) {
+ snprintf(file_name, NAME_MAX, LIBVIRT_DOMAIN_PATH"%s",
+ dirent->d_name);
+ file_name_ret = strstr(file_name, ".pid");
+ if (file_name_ret) {
+ fd = open(file_name, O_RDONLY);
+ if (fd < 0)
+ return NULL;
+ if (read(fd, buf, BUFSIZ) < 0)
+ return NULL;
+
+ if (pid == atoi(buf)) {
+ /* not include /var/run/libvirt/qemu */
+ doml = (size_t)(file_name_ret - file_name)
+ - strlen(LIBVIRT_DOMAIN_PATH);
+ domain = strndup(file_name +
+ strlen(LIBVIRT_DOMAIN_PATH),
+ doml);
+ plog("start %s\n", domain);
+ return domain;
+ }
+ }
+ }
+
+ return NULL;
+}
+
+static int do_connection_virt(int cfd)
+{
+ struct ucred cr;
+ socklen_t cl;
+ int ret;
+ char *domain;
+
+ cl = sizeof(cr);
+ ret = getsockopt(cfd, SOL_SOCKET, SO_PEERCRED, &cr, &cl);
+ if (ret < 0)
+ return ret;
+
+ domain = get_guest_domain_from_pid(cr.pid);
+ if (!domain)
+ return -1;
+
+ return do_connection(cfd, NULL, NULL, domain, cr.pid);
+}
+
static int *client_pids;
static int saved_pids;
static int size_pids;
@@ -521,12 +718,11 @@ static void remove_process(int pid)

static void kill_clients(void)
{
- int status;
int i;

for (i = 0; i < saved_pids; i++) {
kill(client_pids[i], SIGINT);
- waitpid(client_pids[i], &status, 0);
+ waitpid(client_pids[i], NULL, 0);
}

saved_pids = 0;
@@ -545,31 +741,51 @@ static void clean_up(int sig)
} while (ret > 0);
}

-static void do_accept_loop(int sfd)
+static void do_accept_loop(int sfd, bool nw, struct sockaddr *addr,
+ socklen_t *addrlen)
{
- struct sockaddr_storage peer_addr;
- socklen_t peer_addr_len;
int cfd, pid;

- peer_addr_len = sizeof(peer_addr);
-
do {
- cfd = accept(sfd, (struct sockaddr *)&peer_addr,
- &peer_addr_len);
+ cfd = accept(sfd, addr, addrlen);
printf("connected!\n");
if (cfd < 0 && errno == EINTR)
continue;
if (cfd < 0)
pdie("connecting");

- pid = do_connection(cfd, &peer_addr, peer_addr_len);
+ if (nw)
+ pid = do_connection_nw(cfd, addr, addrlen);
+ else
+ pid = do_connection_virt(cfd);
if (pid > 0)
add_process(pid);

} while (!done);
}

-static void do_listen(char *port)
+static void do_accept_loop_nw(int sfd)
+{
+ struct sockaddr_storage peer_addr;
+ socklen_t peer_addr_len;
+
+ peer_addr_len = sizeof(peer_addr);
+
+ do_accept_loop(sfd, true, (struct sockaddr *)&peer_addr,
+ &peer_addr_len);
+}
+
+static void do_accept_loop_virt(int sfd)
+{
+ struct sockaddr_un un_addr;
+ socklen_t un_addrlen;
+
+ un_addrlen = sizeof(un_addr);
+
+ do_accept_loop(sfd, false, (struct sockaddr *)&un_addr, &un_addrlen);
+}
+
+static void do_listen_nw(char *port)
{
struct addrinfo hints;
struct addrinfo *result, *rp;
@@ -607,8 +823,64 @@ static void do_listen(char *port)
if (listen(sfd, backlog) < 0)
pdie("listen");

- do_accept_loop(sfd);
+ do_accept_loop_nw(sfd);
+
+ kill_clients();
+}
+
+static void make_virt_if_dir(void)
+{
+ struct group *group;
+
+ if (mkdir(TRACE_CMD_DIR, 0710) < 0) {
+ if (errno != EEXIST)
+ pdie("mkdir %s", TRACE_CMD_DIR);
+ }
+ /* QEMU operates as qemu:qemu */
+ chmod(TRACE_CMD_DIR, 0710);
+ group = getgrnam("qemu");
+ if (chown(TRACE_CMD_DIR, -1, group->gr_gid) < 0)
+ pdie("chown %s", TRACE_CMD_DIR);
+
+ if (mkdir(VIRT_DIR, 0710) < 0) {
+ if (errno != EEXIST)
+ pdie("mkdir %s", VIRT_DIR);
+ }
+ chmod(VIRT_DIR, 0710);
+ if (chown(VIRT_DIR, -1, group->gr_gid) < 0)
+ pdie("chown %s", VIRT_DIR);
+}
+
+static void do_listen_virt(void)
+{
+ struct sockaddr_un un_server;
+ struct group *group;
+ socklen_t slen;
+ int sfd;
+
+ make_virt_if_dir();
+
+ slen = sizeof(un_server);
+ sfd = socket(AF_UNIX, SOCK_STREAM, 0);
+ if (sfd < 0)
+ pdie("socket");
+
+ un_server.sun_family = AF_UNIX;
+ snprintf(un_server.sun_path, PATH_MAX, VIRT_TRACE_CTL_SOCK);
+
+ if (bind(sfd, (struct sockaddr *)&un_server, slen) < 0)
+ pdie("bind");
+ chmod(VIRT_TRACE_CTL_SOCK, 0660);
+ group = getgrnam("qemu");
+ if (chown(VIRT_TRACE_CTL_SOCK, -1, group->gr_gid) < 0)
+ pdie("fchown %s", VIRT_TRACE_CTL_SOCK);
+
+ if (listen(sfd, backlog) < 0)
+ pdie("listen");
+
+ do_accept_loop_virt(sfd);

+ unlink(VIRT_TRACE_CTL_SOCK);
kill_clients();
}

@@ -628,11 +900,17 @@ void trace_listen(int argc, char **argv)
char *port = NULL;
int daemon = 0;
int c;
+ int nw = 0;
+ int virt = 0;

if (argc < 2)
usage(argv);

- if (strcmp(argv[1], "listen") != 0)
+ if ((nw = (strcmp(argv[1], "listen") == 0)))
+ ; /* do nothing */
+ else if ((virt = (strcmp(argv[1], "virt-server") == 0)))
+ ; /* do nothing */
+ else
usage(argv);

for (;;) {
@@ -653,6 +931,8 @@ void trace_listen(int argc, char **argv)
usage(argv);
break;
case 'p':
+ if (virt)
+ die("-p only available with listen");
port = optarg;
break;
case 'd':
@@ -675,7 +955,7 @@ void trace_listen(int argc, char **argv)
}
}

- if (!port)
+ if (!port && nw)
usage(argv);

if ((argc - optind) >= 2)
@@ -703,7 +983,10 @@ void trace_listen(int argc, char **argv)
signal_setup(SIGINT, finish);
signal_setup(SIGTERM, finish);

- do_listen(port);
+ if (nw)
+ do_listen_nw(port);
+ else
+ do_listen_virt();

return;
}
diff --git a/trace-msg.c b/trace-msg.c
index 36117cd..251e99c 100644
--- a/trace-msg.c
+++ b/trace-msg.c
@@ -249,11 +249,13 @@ static int make_rinit(struct tracecmd_msg *msg)

msg->data.rinit.cpus = htonl(cpu_count);

- for (i = 0; i < cpu_count; i++) {
- /* + rrqports->cpus or rrqports->port_array[i] */
- offset += sizeof(be32);
- port = htonl(port_array[i]);
- bufcpy(msg, offset, &port, sizeof(be32) * cpu_count);
+ if (port_array) {
+ for (i = 0; i < cpu_count; i++) {
+ /* + rrqports->cpus or rrqports->port_array[i] */
+ offset += sizeof(be32);
+ port = htonl(port_array[i]);
+ bufcpy(msg, offset, &port, sizeof(be32) * cpu_count);
+ }
}

return 0;
@@ -565,22 +567,41 @@ static void error_operation_for_server(struct tracecmd_msg *msg)
warning("Message: cmd=%d size=%d\n", cmd, ntohl(msg->size));
}

-int tracecmd_msg_set_connection(int fd)
+int tracecmd_msg_set_connection(int fd, bool nw)
{
struct tracecmd_msg *msg;
- char buf[TRACECMD_MSG_MAX_LEN];
+ char buf[TRACECMD_MSG_MAX_LEN] = {};
u32 cmd;
int ret;

/* wait for connection msg by a client first */
- ret = tracecmd_msg_recv_wait(fd, buf, &msg);
- if (ret < 0) {
- if (ret == -ETIMEDOUT)
- /* network connection will be started soon */
- warning("No connection message");
- else
- warning("Disconnect");
- return ret;
+ if (nw) {
+ ret = tracecmd_msg_recv_wait(fd, buf, &msg);
+ if (ret < 0) {
+ if (ret == -ETIMEDOUT)
+ /* network connection will be started soon */
+ warning("No connection message");
+ else
+ warning("Disconnect");
+ return ret;
+ }
+ } else {
+ /*
+ * If a client uses virtio-serial, a connection message will
+ * not be sent immediately after accept(). connect() is called
+ * in QEMU, so the client can send the connection message
+ * after guest boots. Therefore, the virt-server patiently
+ * waits for the connection request of a client.
+ */
+ ret = tracecmd_msg_recv(fd, buf);
+ if (ret < 0) {
+ if (!buf[0]) {
+ /* No data means QEMU has already died. */
+ close(fd);
+ die("Connection refuesd");
+ }
+ return -ENOMSG;
+ }
}

msg = (struct tracecmd_msg *)buf;
diff --git a/trace-recorder.c b/trace-recorder.c
index 520d486..8169dc3 100644
--- a/trace-recorder.c
+++ b/trace-recorder.c
@@ -149,19 +149,23 @@ tracecmd_create_buffer_recorder_fd2(int fd, int fd2, int cpu, unsigned flags,
recorder->fd1 = fd;
recorder->fd2 = fd2;

- path = malloc_or_die(strlen(buffer) + 40);
- if (!path)
- goto out_free;
-
- if (flags & TRACECMD_RECORD_SNAPSHOT)
- sprintf(path, "%s/per_cpu/cpu%d/snapshot_raw", buffer, cpu);
- else
- sprintf(path, "%s/per_cpu/cpu%d/trace_pipe_raw", buffer, cpu);
- recorder->trace_fd = open(path, O_RDONLY);
- if (recorder->trace_fd < 0)
- goto out_free;
-
- free(path);
+ if (buffer) {
+ path = malloc_or_die(strlen(buffer) + 40);
+ if (!path)
+ goto out_free;
+
+ if (flags & TRACECMD_RECORD_SNAPSHOT)
+ sprintf(path, "%s/per_cpu/cpu%d/snapshot_raw",
+ buffer, cpu);
+ else
+ sprintf(path, "%s/per_cpu/cpu%d/trace_pipe_raw",
+ buffer, cpu);
+ recorder->trace_fd = open(path, O_RDONLY);
+ if (recorder->trace_fd < 0)
+ goto out_free;
+
+ free(path);
+ }

if ((recorder->flags & TRACECMD_RECORD_NOSPLICE) == 0) {
ret = pipe(recorder->brass);
@@ -184,8 +188,9 @@ tracecmd_create_buffer_recorder_fd(int fd, int cpu, unsigned flags, const char *
return tracecmd_create_buffer_recorder_fd2(fd, -1, cpu, flags, buffer, 0);
}

-struct tracecmd_recorder *
-tracecmd_create_buffer_recorder(const char *file, int cpu, unsigned flags, const char *buffer)
+static struct tracecmd_recorder *
+__tracecmd_create_buffer_recorder(const char *file, int cpu, unsigned flags,
+ const char *buffer)
{
struct tracecmd_recorder *recorder;
int fd;
@@ -248,6 +253,25 @@ tracecmd_create_buffer_recorder_maxkb(const char *file, int cpu, unsigned flags,
goto out;
}

+struct tracecmd_recorder *
+tracecmd_create_buffer_recorder(const char *file, int cpu, unsigned flags,
+ const char *buffer)
+{
+ return __tracecmd_create_buffer_recorder(file, cpu, flags, buffer);
+}
+
+struct tracecmd_recorder *
+tracecmd_create_recorder_virt(const char *file, int cpu, int trace_fd)
+{
+ struct tracecmd_recorder *recorder;
+
+ recorder = __tracecmd_create_buffer_recorder(file, cpu, 0, NULL);
+ if (recorder)
+ recorder->trace_fd = trace_fd;
+
+ return recorder;
+}
+
struct tracecmd_recorder *tracecmd_create_recorder_fd(int fd, int cpu, unsigned flags)
{
char *tracing;
diff --git a/trace-usage.c b/trace-usage.c
index b8f26e6..e6a239f 100644
--- a/trace-usage.c
+++ b/trace-usage.c
@@ -153,6 +153,16 @@ static struct usage_help usage_help[] = {
" -l logfile to write messages to.\n"
},
{
+ "virt-server",
+ "listen on a virtio-serial for trace clients",
+ " %s virt-server [-o file][-d dir][-l logfile]\n"
+ " Creates a socket to listen for clients.\n"
+ " -D create it in daemon mode.\n"
+ " -o file name to use for clients.\n"
+ " -d diretory to store client files.\n"
+ " -l logfile to write messages to.\n"
+ },
+ {
"list",
"list the available events, plugins or options",
" %s list [-e][-t][-o][-f [regex]]\n"

2013-08-19 09:42:20

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: [RFC PATCH 05/11] [CLEANUP] trace-cmd: Split out the connect waiting loop from do_listen()

Split out the connect waiting loop from do_listen() for avoiding duplicate codes
between listen mode and virt-server mode.

Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
---
trace-listen.c | 45 ++++++++++++++++++++++++++-------------------
1 file changed, 26 insertions(+), 19 deletions(-)

diff --git a/trace-listen.c b/trace-listen.c
index 6c1bcac..c741fa4 100644
--- a/trace-listen.c
+++ b/trace-listen.c
@@ -580,14 +580,35 @@ static void clean_up(int sig)
} while (ret > 0);
}

+static void do_accept_loop(int sfd)
+{
+ struct sockaddr_storage peer_addr;
+ socklen_t peer_addr_len;
+ int cfd, pid;
+
+ peer_addr_len = sizeof(peer_addr);
+
+ do {
+ cfd = accept(sfd, (struct sockaddr *)&peer_addr,
+ &peer_addr_len);
+ printf("connected!\n");
+ if (cfd < 0 && errno == EINTR)
+ continue;
+ if (cfd < 0)
+ pdie("connecting");
+
+ pid = do_connection(cfd, &peer_addr, peer_addr_len);
+ if (pid > 0)
+ add_process(pid);
+
+ } while (!done);
+}
+
static void do_listen(char *port)
{
struct addrinfo hints;
struct addrinfo *result, *rp;
- int sfd, s, cfd;
- struct sockaddr_storage peer_addr;
- socklen_t peer_addr_len;
- int pid;
+ int sfd, s;

if (!debug)
signal_setup(SIGCHLD, clean_up);
@@ -621,21 +642,7 @@ static void do_listen(char *port)
if (listen(sfd, backlog) < 0)
pdie("listen");

- peer_addr_len = sizeof(peer_addr);
-
- do {
- cfd = accept(sfd, (struct sockaddr *)&peer_addr, &peer_addr_len);
- printf("connected!\n");
- if (cfd < 0 && errno == EINTR)
- continue;
- if (cfd < 0)
- pdie("connecting");
-
- pid = do_connection(cfd, &peer_addr, peer_addr_len);
- if (pid > 0)
- add_process(pid);
-
- } while (!done);
+ do_accept_loop(sfd);

kill_clients();
}

2013-08-19 09:44:12

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: [RFC PATCH 06/11] [CLEANUP] trace-cmd: Split out the communication with client from process_client()

Split out the communication with client from process_client() for avoiding
duplicate codes between listen mode and virt-server mode.

Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
---
trace-listen.c | 163 ++++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 116 insertions(+), 47 deletions(-)

diff --git a/trace-listen.c b/trace-listen.c
index c741fa4..f29dd35 100644
--- a/trace-listen.c
+++ b/trace-listen.c
@@ -283,22 +283,12 @@ static int open_udp(const char *node, const char *port, int *pid,
return num_port;
}

-static void process_client(const char *node, const char *port, int fd)
+static int communicate_with_client(int fd, int *cpus, int *pagesize)
{
- char **temp_files;
char buf[BUFSIZ];
char *option;
- int *port_array;
- int *pid_array;
- int pagesize;
- int start_port;
- int udp_port;
int options;
int size;
- int cpus;
- int cpu;
- int pid;
- int ofd;
int n, s, t, i;

/* Let the client know what we are */
@@ -308,31 +298,31 @@ static void process_client(const char *node, const char *port, int fd)
n = read_string(fd, buf, BUFSIZ);
if (n == BUFSIZ)
/** ERROR **/
- return;
+ return -1;

- cpus = atoi(buf);
+ *cpus = atoi(buf);

- plog("cpus=%d\n", cpus);
- if (cpus < 0)
- return;
+ plog("cpus=%d\n", *cpus);
+ if (*cpus < 0)
+ return -1;

/* next read the page size */
n = read_string(fd, buf, BUFSIZ);
if (n == BUFSIZ)
/** ERROR **/
- return;
+ return -1;

- pagesize = atoi(buf);
+ *pagesize = atoi(buf);

- plog("pagesize=%d\n", pagesize);
- if (pagesize <= 0)
- return;
+ plog("pagesize=%d\n", *pagesize);
+ if (*pagesize <= 0)
+ return -1;

/* Now the number of options */
n = read_string(fd, buf, BUFSIZ);
if (n == BUFSIZ)
/** ERROR **/
- return;
+ return -1;

options = atoi(buf);

@@ -341,18 +331,18 @@ static void process_client(const char *node, const char *port, int fd)
n = read_string(fd, buf, BUFSIZ);
if (n == BUFSIZ)
/** ERROR **/
- return;
+ return -1;
size = atoi(buf);
/* prevent a client from killing us */
if (size > MAX_OPTION_SIZE)
- return;
+ return -1;
option = malloc_or_die(size);
do {
t = size;
s = 0;
s = read(fd, option+s, t);
if (s <= 0)
- return;
+ return -1;
t -= s;
s = size - t;
} while (t);
@@ -361,18 +351,53 @@ static void process_client(const char *node, const char *port, int fd)
free(option);
/* do we understand this option? */
if (!s)
- return;
+ return -1;
}

if (use_tcp)
plog("Using TCP for live connection\n");

- /* Create the client file */
+ return 0;
+}
+
+static int create_client_file(const char *node, const char *port)
+{
+ char buf[BUFSIZ];
+ int ofd;
+
snprintf(buf, BUFSIZ, "%s.%s:%s.dat", output_file, node, port);

ofd = open(buf, O_RDWR | O_CREAT | O_TRUNC, 0644);
if (ofd < 0)
pdie("Can not create file %s", buf);
+ return ofd;
+}
+
+static void destroy_all_readers(int cpus, int *pid_array, const char *node,
+ const char *port)
+{
+ int cpu;
+
+ for (cpu = 0; cpu < cpus; cpu++) {
+ if (pid_array[cpu] > 0) {
+ kill(pid_array[cpu], SIGKILL);
+ waitpid(pid_array[cpu], NULL, 0);
+ delete_temp_file(node, port, cpu);
+ pid_array[cpu] = 0;
+ }
+ }
+}
+
+static int *create_all_readers(int cpus, const char *node, const char *port,
+ int pagesize, int fd)
+{
+ char buf[BUFSIZ];
+ int *port_array;
+ int *pid_array;
+ int start_port;
+ int udp_port;
+ int cpu;
+ int pid;

port_array = malloc_or_die(sizeof(int) * cpus);
pid_array = malloc_or_die(sizeof(int) * cpus);
@@ -382,13 +407,17 @@ static void process_client(const char *node, const char *port, int fd)

/* Now create a UDP port for each CPU */
for (cpu = 0; cpu < cpus; cpu++) {
- udp_port = open_udp(node, port, &pid, cpu, pagesize, start_port);
+ udp_port = open_udp(node, port, &pid, cpu,
+ pagesize, start_port);
if (udp_port < 0)
goto out_free;
port_array[cpu] = udp_port;
pid_array[cpu] = pid;
- /* due to some bugging finding ports, force search after last port */
- start_port = udp_port+1;
+ /*
+ * due to some bugging finding ports,
+ * force search after last port
+ */
+ start_port = udp_port + 1;
}

/* send the client a comma deliminated set of port numbers */
@@ -400,9 +429,20 @@ static void process_client(const char *node, const char *port, int fd)
/* end with null terminator */
write(fd, "\0", 1);

- /* Now we are ready to start reading data from the client */
+ return pid_array;
+
+ out_free:
+ destroy_all_readers(cpus, pid_array, node, port);
+ return NULL;
+}
+
+static void collect_metadata_from_client(int ifd, int ofd)
+{
+ char buf[BUFSIZ];
+ int n, s, t;
+
do {
- n = read(fd, buf, BUFSIZ);
+ n = read(ifd, buf, BUFSIZ);
if (n < 0) {
if (errno == EINTR)
continue;
@@ -411,7 +451,7 @@ static void process_client(const char *node, const char *port, int fd)
t = n;
s = 0;
do {
- s = write(ofd, buf+s, t);
+ s = write(ofd, buf + s, t);
if (s < 0) {
if (errno == EINTR)
break;
@@ -421,18 +461,23 @@ static void process_client(const char *node, const char *port, int fd)
s = n - t;
} while (t);
} while (n > 0 && !done);
+}

- /* wait a little to let our readers finish reading */
- sleep(1);
+static void stop_all_readers(int cpus, int *pid_array)
+{
+ int cpu;

- /* stop our readers */
for (cpu = 0; cpu < cpus; cpu++) {
if (pid_array[cpu] > 0)
kill(pid_array[cpu], SIGUSR1);
}
+}

- /* wait a little to have the readers clean up */
- sleep(1);
+static void put_together_file(int cpus, int ofd, const char *node,
+ const char *port)
+{
+ char **temp_files;
+ int cpu;

/* Now put together the file */
temp_files = malloc_or_die(sizeof(*temp_files) * cpus);
@@ -441,16 +486,40 @@ static void process_client(const char *node, const char *port, int fd)
temp_files[cpu] = get_temp_file(node, port, cpu);

tracecmd_attach_cpu_data_fd(ofd, cpus, temp_files);
+ free(temp_files);
+}

- out_free:
- for (cpu = 0; cpu < cpus; cpu++) {
- if (pid_array[cpu] > 0) {
- kill(pid_array[cpu], SIGKILL);
- waitpid(pid_array[cpu], NULL, 0);
- delete_temp_file(node, port, cpu);
- pid_array[cpu] = 0;
- }
- }
+static void process_client(const char *node, const char *port, int fd)
+{
+ int *pid_array;
+ int pagesize;
+ int cpus;
+ int ofd;
+
+ if (communicate_with_client(fd, &cpus, &pagesize) < 0)
+ return;
+
+ ofd = create_client_file(node, port);
+
+ pid_array = create_all_readers(cpus, node, port, pagesize, fd);
+ if (!pid_array)
+ return;
+
+ /* Now we are ready to start reading data from the client */
+ collect_metadata_from_client(fd, ofd);
+
+ /* wait a little to let our readers finish reading */
+ sleep(1);
+
+ /* stop our readers */
+ stop_all_readers(cpus, pid_array);
+
+ /* wait a little to have the readers clean up */
+ sleep(1);
+
+ put_together_file(cpus, ofd, node, port);
+
+ destroy_all_readers(cpus, pid_array, node, port);
}

static int do_fork(int cfd)

2013-08-19 09:44:32

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: [RFC PATCH 04/11] [CLEANUP] trace-cmd: Split out the communication with listener from setup_network()

Split out the communication with listener from the setup_network() for avoiding
duplicate codes between listen mode and virt-server mode.

Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
---
trace-record.c | 119 +++++++++++++++++++++++++++++---------------------------
1 file changed, 62 insertions(+), 57 deletions(-)

diff --git a/trace-record.c b/trace-record.c
index ac9f70a..6a096bd 100644
--- a/trace-record.c
+++ b/trace-record.c
@@ -1607,59 +1607,13 @@ static int create_recorder(struct buffer_instance *instance, int cpu, int extrac
exit(0);
}

-static void setup_network(void)
+static void communicate_with_listener(int fd)
{
- struct tracecmd_output *handle;
- struct addrinfo hints;
- struct addrinfo *result, *rp;
- int sfd, s;
- ssize_t n;
char buf[BUFSIZ];
- char *server;
- char *port;
- char *p;
- int cpu;
- int i;
-
- if (!strchr(host, ':')) {
- server = strdup("localhost");
- if (!server)
- die("alloctating server");
- port = host;
- host = server;
- } else {
- host = strdup(host);
- if (!host)
- die("alloctating server");
- server = strtok_r(host, ":", &p);
- port = strtok_r(NULL, ":", &p);
- }
-
- memset(&hints, 0, sizeof(hints));
- hints.ai_family = AF_UNSPEC;
- hints.ai_socktype = SOCK_STREAM;
-
- s = getaddrinfo(server, port, &hints, &result);
- if (s != 0)
- die("getaddrinfo: %s", gai_strerror(s));
-
- for (rp = result; rp != NULL; rp = rp->ai_next) {
- sfd = socket(rp->ai_family, rp->ai_socktype,
- rp->ai_protocol);
- if (sfd == -1)
- continue;
-
- if (connect(sfd, rp->ai_addr, rp->ai_addrlen) != -1)
- break;
- close(sfd);
- }
-
- if (!rp)
- die("Can not connect to %s:%s", server, port);
-
- freeaddrinfo(result);
+ ssize_t n;
+ int cpu, i;

- n = read(sfd, buf, 8);
+ n = read(fd, buf, 8);

/* Make sure the server is the tracecmd server */
if (memcmp(buf, "tracecmd", 8) != 0)
@@ -1670,13 +1624,13 @@ static void setup_network(void)
sprintf(buf, "%d", cpu_count);

/* include \0 */
- write(sfd, buf, strlen(buf)+1);
+ write(fd, buf, strlen(buf)+1);

/* write the pagesize (in ASCII) */
sprintf(buf, "%d", page_size);

/* include \0 */
- write(sfd, buf, strlen(buf)+1);
+ write(fd, buf, strlen(buf)+1);

/*
* If we are using IPV4 and our page size is greater than
@@ -1691,14 +1645,14 @@ static void setup_network(void)

if (use_tcp) {
/* Send one option */
- write(sfd, "1", 2);
+ write(fd, "1", 2);
/* Size 4 */
- write(sfd, "4", 2);
+ write(fd, "4", 2);
/* use TCP */
- write(sfd, "TCP", 4);
+ write(fd, "TCP", 4);
} else
/* No options */
- write(sfd, "0", 2);
+ write(fd, "0", 2);

client_ports = malloc_or_die(sizeof(int) * cpu_count);

@@ -1708,7 +1662,7 @@ static void setup_network(void)
*/
for (cpu = 0; cpu < cpu_count; cpu++) {
for (i = 0; i < BUFSIZ; i++) {
- n = read(sfd, buf+i, 1);
+ n = read(fd, buf+i, 1);
if (n != 1)
die("Error, reading server ports");
if (!buf[i] || buf[i] == ',')
@@ -1719,6 +1673,57 @@ static void setup_network(void)
buf[i] = 0;
client_ports[cpu] = atoi(buf);
}
+}
+
+static void setup_network(void)
+{
+ struct tracecmd_output *handle;
+ struct addrinfo hints;
+ struct addrinfo *result, *rp;
+ int sfd, s;
+ char *server;
+ char *port;
+ char *p;
+
+ if (!strchr(host, ':')) {
+ server = strdup("localhost");
+ if (!server)
+ die("alloctating server");
+ port = host;
+ host = server;
+ } else {
+ host = strdup(host);
+ if (!host)
+ die("alloctating server");
+ server = strtok_r(host, ":", &p);
+ port = strtok_r(NULL, ":", &p);
+ }
+
+ memset(&hints, 0, sizeof(hints));
+ hints.ai_family = AF_UNSPEC;
+ hints.ai_socktype = SOCK_STREAM;
+
+ s = getaddrinfo(server, port, &hints, &result);
+ if (s != 0)
+ die("getaddrinfo: %s", gai_strerror(s));
+
+ for (rp = result; rp != NULL; rp = rp->ai_next) {
+ sfd = socket(rp->ai_family, rp->ai_socktype,
+ rp->ai_protocol);
+ if (sfd == -1)
+ continue;
+
+ if (connect(sfd, rp->ai_addr, rp->ai_addrlen) != -1)
+ break;
+ close(sfd);
+ }
+
+ if (!rp)
+ die("Can not connect to %s:%s", server, port);
+
+ freeaddrinfo(result);
+
+ communicate_with_listener(sfd);

/* Now create the handle through this socket */
handle = tracecmd_create_init_fd_glob(sfd, listed_events);

2013-08-19 09:44:57

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: [RFC PATCH 02/11] [BUGFIX] trace-cmd: Add waitpid() when recorders are destoried

Add waitpid() when recorders are destroyed for avoiding zombie processes.
When udp_port is inappropriate, a parent process will destroy all child recorder
processes. Currently, the process does not wait for the termination of the
children, so those recorders can become zombie processes if the parent process
cannot die at once due to any reasons.

Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
---
trace-listen.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/trace-listen.c b/trace-listen.c
index dec1c00..6c1bcac 100644
--- a/trace-listen.c
+++ b/trace-listen.c
@@ -446,6 +446,7 @@ static void process_client(const char *node, const char *port, int fd)
for (cpu = 0; cpu < cpus; cpu++) {
if (pid_array[cpu] > 0) {
kill(pid_array[cpu], SIGKILL);
+ waitpid(pid_array[cpu], NULL, 0);
delete_temp_file(node, port, cpu);
pid_array[cpu] = 0;
}

2013-08-20 16:00:10

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] trace-cmd: Support the feature recording trace data of guests on the host

On Mon, 19 Aug 2013 18:46:20 +0900
Yoshihiro YUNOMAE <[email protected]> wrote:


> d) merge feature of trace data of multiple guests and a host in chronological
> order
> Current trace-cmd cannot merge trace data of multiple guests and a host in
> chronological order. If an user wants to analyze an I/O delay problem of a
> guest, the user will want to check trace data of all guests and the host in a
> file. However, trace-cmd does not support a merge feature yet, the user must
> make a merging script. So, trace-cmd had better support a merge feature for
> multiple files for virtualization.
>

This is incorrect. trace-cmd already has a merge feature for multiple
machines.

If you have two boxes that are in sync by ntp, you can do the following:

On box1:

trace-cmd record --date -o trace-box1.dat -e all ping box2

On box2:

trace-cmd record --date -o trace-box2.dat -e all ping box1


And then copy over trace-box2.dat to box1 and run

trace-cmd report -i trace-box1.dat -i trace-box2.dat

And you will see a merge. I just did this on two of my boxes called ixf
and bxtest and here's a partial output:

trace-bxtest.dat: trace-cmd-1348 [003] 1377013771.682807: sys_enter: NR 2 (22e2b00, 241, 1a4, 1a, 4361ac, 3)
trace-bxtest.dat: trace-cmd-1348 [003] 1377013771.682808: sys_enter_open: filename: 0x022e2b00, flags: 0x00000241, mode: 0x000001a4
trace-ixf.dat: <idle>-0 [002] 1377013771.682808: hrtimer_cancel: hrtimer=0xffff880002110820
trace-ixf.dat: <idle>-0 [002] 1377013771.682808: hrtimer_expire_entry: hrtimer=0xffff880002110820 now=673528250850 function=tick_sched_timer/0x0
trace-bxtest.dat: trace-cmd-1348 [003] 1377013771.682809:
kmem_cache_alloc: (getname_flags+0x37) call_site=ffffffff8117b797 ptr=0xffff8800d38c0000 bytes_req=4096 bytes_alloc=4096 gfp_flags=GFP_KERNELGFP_NOTRACK


The --date option is used because the two machines are not in sync with
the trace time stamp. What the date option does, is to sync the
timestamp up with the gettimeofday and the output reports that. This
allows the two boxes to report information that is relatively close to
how the two interacted.

If the guest and the host have the same clock, then the --date option
is not needed and the two should be able to be merged normally.

Also, I haven't released it yet (will soon), but trace-cmd handles
multiple buffers too. That is, with the multiple buffers that ftrace
has, it will create and read from them as well as report them.

I'll finish my testing on all the latest features of trace-cmd I have
and push it out later today.

I'll also take a look at the rest of your patches.

Thanks,

-- Steve

2013-08-20 17:15:20

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 05/11] [CLEANUP] trace-cmd: Split out the connect waiting loop from do_listen()

On Mon, 19 Aug 2013 18:46:32 +0900
Yoshihiro YUNOMAE <[email protected]> wrote:

> Split out the connect waiting loop from do_listen() for avoiding duplicate codes
> between listen mode and virt-server mode.
>
> Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
> ---
> trace-listen.c | 45 ++++++++++++++++++++++++++-------------------
> 1 file changed, 26 insertions(+), 19 deletions(-)
>
> diff --git a/trace-listen.c b/trace-listen.c
> index 6c1bcac..c741fa4 100644
> --- a/trace-listen.c
> +++ b/trace-listen.c
> @@ -580,14 +580,35 @@ static void clean_up(int sig)
> } while (ret > 0);
> }
>
> +static void do_accept_loop(int sfd)
> +{
> + struct sockaddr_storage peer_addr;
> + socklen_t peer_addr_len;
> + int cfd, pid;

Just an FYI, and I don't really care too much. But my personal
preference, is to only join variables that are being used the same way.
That is, multiple file descriptors sure:

int sfd, s, cfd;

But I prefer to keep file descriptors separate from pid.

int cfd;
int pid;

But as I said, this is only a preference and I don't really care that
much about it to even bother to change the patch.

-- Steve

2013-08-20 17:18:17

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 05/11] [CLEANUP] trace-cmd: Split out the connect waiting loop from do_listen()

On Mon, 19 Aug 2013 18:46:32 +0900
Yoshihiro YUNOMAE <[email protected]> wrote:

> Split out the connect waiting loop from do_listen() for avoiding duplicate codes
> between listen mode and virt-server mode.
>
> Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
> ---
> trace-listen.c | 45 ++++++++++++++++++++++++++-------------------
> 1 file changed, 26 insertions(+), 19 deletions(-)
>
> diff --git a/trace-listen.c b/trace-listen.c
> index 6c1bcac..c741fa4 100644
> --- a/trace-listen.c
> +++ b/trace-listen.c
> @@ -580,14 +580,35 @@ static void clean_up(int sig)
> } while (ret > 0);
> }
>
> +static void do_accept_loop(int sfd)
> +{
> + struct sockaddr_storage peer_addr;
> + socklen_t peer_addr_len;
> + int cfd, pid;
> +
> + peer_addr_len = sizeof(peer_addr);
> +
> + do {
> + cfd = accept(sfd, (struct sockaddr *)&peer_addr,
> + &peer_addr_len);

I'm also not that strict on the 80 character limit. This is fine, but
so is one line. If you go over a few characters, it's not the end of
the world. Like, by joining this line, it's only 81 characters.

-- Steve

> + printf("connected!\n");
> + if (cfd < 0 && errno == EINTR)
> + continue;
> + if (cfd < 0)
> + pdie("connecting");
> +
> + pid = do_connection(cfd, &peer_addr, peer_addr_len);
> + if (pid > 0)
> + add_process(pid);
> +
> + } while (!done);

2013-08-20 17:38:58

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 06/11] [CLEANUP] trace-cmd: Split out the communication with client from process_client()

On Mon, 19 Aug 2013 18:46:34 +0900
Yoshihiro YUNOMAE <[email protected]> wrote:

> Split out the communication with client from process_client() for avoiding
> duplicate codes between listen mode and virt-server mode.
>

So far I only have cosmetic comments.


> Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
> ---
> trace-listen.c | 163 ++++++++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 116 insertions(+), 47 deletions(-)
>
> diff --git a/trace-listen.c b/trace-listen.c
> index c741fa4..f29dd35 100644
> --- a/trace-listen.c
> +++ b/trace-listen.c
> @@ -283,22 +283,12 @@ static int open_udp(const char *node, const char *port, int *pid,
> return num_port;
> }
>
> -static void process_client(const char *node, const char *port, int fd)
> +static int communicate_with_client(int fd, int *cpus, int *pagesize)
> {
> - char **temp_files;
> char buf[BUFSIZ];
> char *option;
> - int *port_array;
> - int *pid_array;
> - int pagesize;
> - int start_port;
> - int udp_port;
> int options;
> int size;
> - int cpus;
> - int cpu;
> - int pid;
> - int ofd;
> int n, s, t, i;
>
> /* Let the client know what we are */
> @@ -308,31 +298,31 @@ static void process_client(const char *node, const char *port, int fd)
> n = read_string(fd, buf, BUFSIZ);
> if (n == BUFSIZ)
> /** ERROR **/
> - return;
> + return -1;
>
> - cpus = atoi(buf);
> + *cpus = atoi(buf);
>
> - plog("cpus=%d\n", cpus);
> - if (cpus < 0)
> - return;
> + plog("cpus=%d\n", *cpus);
> + if (*cpus < 0)
> + return -1;
>
> /* next read the page size */
> n = read_string(fd, buf, BUFSIZ);
> if (n == BUFSIZ)
> /** ERROR **/
> - return;
> + return -1;
>
> - pagesize = atoi(buf);
> + *pagesize = atoi(buf);
>
> - plog("pagesize=%d\n", pagesize);
> - if (pagesize <= 0)
> - return;
> + plog("pagesize=%d\n", *pagesize);
> + if (*pagesize <= 0)
> + return -1;
>
> /* Now the number of options */
> n = read_string(fd, buf, BUFSIZ);
> if (n == BUFSIZ)
> /** ERROR **/
> - return;
> + return -1;
>
> options = atoi(buf);
>
> @@ -341,18 +331,18 @@ static void process_client(const char *node, const char *port, int fd)
> n = read_string(fd, buf, BUFSIZ);
> if (n == BUFSIZ)
> /** ERROR **/
> - return;
> + return -1;
> size = atoi(buf);
> /* prevent a client from killing us */
> if (size > MAX_OPTION_SIZE)
> - return;
> + return -1;
> option = malloc_or_die(size);
> do {
> t = size;
> s = 0;
> s = read(fd, option+s, t);
> if (s <= 0)
> - return;
> + return -1;
> t -= s;
> s = size - t;
> } while (t);
> @@ -361,18 +351,53 @@ static void process_client(const char *node, const char *port, int fd)
> free(option);
> /* do we understand this option? */
> if (!s)
> - return;
> + return -1;
> }
>
> if (use_tcp)
> plog("Using TCP for live connection\n");
>
> - /* Create the client file */
> + return 0;
> +}
> +
> +static int create_client_file(const char *node, const char *port)
> +{
> + char buf[BUFSIZ];
> + int ofd;
> +
> snprintf(buf, BUFSIZ, "%s.%s:%s.dat", output_file, node, port);
>
> ofd = open(buf, O_RDWR | O_CREAT | O_TRUNC, 0644);
> if (ofd < 0)
> pdie("Can not create file %s", buf);
> + return ofd;
> +}
> +
> +static void destroy_all_readers(int cpus, int *pid_array, const char *node,
> + const char *port)
> +{
> + int cpu;
> +
> + for (cpu = 0; cpu < cpus; cpu++) {
> + if (pid_array[cpu] > 0) {
> + kill(pid_array[cpu], SIGKILL);
> + waitpid(pid_array[cpu], NULL, 0);
> + delete_temp_file(node, port, cpu);
> + pid_array[cpu] = 0;
> + }
> + }
> +}
> +
> +static int *create_all_readers(int cpus, const char *node, const char *port,
> + int pagesize, int fd)
> +{
> + char buf[BUFSIZ];
> + int *port_array;
> + int *pid_array;
> + int start_port;
> + int udp_port;
> + int cpu;
> + int pid;
>
> port_array = malloc_or_die(sizeof(int) * cpus);
> pid_array = malloc_or_die(sizeof(int) * cpus);
> @@ -382,13 +407,17 @@ static void process_client(const char *node, const char *port, int fd)
>
> /* Now create a UDP port for each CPU */
> for (cpu = 0; cpu < cpus; cpu++) {
> - udp_port = open_udp(node, port, &pid, cpu, pagesize, start_port);
> + udp_port = open_udp(node, port, &pid, cpu,
> + pagesize, start_port);

Again, no need to make multiple lines. I'm not sure it makes it look
any better.

> if (udp_port < 0)
> goto out_free;
> port_array[cpu] = udp_port;
> pid_array[cpu] = pid;
> - /* due to some bugging finding ports, force search after last port */
> - start_port = udp_port+1;
> + /*
> + * due to some bugging finding ports,
> + * force search after last port
> + */

Same here, the split looks rather funny. Do you work on a 80 character
terminal?

> + start_port = udp_port + 1;
> }
>
> /* send the client a comma deliminated set of port numbers */
> @@ -400,9 +429,20 @@ static void process_client(const char *node, const char *port, int fd)
> /* end with null terminator */
> write(fd, "\0", 1);
>
> - /* Now we are ready to start reading data from the client */
> + return pid_array;
> +
> + out_free:
> + destroy_all_readers(cpus, pid_array, node, port);
> + return NULL;
> +}
> +
> +static void collect_metadata_from_client(int ifd, int ofd)
> +{
> + char buf[BUFSIZ];
> + int n, s, t;
> +
> do {
> - n = read(fd, buf, BUFSIZ);
> + n = read(ifd, buf, BUFSIZ);
> if (n < 0) {
> if (errno == EINTR)
> continue;
> @@ -411,7 +451,7 @@ static void process_client(const char *node, const char *port, int fd)
> t = n;
> s = 0;
> do {
> - s = write(ofd, buf+s, t);
> + s = write(ofd, buf + s, t);

This is one of those exceptions to the rule. Even in the Linux kernel,
it's not consistent. As the "buf+s" shows more of a offset than a true
addition, it is usually preferable to not have spaces. Because when
reading the line in your head we have:

write(old, buf + s, t);

Is thought of as "add s to buf and pass the result will be where the
write will go to".


write(old, buf+s, t);

Is thought of as "write the result to buf at offset s". This is because
leaving out the spaces, correlates to the equivalent of:

write(old, &buf[s], t);

Which would be the same thing.

I'll update your patch to fix these minor cosmetic changes.

-- Steve



> if (s < 0) {
> if (errno == EINTR)
> break;
> @@ -421,18 +461,23 @@ static void process_client(const char *node, const char *port, int fd)
> s = n - t;
> } while (t);
> } while (n > 0 && !done);
> +}
>
> - /* wait a little to let our readers finish reading */
> - sleep(1);
> +static void stop_all_readers(int cpus, int *pid_array)
> +{
> + int cpu;
>
> - /* stop our readers */
> for (cpu = 0; cpu < cpus; cpu++) {
> if (pid_array[cpu] > 0)
> kill(pid_array[cpu], SIGUSR1);
> }
> +}
>
> - /* wait a little to have the readers clean up */
> - sleep(1);
> +static void put_together_file(int cpus, int ofd, const char *node,
> + const char *port)
> +{
> + char **temp_files;
> + int cpu;
>
> /* Now put together the file */
> temp_files = malloc_or_die(sizeof(*temp_files) * cpus);
> @@ -441,16 +486,40 @@ static void process_client(const char *node, const char *port, int fd)
> temp_files[cpu] = get_temp_file(node, port, cpu);
>
> tracecmd_attach_cpu_data_fd(ofd, cpus, temp_files);
> + free(temp_files);
> +}
>
> - out_free:
> - for (cpu = 0; cpu < cpus; cpu++) {
> - if (pid_array[cpu] > 0) {
> - kill(pid_array[cpu], SIGKILL);
> - waitpid(pid_array[cpu], NULL, 0);
> - delete_temp_file(node, port, cpu);
> - pid_array[cpu] = 0;
> - }
> - }
> +static void process_client(const char *node, const char *port, int fd)
> +{
> + int *pid_array;
> + int pagesize;
> + int cpus;
> + int ofd;
> +
> + if (communicate_with_client(fd, &cpus, &pagesize) < 0)
> + return;
> +
> + ofd = create_client_file(node, port);
> +
> + pid_array = create_all_readers(cpus, node, port, pagesize, fd);
> + if (!pid_array)
> + return;
> +
> + /* Now we are ready to start reading data from the client */
> + collect_metadata_from_client(fd, ofd);
> +
> + /* wait a little to let our readers finish reading */
> + sleep(1);
> +
> + /* stop our readers */
> + stop_all_readers(cpus, pid_array);
> +
> + /* wait a little to have the readers clean up */
> + sleep(1);
> +
> + put_together_file(cpus, ofd, node, port);
> +
> + destroy_all_readers(cpus, pid_array, node, port);
> }
>
> static int do_fork(int cfd)

2013-08-20 17:49:06

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 07/11] [CLEANUP] trace-cmd: Split out binding a port and fork reader from open_udp()

On Mon, 19 Aug 2013 18:46:37 +0900
Yoshihiro YUNOMAE <[email protected]> wrote:

> Split out binding a port and fork reader from open_udp() for avoiding duplicate
> codes between listen mode and virt-server mode.
>
> Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
> ---
> trace-listen.c | 34 ++++++++++++++++++++++++++--------
> 1 file changed, 26 insertions(+), 8 deletions(-)
>
> diff --git a/trace-listen.c b/trace-listen.c
> index f29dd35..bf9ef9d 100644
> --- a/trace-listen.c
> +++ b/trace-listen.c
> @@ -228,13 +228,12 @@ static void process_udp_child(int sfd, const char *host, const char *port,
> #define START_PORT_SEARCH 1500
> #define MAX_PORT_SEARCH 6000
>
> -static int open_udp(const char *node, const char *port, int *pid,
> - int cpu, int pagesize, int start_port)
> +static int udp_bind_a_port(int start_port, int *sfd)
> {
> struct addrinfo hints;
> struct addrinfo *result, *rp;
> - int sfd, s;
> char buf[BUFSIZ];
> + int s;
> int num_port = start_port;
>
> again:
> @@ -250,15 +249,15 @@ static int open_udp(const char *node, const char *port, int *pid,
> pdie("getaddrinfo: error opening udp socket");
>
> for (rp = result; rp != NULL; rp = rp->ai_next) {
> - sfd = socket(rp->ai_family, rp->ai_socktype,
> - rp->ai_protocol);
> - if (sfd < 0)
> + *sfd = socket(rp->ai_family, rp->ai_socktype,
> + rp->ai_protocol);
> + if (*sfd < 0)
> continue;
>
> - if (bind(sfd, rp->ai_addr, rp->ai_addrlen) == 0)
> + if (bind(*sfd, rp->ai_addr, rp->ai_addrlen) == 0)
> break;
>
> - close(sfd);
> + close(*sfd);
> }
>
> if (rp == NULL) {
> @@ -270,6 +269,12 @@ static int open_udp(const char *node, const char *port, int *pid,
>
> freeaddrinfo(result);
>
> + return num_port;
> +}
> +
> +static void fork_udp_reader(int sfd, const char *node, const char *port,
> + int *pid, int cpu, int pagesize)
> +{
> *pid = fork();
>
> if (*pid < 0)
> @@ -279,6 +284,19 @@ static int open_udp(const char *node, const char *port, int *pid,
> process_udp_child(sfd, node, port, cpu, pagesize);
>
> close(sfd);
> +}
> +
> +static int open_udp(const char *node, const char *port, int *pid,
> + int cpu, int pagesize, int start_port)
> +{
> + int sfd;
> + int num_port;
> +
> + num_port = udp_bind_a_port(start_port, &sfd);
> + if (num_port < 0)
> + return num_port;

I don't see how num_port could be less than zero.

-- Steve

> +
> + fork_udp_reader(sfd, node, port, pid, cpu, pagesize);
>
> return num_port;
> }

2013-08-20 17:57:00

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 08/11] trace-cmd: Apply the trace-msg protocol for communication between a server and clients

On Mon, 19 Aug 2013 18:46:39 +0900
Yoshihiro YUNOMAE <[email protected]> wrote:


> This message protocol is incompatible with the previous unstructured message
> protocol. So, if an old(new)-version client tries to connect to an
> new(old)-version server, the operation should be stopped.
>

I'm a stickler for backward compatibility. I'm all for extensions.

I know this will just complicate things, but I don't mind that. What
should happen is, it should try to connect with the new protocol, if it
fails due to an older server, then it needs to fall back to the older
method, without the added features. We can freeze the older method if
need be. But I will not let a newer trace-cmd become incompatible with
an older version. I worked hard to keep it that way. There's only a few
exceptions to that.

Note, an older client needs to also work as is with a newer server.

Anyway, the old way only needs to stay the same, it does not need added
features. For that, a switch to the new way is needed.

If you need help in accomplishing this, I'll work with you on that.

Thanks!

-- Steve

2013-08-26 01:46:44

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: Re: Re: [RFC PATCH 00/11] trace-cmd: Support the feature recording trace data of guests on the host

Hi Steven,

Thank you for reviewing my patches.
Sorry for the late reply.

(2013/08/21 1:00), Steven Rostedt wrote:
> On Mon, 19 Aug 2013 18:46:20 +0900
> Yoshihiro YUNOMAE <[email protected]> wrote:
>
>
>> d) merge feature of trace data of multiple guests and a host in chronological
>> order
>> Current trace-cmd cannot merge trace data of multiple guests and a host in
>> chronological order. If an user wants to analyze an I/O delay problem of a
>> guest, the user will want to check trace data of all guests and the host in a
>> file. However, trace-cmd does not support a merge feature yet, the user must
>> make a merging script. So, trace-cmd had better support a merge feature for
>> multiple files for virtualization.
>>
>
> This is incorrect. trace-cmd already has a merge feature for multiple
> machines.
>
> If you have two boxes that are in sync by ntp, you can do the following:
>
> On box1:
>
> trace-cmd record --date -o trace-box1.dat -e all ping box2
>
> On box2:
>
> trace-cmd record --date -o trace-box2.dat -e all ping box1
>
>
> And then copy over trace-box2.dat to box1 and run
>
> trace-cmd report -i trace-box1.dat -i trace-box2.dat
>
> And you will see a merge. I just did this on two of my boxes called ixf
> and bxtest and here's a partial output:
>
> trace-bxtest.dat: trace-cmd-1348 [003] 1377013771.682807: sys_enter: NR 2 (22e2b00, 241, 1a4, 1a, 4361ac, 3)
> trace-bxtest.dat: trace-cmd-1348 [003] 1377013771.682808: sys_enter_open: filename: 0x022e2b00, flags: 0x00000241, mode: 0x000001a4
> trace-ixf.dat: <idle>-0 [002] 1377013771.682808: hrtimer_cancel: hrtimer=0xffff880002110820
> trace-ixf.dat: <idle>-0 [002] 1377013771.682808: hrtimer_expire_entry: hrtimer=0xffff880002110820 now=673528250850 function=tick_sched_timer/0x0
> trace-bxtest.dat: trace-cmd-1348 [003] 1377013771.682809:
> kmem_cache_alloc: (getname_flags+0x37) call_site=ffffffff8117b797 ptr=0xffff8800d38c0000 bytes_req=4096 bytes_alloc=4096 gfp_flags=GFP_KERNELGFP_NOTRACK
>
>
> The --date option is used because the two machines are not in sync with
> the trace time stamp. What the date option does, is to sync the
> timestamp up with the gettimeofday and the output reports that. This
> allows the two boxes to report information that is relatively close to
> how the two interacted.

Oh, I didn't know the --date option.
As you mentioned, we can merge trace data in chronological order by
using --date option if the times of those machines are synchronized by
NTP.

> If the guest and the host have the same clock, then the --date option
> is not needed and the two should be able to be merged normally.

No, we can not assure that the guest and the host have the same clock
even if it is running on the same physical machine, because both kernel
doesn't share it, there is some difference between them. So, we still
need time synchronizing guest-host by NTP and --date option.

However, there are cases that times of those machines cannot be
synchronized. For example, although multiple users can run guests on
virtualization environments (e.g. multi-tenant cloud hosting), there
are no guarantee that they use the same NTP server. Moreover, even if
the times are synchronized, trace data cannot exactly be merged because
the NTP-synchronized time granularity may not be enough fine for
sorting guest-host switching events.

> Also, I haven't released it yet (will soon), but trace-cmd handles
> multiple buffers too. That is, with the multiple buffers that ftrace
> has, it will create and read from them as well as report them.

Is it commit ID d56f30679f9811a91ed471c8e081cc7ffbed1e62?
We can download the feature from your git repository.

> I'll finish my testing on all the latest features of trace-cmd I have
> and push it out later today.
>
> I'll also take a look at the rest of your patches.

Thank you!

Yoshihiro YUNOMAE

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

2013-08-26 01:48:20

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: Re: Re: [RFC PATCH 07/11] [CLEANUP] trace-cmd: Split out binding a port and fork reader from open_udp()

(2013/08/21 2:49), Steven Rostedt wrote:
> On Mon, 19 Aug 2013 18:46:37 +0900
> Yoshihiro YUNOMAE <[email protected]> wrote:
>
>> Split out binding a port and fork reader from open_udp() for avoiding duplicate
>> codes between listen mode and virt-server mode.
>>
>> Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
>> ---
>> trace-listen.c | 34 ++++++++++++++++++++++++++--------
>> 1 file changed, 26 insertions(+), 8 deletions(-)
>>
>> diff --git a/trace-listen.c b/trace-listen.c
>> index f29dd35..bf9ef9d 100644
>> --- a/trace-listen.c
>> +++ b/trace-listen.c
>> @@ -228,13 +228,12 @@ static void process_udp_child(int sfd, const char *host, const char *port,
>> #define START_PORT_SEARCH 1500
>> #define MAX_PORT_SEARCH 6000
>>
>> -static int open_udp(const char *node, const char *port, int *pid,
>> - int cpu, int pagesize, int start_port)
>> +static int udp_bind_a_port(int start_port, int *sfd)
>> {
>> struct addrinfo hints;
>> struct addrinfo *result, *rp;
>> - int sfd, s;
>> char buf[BUFSIZ];
>> + int s;
>> int num_port = start_port;
>>
>> again:
>> @@ -250,15 +249,15 @@ static int open_udp(const char *node, const char *port, int *pid,
>> pdie("getaddrinfo: error opening udp socket");
>>
>> for (rp = result; rp != NULL; rp = rp->ai_next) {
>> - sfd = socket(rp->ai_family, rp->ai_socktype,
>> - rp->ai_protocol);
>> - if (sfd < 0)
>> + *sfd = socket(rp->ai_family, rp->ai_socktype,
>> + rp->ai_protocol);
>> + if (*sfd < 0)
>> continue;
>>
>> - if (bind(sfd, rp->ai_addr, rp->ai_addrlen) == 0)
>> + if (bind(*sfd, rp->ai_addr, rp->ai_addrlen) == 0)
>> break;
>>
>> - close(sfd);
>> + close(*sfd);
>> }
>>
>> if (rp == NULL) {
>> @@ -270,6 +269,12 @@ static int open_udp(const char *node, const char *port, int *pid,
>>
>> freeaddrinfo(result);
>>
>> + return num_port;
>> +}
>> +
>> +static void fork_udp_reader(int sfd, const char *node, const char *port,
>> + int *pid, int cpu, int pagesize)
>> +{
>> *pid = fork();
>>
>> if (*pid < 0)
>> @@ -279,6 +284,19 @@ static int open_udp(const char *node, const char *port, int *pid,
>> process_udp_child(sfd, node, port, cpu, pagesize);
>>
>> close(sfd);
>> +}
>> +
>> +static int open_udp(const char *node, const char *port, int *pid,
>> + int cpu, int pagesize, int start_port)
>> +{
>> + int sfd;
>> + int num_port;
>> +
>> + num_port = udp_bind_a_port(start_port, &sfd);
>> + if (num_port < 0)
>> + return num_port;
>
> I don't see how num_port could be less than zero.

I think so, but trace-cmd checks whether udp_port is less than zero or
not in create_all_readers().

May I submit the removal patch?

Thanks,
Yoshihiro YUNOMAE

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

2013-08-26 01:50:38

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: Re: Re: [RFC PATCH 08/11] trace-cmd: Apply the trace-msg protocol for communication between a server and clients

(2013/08/21 2:56), Steven Rostedt wrote:
> On Mon, 19 Aug 2013 18:46:39 +0900
> Yoshihiro YUNOMAE <[email protected]> wrote:
>
>
>> This message protocol is incompatible with the previous unstructured message
>> protocol. So, if an old(new)-version client tries to connect to an
>> new(old)-version server, the operation should be stopped.
>>
>
> I'm a stickler for backward compatibility. I'm all for extensions.

No problem:)
I also worried about backward compatibility.

> I know this will just complicate things, but I don't mind that. What
> should happen is, it should try to connect with the new protocol, if it
> fails due to an older server, then it needs to fall back to the older
> method, without the added features. We can freeze the older method if
> need be. But I will not let a newer trace-cmd become incompatible with
> an older version. I worked hard to keep it that way. There's only a few
> exceptions to that.
>
> Note, an older client needs to also work as is with a newer server.
>
> Anyway, the old way only needs to stay the same, it does not need added
> features. For that, a switch to the new way is needed.

OK, I'll add the code switching to the new way in order to keep backward
compatibility. Would you give me your comments about following things?

<Network>
0. old server and old client
Old servers send "tracecmd" as the first message.
Old clients compare the first 8byte of the first message with "tracecmd".

1. new server
- Send "tracecmd-v2" as the first message.
- Check the reply message whether the message is "tracecmd-v2" or cpus
value.
If "tracecmd-v2", the server uses new protocol and wait for the
message MSG_TINIT.
If cpus value, the server uses old protocol.

2. new client
- Receive the first message.
- Check the message whether the message is "tracecmd-v2" or not.
If "tracecmd-v2", the client sends "tracecmd-v2" to the server. Then,
the client sends the message MSG_TINIT.
If "tracecmd", the client sends cpus value as the old protocol.

<Virtio-serial>
This is new feature, so trace-cmd does not need to keep backward
compatibility.

> If you need help in accomplishing this, I'll work with you on that.

Thank you for your kindness!

Yoshihiro YUNOMAE

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

2013-08-26 14:22:31

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] trace-cmd: Support the feature recording trace data of guests on the host

On Mon, 26 Aug 2013 10:46:38 +0900
Yoshihiro YUNOMAE <[email protected]> wrote:

> > The --date option is used because the two machines are not in sync with
> > the trace time stamp. What the date option does, is to sync the
> > timestamp up with the gettimeofday and the output reports that. This
> > allows the two boxes to report information that is relatively close to
> > how the two interacted.
>
> Oh, I didn't know the --date option.
> As you mentioned, we can merge trace data in chronological order by
> using --date option if the times of those machines are synchronized by
> NTP.
>
> > If the guest and the host have the same clock, then the --date option
> > is not needed and the two should be able to be merged normally.
>
> No, we can not assure that the guest and the host have the same clock
> even if it is running on the same physical machine, because both kernel
> doesn't share it, there is some difference between them. So, we still
> need time synchronizing guest-host by NTP and --date option.
>
> However, there are cases that times of those machines cannot be
> synchronized. For example, although multiple users can run guests on
> virtualization environments (e.g. multi-tenant cloud hosting), there
> are no guarantee that they use the same NTP server. Moreover, even if
> the times are synchronized, trace data cannot exactly be merged because
> the NTP-synchronized time granularity may not be enough fine for
> sorting guest-host switching events.

Right, unless there's some other means no synchronize between boxes,
this is currently the best we have.

>
> > Also, I haven't released it yet (will soon), but trace-cmd handles
> > multiple buffers too. That is, with the multiple buffers that ftrace
> > has, it will create and read from them as well as report them.
>
> Is it commit ID d56f30679f9811a91ed471c8e081cc7ffbed1e62?
> We can download the feature from your git repository.

Yep.

-- Steve

2013-08-26 14:37:29

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 07/11] [CLEANUP] trace-cmd: Split out binding a port and fork reader from open_udp()

On Mon, 26 Aug 2013 10:48:15 +0900
Yoshihiro YUNOMAE <[email protected]> wrote:


> >> +static int open_udp(const char *node, const char *port, int *pid,
> >> + int cpu, int pagesize, int start_port)
> >> +{
> >> + int sfd;
> >> + int num_port;
> >> +
> >> + num_port = udp_bind_a_port(start_port, &sfd);
> >> + if (num_port < 0)
> >> + return num_port;
> >
> > I don't see how num_port could be less than zero.
>
> I think so, but trace-cmd checks whether udp_port is less than zero or
> not in create_all_readers().
>
> May I submit the removal patch?

It's not a critical path, the check is fine. As there was a check, I
was thinking that you were expecting it to pass an error which it never
did.

Perhaps just add a comment stating that udp_bind_a_port() currently
does not return an error, but if that changes in the future, we have a
check for it now.

That way a reviewer of the code does not get confused by it.

-- Steve

2013-08-26 15:11:09

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 08/11] trace-cmd: Apply the trace-msg protocol for communication between a server and clients

On Mon, 26 Aug 2013 10:50:33 +0900
Yoshihiro YUNOMAE <[email protected]> wrote:

> <Network>
> 0. old server and old client
> Old servers send "tracecmd" as the first message.
> Old clients compare the first 8byte of the first message with "tracecmd".
>
> 1. new server
> - Send "tracecmd-v2" as the first message.
> - Check the reply message whether the message is "tracecmd-v2" or cpus
> value.
> If "tracecmd-v2", the server uses new protocol and wait for the
> message MSG_TINIT.
> If cpus value, the server uses old protocol.

That will kill an old client. Yes it only checks the first 8 bytes, but
then it gets back a comma separated list of CPUs. As it uses TCP for
this transaction, the unread "-v2" will end up being read as CPUs by
the client.

>
> 2. new client
> - Receive the first message.
> - Check the message whether the message is "tracecmd-v2" or not.
> If "tracecmd-v2", the client sends "tracecmd-v2" to the server. Then,
> the client sends the message MSG_TINIT.
> If "tracecmd", the client sends cpus value as the old protocol.
>

It will have to be the client that determines if the protocol is v2 or
not. Basically, after receiving the "tracecmd" the client can send back
"V2", and then expect the server to reply with "V2" as well. If the
server does not, then it disconnects from the server and then restarts
with the old protocol.

> <Virtio-serial>
> This is new feature, so trace-cmd does not need to keep backward
> compatibility.

Right, the virtio-serial will noly need to be supported by the v2
protocols.

-- Steve

2013-08-27 08:07:43

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: Re: Re: [RFC PATCH 00/11] trace-cmd: Support the feature recording trace data of guests on the host

Hi Steven,

(2013/08/26 23:22), Steven Rostedt wrote:
> On Mon, 26 Aug 2013 10:46:38 +0900
> Yoshihiro YUNOMAE <[email protected]> wrote:
>
>>> The --date option is used because the two machines are not in sync with
>>> the trace time stamp. What the date option does, is to sync the
>>> timestamp up with the gettimeofday and the output reports that. This
>>> allows the two boxes to report information that is relatively close to
>>> how the two interacted.
>>
>> Oh, I didn't know the --date option.
>> As you mentioned, we can merge trace data in chronological order by
>> using --date option if the times of those machines are synchronized by
>> NTP.
>>
>>> If the guest and the host have the same clock, then the --date option
>>> is not needed and the two should be able to be merged normally.
>>
>> No, we can not assure that the guest and the host have the same clock
>> even if it is running on the same physical machine, because both kernel
>> doesn't share it, there is some difference between them. So, we still
>> need time synchronizing guest-host by NTP and --date option.
>>
>> However, there are cases that times of those machines cannot be
>> synchronized. For example, although multiple users can run guests on
>> virtualization environments (e.g. multi-tenant cloud hosting), there
>> are no guarantee that they use the same NTP server. Moreover, even if
>> the times are synchronized, trace data cannot exactly be merged because
>> the NTP-synchronized time granularity may not be enough fine for
>> sorting guest-host switching events.
>
> Right, unless there's some other means no synchronize between boxes,
> this is currently the best we have.

I'm considering that trace data use x86-tsc as timestamp in order to
merge trace data. By using x86-tsc, we can merge trace data even if
time of those machines is not synchronized. And the precision will be
enough for understanding operations of guests and host. However, TSC
values on a guest are not equal to the values on the host because
TSC_guest = TSC_host + TSC_offset.
This series actually doesn't support TSC offset, but I'd like to
add such feature to fix host/guest clock difference in the next series.
TSC offset values can be gotten as write_tsc_offset trace event from
kernel-3.11. (see https://lkml.org/lkml/2013/6/12/72)
How do you think about this merging feature?

Thanks,
Yoshihiro YUNOMAE

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

2013-08-27 08:08:49

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: Re: Re: [RFC PATCH 07/11] [CLEANUP] trace-cmd: Split out binding a port and fork reader from open_udp()

(2013/08/26 23:37), Steven Rostedt wrote:
> On Mon, 26 Aug 2013 10:48:15 +0900
> Yoshihiro YUNOMAE <[email protected]> wrote:
>
>
>>>> +static int open_udp(const char *node, const char *port, int *pid,
>>>> + int cpu, int pagesize, int start_port)
>>>> +{
>>>> + int sfd;
>>>> + int num_port;
>>>> +
>>>> + num_port = udp_bind_a_port(start_port, &sfd);
>>>> + if (num_port < 0)
>>>> + return num_port;
>>>
>>> I don't see how num_port could be less than zero.
>>
>> I think so, but trace-cmd checks whether udp_port is less than zero or
>> not in create_all_readers().
>>
>> May I submit the removal patch?
>
> It's not a critical path, the check is fine. As there was a check, I
> was thinking that you were expecting it to pass an error which it never
> did.
>
> Perhaps just add a comment stating that udp_bind_a_port() currently
> does not return an error, but if that changes in the future, we have a
> check for it now.
>
> That way a reviewer of the code does not get confused by it.

OK, I'll delete the check and add comment as follows:

/*
* udp_bind_a_port() currently does not return an error, but if that
* changes in the future, we have a check for it now.
*/
num_port = udp_bind_a_port(start_port, &sfd);

Thanks,
Yoshihiro YUNOMAE

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

2013-08-27 10:23:29

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: Re: Re: [RFC PATCH 08/11] trace-cmd: Apply the trace-msg protocol for communication between a server and clients

(2013/08/27 0:11), Steven Rostedt wrote:
> On Mon, 26 Aug 2013 10:50:33 +0900
> Yoshihiro YUNOMAE <[email protected]> wrote:
>
>> <Network>
>> 0. old server and old client
>> Old servers send "tracecmd" as the first message.
>> Old clients compare the first 8byte of the first message with "tracecmd".
>>
>> 1. new server
>> - Send "tracecmd-v2" as the first message.
>> - Check the reply message whether the message is "tracecmd-v2" or cpus
>> value.
>> If "tracecmd-v2", the server uses new protocol and wait for the
>> message MSG_TINIT.
>> If cpus value, the server uses old protocol.
>
> That will kill an old client. Yes it only checks the first 8 bytes, but
> then it gets back a comma separated list of CPUs. As it uses TCP for
> this transaction, the unread "-v2" will end up being read as CPUs by
> the client.

Ah, that's true. This method is inappropriate.

>>
>> 2. new client
>> - Receive the first message.
>> - Check the message whether the message is "tracecmd-v2" or not.
>> If "tracecmd-v2", the client sends "tracecmd-v2" to the server. Then,
>> the client sends the message MSG_TINIT.
>> If "tracecmd", the client sends cpus value as the old protocol.
>>
>
> It will have to be the client that determines if the protocol is v2 or
> not. Basically, after receiving the "tracecmd" the client can send back
> "V2", and then expect the server to reply with "V2" as well. If the
> server does not, then it disconnects from the server and then restarts
> with the old protocol.

OK, let me check that. Even if the old server will receive "V2", the
server will send port numbers instead of "V2" due to the old protocol.
In that time, the new client will disconnect from the old server and
the restarts with the old protocol. Is it OK?

Thanks!
Yoshihiro YUNOMAE

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

2013-08-27 13:00:49

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] trace-cmd: Support the feature recording trace data of guests on the host

On Tue, 27 Aug 2013 17:07:34 +0900
Yoshihiro YUNOMAE <[email protected]> wrote:
et, but I'd like to
> add such feature to fix host/guest clock difference in the next series.
> TSC offset values can be gotten as write_tsc_offset trace event from
> kernel-3.11. (see https://lkml.org/lkml/2013/6/12/72)
> How do you think about this merging feature?

If it's already been merged, then sure, we can use it.

-- Steve

2013-08-27 13:05:13

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 08/11] trace-cmd: Apply the trace-msg protocol for communication between a server and clients

On Tue, 27 Aug 2013 19:23:24 +0900
Yoshihiro YUNOMAE <[email protected]> wrote:

> OK, let me check that. Even if the old server will receive "V2", the
> server will send port numbers instead of "V2" due to the old protocol.
> In that time, the new client will disconnect from the old server and
> the restarts with the old protocol. Is it OK?

Yep, that's exactly what I meant ;-)

-- Steve

2013-08-28 11:30:29

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: Re: Re: [RFC PATCH 08/11] trace-cmd: Apply the trace-msg protocol for communication between a server and clients

(2013/08/27 22:05), Steven Rostedt wrote:
> On Tue, 27 Aug 2013 19:23:24 +0900
> Yoshihiro YUNOMAE <[email protected]> wrote:
>
>> OK, let me check that. Even if the old server will receive "V2", the
>> server will send port numbers instead of "V2" due to the old protocol.
>> In that time, the new client will disconnect from the old server and
>> the restarts with the old protocol. Is it OK?
>
> Yep, that's exactly what I meant ;-)

I tried to implement the feature, but I found that sending just "V2"
from the new client is inappropriate. This is because the old server
doesn't respond to the client before receiving cpu numbers, page size,
and options. So, when the new client sends the first message, it should
send "V2\0<MAGIC_NUMBER>\00\0", I think. If so, the old server will
understand the message as cpus=0, pagesize=<MAGIC_NUMBER>, options=0,
and then it will send port numbers(actually \0). Note if <MAGIC_NUMBER>
is zero, the old server will die.

Can I implement the first message of the new client as
"V2\0<MAGIC_NUMBER>\00\0"?

Thanks,
Yoshihiro YUNOMAE

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

2013-08-28 11:31:36

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: Re: Re: [RFC PATCH 00/11] trace-cmd: Support the feature recording trace data of guests on the host

(2013/08/27 22:00), Steven Rostedt wrote:
> On Tue, 27 Aug 2013 17:07:34 +0900
> Yoshihiro YUNOMAE <[email protected]> wrote:
> et, but I'd like to
>> add such feature to fix host/guest clock difference in the next series.
>> TSC offset values can be gotten as write_tsc_offset trace event from
>> kernel-3.11. (see https://lkml.org/lkml/2013/6/12/72)
>> How do you think about this merging feature?
>
> If it's already been merged, then sure, we can use it.

Great. I'll implement the feature for trace-cmd.

Thank you for your a lot of comments.

Yoshihiro YUNOMAE

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

2013-08-28 14:42:52

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 08/11] trace-cmd: Apply the trace-msg protocol for communication between a server and clients

On Wed, 28 Aug 2013 20:30:17 +0900
Yoshihiro YUNOMAE <[email protected]> wrote:

> (2013/08/27 22:05), Steven Rostedt wrote:
> > On Tue, 27 Aug 2013 19:23:24 +0900
> > Yoshihiro YUNOMAE <[email protected]> wrote:
> >
> >> OK, let me check that. Even if the old server will receive "V2", the
> >> server will send port numbers instead of "V2" due to the old protocol.
> >> In that time, the new client will disconnect from the old server and
> >> the restarts with the old protocol. Is it OK?
> >
> > Yep, that's exactly what I meant ;-)
>
> I tried to implement the feature, but I found that sending just "V2"
> from the new client is inappropriate. This is because the old server
> doesn't respond to the client before receiving cpu numbers, page size,
> and options. So, when the new client sends the first message, it should
> send "V2\0<MAGIC_NUMBER>\00\0", I think. If so, the old server will
> understand the message as cpus=0, pagesize=<MAGIC_NUMBER>, options=0,
> and then it will send port numbers(actually \0). Note if <MAGIC_NUMBER>
> is zero, the old server will die.
>
> Can I implement the first message of the new client as
> "V2\0<MAGIC_NUMBER>\00\0"?
>

If it works for both old and new server with old and new client (and
all combinations) I'm fine with it.

Thanks,

-- Steve

2013-08-29 01:57:13

by Yoshihiro YUNOMAE

[permalink] [raw]
Subject: Re: Re: [RFC PATCH 08/11] trace-cmd: Apply the trace-msg protocol for communication between a server and clients

(2013/08/28 23:42), Steven Rostedt wrote:
> On Wed, 28 Aug 2013 20:30:17 +0900
> Yoshihiro YUNOMAE <[email protected]> wrote:
>
>> (2013/08/27 22:05), Steven Rostedt wrote:
>>> On Tue, 27 Aug 2013 19:23:24 +0900
>>> Yoshihiro YUNOMAE <[email protected]> wrote:
>>>
>>>> OK, let me check that. Even if the old server will receive "V2", the
>>>> server will send port numbers instead of "V2" due to the old protocol.
>>>> In that time, the new client will disconnect from the old server and
>>>> the restarts with the old protocol. Is it OK?
>>>
>>> Yep, that's exactly what I meant ;-)
>>
>> I tried to implement the feature, but I found that sending just "V2"
>> from the new client is inappropriate. This is because the old server
>> doesn't respond to the client before receiving cpu numbers, page size,
>> and options. So, when the new client sends the first message, it should
>> send "V2\0<MAGIC_NUMBER>\00\0", I think. If so, the old server will
>> understand the message as cpus=0, pagesize=<MAGIC_NUMBER>, options=0,
>> and then it will send port numbers(actually \0). Note if <MAGIC_NUMBER>
>> is zero, the old server will die.
>>
>> Can I implement the first message of the new client as
>> "V2\0<MAGIC_NUMBER>\00\0"?
>>
>
> If it works for both old and new server with old and new client (and
> all combinations) I'm fine with it.

OK, I'll submit trace-cmd patches as V2 in a few days.

Thanks,
Yoshihiro YUNOMAE

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]