LinuxLists.cc - [PATCH 1/3] relay: Fix 4 off-by-one errors occuring when writing to a CPU buffer.

2008-06-13 01:10:28

Subject: [PATCH 1/3] relay: Fix 4 off-by-one errors occuring when writing to a CPU buffer.

Semantically, start + offset is always lower than the total size in bytes,
so the code should check against equality to size, not size + 1. When
userspace did read() (but not necessarily restricted to this), this
resulted in all-zeros data at the end of the buffer.

Signed-off-by: Eduard - Gabriel Munteanu <[email protected]>
---
include/linux/relay.h | 4 ++--
kernel/relay.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/relay.h b/include/linux/relay.h
index 6cd8c44..8593ca1 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -202,7 +202,7 @@ static inline void relay_write(struct rchan *chan,

local_irq_save(flags);
buf = chan->buf[smp_processor_id()];
- if (unlikely(buf->offset + length > chan->subbuf_size))
+ if (unlikely(buf->offset + length >= chan->subbuf_size))
length = relay_switch_subbuf(buf, length);
memcpy(buf->data + buf->offset, data, length);
buf->offset += length;
@@ -228,7 +228,7 @@ static inline void __relay_write(struct rchan *chan,
struct rchan_buf *buf;

buf = chan->buf[get_cpu()];
- if (unlikely(buf->offset + length > buf->chan->subbuf_size))
+ if (unlikely(buf->offset + length >= buf->chan->subbuf_size))
length = relay_switch_subbuf(buf, length);
memcpy(buf->data + buf->offset, data, length);
buf->offset += length;
diff --git a/kernel/relay.c b/kernel/relay.c
index 7de644c..07f25e7 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -622,7 +622,7 @@ size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length)
if (unlikely(length > buf->chan->subbuf_size))
goto toobig;

- if (buf->offset != buf->chan->subbuf_size + 1) {
+ if (buf->offset != buf->chan->subbuf_size) {
buf->prev_padding = buf->chan->subbuf_size - buf->offset;
old_subbuf = buf->subbufs_produced % buf->chan->n_subbufs;
buf->padding[old_subbuf] = buf->prev_padding;
@@ -645,7 +645,7 @@ size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length)
new = buf->start + new_subbuf * buf->chan->subbuf_size;
buf->offset = 0;
if (!buf->chan->cb->subbuf_start(buf, new, old, buf->prev_padding)) {
- buf->offset = buf->chan->subbuf_size + 1;
+ buf->offset = buf->chan->subbuf_size;
return 0;
}
buf->data = new;
--
1.5.5.4

2008-06-14 04:40:53

by Tom Zanussi

[permalink] [raw]

Subject: Re: [PATCH 1/3] relay: Fix 4 off-by-one errors occuring when writing to a CPU buffer.

Hi,

On Fri, 2008-06-13 at 04:09 +0300, Eduard - Gabriel Munteanu wrote:
> Semantically, start + offset is always lower than the total size in bytes,
> so the code should check against equality to size, not size + 1. When
> userspace did read() (but not necessarily restricted to this), this
> resulted in all-zeros data at the end of the buffer.
>

I'm wondering if the all-zeroes at the end of the buffer might be
another case of the all-zeroes you were seeing due to cross-cpu reading
you decribed in the other patch. In any case, I'm pretty sure this
patch isn't doing what you think it is, and don't see how it could have
fixed the problem (see below). There may still be a bug somewhere, but
it would be good to be able to reproduce it. Does it happen even when
running on a single cpu?

> Signed-off-by: Eduard - Gabriel Munteanu <[email protected]>
> ---
> include/linux/relay.h | 4 ++--
> kernel/relay.c | 4 ++--
> 2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/relay.h b/include/linux/relay.h
> index 6cd8c44..8593ca1 100644
> --- a/include/linux/relay.h
> +++ b/include/linux/relay.h
> @@ -202,7 +202,7 @@ static inline void relay_write(struct rchan *chan,
>
> local_irq_save(flags);
> buf = chan->buf[smp_processor_id()];
> - if (unlikely(buf->offset + length > chan->subbuf_size))
> + if (unlikely(buf->offset + length >= chan->subbuf_size))

The original code basically says, "if the current write won't fit in the
remaining space in current subbuffer, move on to the next sub-buffer".
Your change makes it "if the current write won't fit in the current
subbuffer or if it exactly fits, move on to the next sub-buffer", which
isn't what we want.

> length = relay_switch_subbuf(buf, length);
> memcpy(buf->data + buf->offset, data, length);
> buf->offset += length;
> @@ -228,7 +228,7 @@ static inline void __relay_write(struct rchan *chan,
> struct rchan_buf *buf;
>
> buf = chan->buf[get_cpu()];
> - if (unlikely(buf->offset + length > buf->chan->subbuf_size))

Same thing here.

> + if (unlikely(buf->offset + length >= buf->chan->subbuf_size))
> length = relay_switch_subbuf(buf, length);
> memcpy(buf->data + buf->offset, data, length);
> buf->offset += length;
> diff --git a/kernel/relay.c b/kernel/relay.c
> index 7de644c..07f25e7 100644
> --- a/kernel/relay.c
> +++ b/kernel/relay.c
> @@ -622,7 +622,7 @@ size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length)
> if (unlikely(length > buf->chan->subbuf_size))
> goto toobig;
>
> - if (buf->offset != buf->chan->subbuf_size + 1) {
> + if (buf->offset != buf->chan->subbuf_size) {

This case, offset being 1 larger than the subbuf size, is how we note a
full sub-buffer, so changing this will break full-subbuffer cases.

> buf->prev_padding = buf->chan->subbuf_size - buf->offset;
> old_subbuf = buf->subbufs_produced % buf->chan->n_subbufs;
> buf->padding[old_subbuf] = buf->prev_padding;
> @@ -645,7 +645,7 @@ size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length)
> new = buf->start + new_subbuf * buf->chan->subbuf_size;
> buf->offset = 0;
> if (!buf->chan->cb->subbuf_start(buf, new, old, buf->prev_padding)) {
> - buf->offset = buf->chan->subbuf_size + 1;
> + buf->offset = buf->chan->subbuf_size;

Same here.

Tom

> return 0;
> }
> buf->data = new;

2008-06-14 14:53:10

by Eduard - Gabriel Munteanu

[permalink] [raw]

Subject: Re: [PATCH 1/3] relay: Fix 4 off-by-one errors occuring when writing to a CPU buffer.

On Fri, 13 Jun 2008 23:40:37 -0500
Tom Zanussi <[email protected]> wrote:

> I'm wondering if the all-zeroes at the end of the buffer might be
> another case of the all-zeroes you were seeing due to cross-cpu
> reading you decribed in the other patch. In any case, I'm pretty
> sure this patch isn't doing what you think it is, and don't see how
> it could have fixed the problem (see below). There may still be a
> bug somewhere, but it would be good to be able to reproduce it. Does
> it happen even when running on a single cpu?

Hi,

I noticed this problem after adding those spinlocks. As far as I can
tell, having (offset == subbuf_size + 1) at any given moment allows the
read() handler to see inconsistent offsets:
1. writer sets offset = subbuf_size + 1
2. writer releases spinlock
3. read() acquires spinlock and reads the wrong offset
4. read() releases spinlock
5. next writer corrects the offset at the next write

> This case, offset being 1 larger than the subbuf size, is how we note
> a full sub-buffer, so changing this will break full-subbuffer cases.

No, it won't. Maximum length messages result in the following condition:
start + offset == subbuf_size
This happens because a buffer of length subbuf_size actually ranges
from zero to (subbuf_size - 1) in regard to how it is addressed. Then,
subbuf_size + 1 isn't just outside the bounds, but one more byte off.
"Visual" example:
subbuf_size = 4
|[ ][ ][ ][ ]|[ ]
0 1 2 3 subbuf_size

So, a full subbufer means offset equals subbuf_size, that is, the next
empty slot is just outside the subbuffer.

Eduard

2008-06-16 05:22:42

by Tom Zanussi

[permalink] [raw]

Subject: Re: [PATCH 1/3] relay: Fix 4 off-by-one errors occuring when writing to a CPU buffer.

On Sat, 2008-06-14 at 17:52 +0300, Eduard - Gabriel Munteanu wrote:
> On Fri, 13 Jun 2008 23:40:37 -0500
> Tom Zanussi <[email protected]> wrote:
>
> > I'm wondering if the all-zeroes at the end of the buffer might be
> > another case of the all-zeroes you were seeing due to cross-cpu
> > reading you decribed in the other patch. In any case, I'm pretty
> > sure this patch isn't doing what you think it is, and don't see how
> > it could have fixed the problem (see below). There may still be a
> > bug somewhere, but it would be good to be able to reproduce it. Does
> > it happen even when running on a single cpu?
>
> Hi,
>
> I noticed this problem after adding those spinlocks. As far as I can
> tell, having (offset == subbuf_size + 1) at any given moment allows the
> read() handler to see inconsistent offsets:
> 1. writer sets offset = subbuf_size + 1
> 2. writer releases spinlock
> 3. read() acquires spinlock and reads the wrong offset
> 4. read() releases spinlock
> 5. next writer corrects the offset at the next write
>
> > This case, offset being 1 larger than the subbuf size, is how we note
> > a full sub-buffer, so changing this will break full-subbuffer cases.
>
> No, it won't. Maximum length messages result in the following condition:
> start + offset == subbuf_size
> This happens because a buffer of length subbuf_size actually ranges
> from zero to (subbuf_size - 1) in regard to how it is addressed. Then,
> subbuf_size + 1 isn't just outside the bounds, but one more byte off.
> "Visual" example:
> subbuf_size = 4
> |[ ][ ][ ][ ]|[ ]
> 0 1 2 3 subbuf_size
>
> So, a full subbufer means offset equals subbuf_size, that is, the next
> empty slot is just outside the subbuffer.
>

Yes, I understand that - what I meant was that the subbuf_size + 1
condition happens only in the buffer-full case (i.e. no reader or
lagging reader), but not during the normal filling of a subbuffer, which
you describe correctly.

So apparently what you're seeing is zeroes being read when there's a
buffer-full condition? If so, we need to figure out exactly why that's
happening to see whether your fix is really what's needed; I haven't
seen problems in the buffer-full case before and I think your fix would
break it even if it fixed your read problem. So it would be good to be
able to reproduce it first.

Tom

>
> Eduard

2008-06-21 02:07:20

by Eduard - Gabriel Munteanu

[permalink] [raw]

Subject: Re: [PATCH 1/3] relay: Fix 4 off-by-one errors occuring when writing to a CPU buffer.

On Mon, 16 Jun 2008 00:22:27 -0500
Tom Zanussi <[email protected]> wrote:

> So apparently what you're seeing is zeroes being read when there's a
> buffer-full condition? If so, we need to figure out exactly why
> that's happening to see whether your fix is really what's needed; I
> haven't seen problems in the buffer-full case before and I think your
> fix would break it even if it fixed your read problem. So it would
> be good to be able to reproduce it first.
>
> Tom

Hi,

Sorry for being so late, there were some exams I had to cope with.

Although I couldn't reproduce zeros, I've come up with something I'd
say is equally good. This has been done on a vanilla 2.6.26-rc6.

Please look at the testcase below and tell me what you think.

And yes, the changes in relay_write() and __relay_write() are inappropriate.

Cheers,
Eduard

---
diff --git a/kernel/Makefile b/kernel/Makefile
index 1c9938a..5e1a32b 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -9,7 +9,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o profile.o \
rcupdate.o extable.o params.o posix-timers.o \
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
- notifier.o ksysfs.o pm_qos_params.o sched_clock.o
+ notifier.o ksysfs.o pm_qos_params.o sched_clock.o relay_test.o

obj-$(CONFIG_SYSCTL_SYSCALL_CHECK) += sysctl_check.o
obj-$(CONFIG_STACKTRACE) += stacktrace.o
diff --git a/kernel/relay_test.c b/kernel/relay_test.c
new file mode 100644
index 0000000..38c2755
--- /dev/null
+++ b/kernel/relay_test.c
@@ -0,0 +1,92 @@
+#include <linux/debugfs.h>
+#include <linux/relay.h>
+
+static struct rchan *relay_test_chan;
+
+static void relay_test_do_test_writes(int test)
+{
+ u32 poison32 = 0x32323232;
+ u16 poison16 = 0x1616;
+
+ switch (test) {
+ case 0: /* Half full subbuf. */
+ relay_write(relay_test_chan, &poison16, 2);
+ break;
+ case 1: /* Full subbuf. */
+ relay_write(relay_test_chan, &poison32, 4);
+ break;
+ case 2: /* Crossing subbuf boundaries. */
+ relay_write(relay_test_chan, &poison16, 2);
+ relay_write(relay_test_chan, &poison32, 4);
+ break;
+ case 3: /* Full buffer. */
+ relay_write(relay_test_chan, &poison32, 4);
+ relay_write(relay_test_chan, &poison32, 4);
+ break;
+ default:
+ printk(KERN_INFO "relay_test: No such test!\n");
+ }
+}
+
+static ssize_t relay_test_trigger_write(struct file *filp,
+ const char __user *userdata,
+ size_t count, loff_t *ppos)
+{
+ char data[5];
+ unsigned long test, err, bytes = min(count, (size_t) 4);
+
+ err = copy_from_user(&data, userdata, bytes);
+ if (err) {
+ printk(KERN_ERR "relay_test: Error reading from user!\n");
+ return -1;
+ }
+ data[4] = '\0';
+ test = simple_strtoul(data, NULL, 10);
+
+ printk(KERN_INFO "relay_test: Doing test %lu on CPU %lu.\n",
+ test, get_cpu());
+ put_cpu();
+ relay_test_do_test_writes(test);
+
+ return bytes;
+}
+
+static struct file_operations trigger_fops = {
+ .open = nonseekable_open,
+ .write = relay_test_trigger_write,
+};
+
+static struct dentry *
+relay_test_create_buf_file(const char *filename, struct dentry *parent,
+ int mode, struct rchan_buf *buf, int *is_global)
+{
+ return debugfs_create_file(filename, mode, parent,
+ buf, &relay_file_operations);
+}
+
+static struct rchan_callbacks relay_test_callbacks = {
+ .create_buf_file = relay_test_create_buf_file,
+};
+
+static int relay_test_init(void)
+{
+ struct dentry *trigger;
+
+ trigger = debugfs_create_file("relay_test_trigger", 0600, NULL,
+ NULL, &trigger_fops);
+ if (!trigger) {
+ printk(KERN_ERR "relay_test: could not create trigger debugfs file!\n");
+ return 1;
+ }
+
+ relay_test_chan = relay_open("relay_test_cpu", NULL,
+ 4, 2, &relay_test_callbacks, NULL);
+ if (!relay_test_chan) {
+ printk(KERN_ERR "relay_test: could not open channel!\n");
+ return 1;
+ }
+
+ return 0;
+}
+
+late_initcall(relay_test_init);
---

Results, first run:

Test 0:
0000000 1616
0000002
---
Test 1:
0000000 3232 3232
0000004
---
Test 2:
0000000 1616
0000002
---
Test 3:
0000000 3232 3232 3232 3232
0000008
---
Test 0, 0:
---
Test 0, 0, 0, 0:
0000000 1616 1616 1616 1616
0000008
---
What the kernel said:
relay_test: Doing test 0 on CPU 1.
relay_test: Doing test 1 on CPU 1.
relay_test: Doing test 2 on CPU 1.
relay_test: Doing test 3 on CPU 0.
relay_test: Doing test 0 on CPU 0.
relay_test: Doing test 0 on CPU 0.
relay_test: Doing test 0 on CPU 0.
relay_test: Doing test 0 on CPU 0.
relay_test: Doing test 0 on CPU 0.
relay_test: Doing test 0 on CPU 0.

Results, second run, without rebooting in between:

Test 0:
0000000 1616
0000002
---
Test 1:
0000000 3232 3232
0000004
---
Test 2:
0000000 1616
0000002
---
Test 3:
---
Test 0, 0:
0000000 3232 3232 3232 3232
0000008
---
Test 0, 0, 0, 0:
---
What the kernel said:
relay_test: Doing test 0 on CPU 0.
relay_test: Doing test 1 on CPU 1.
relay_test: Doing test 2 on CPU 1.
relay_test: Doing test 3 on CPU 1.
relay_test: Doing test 0 on CPU 1.
relay_test: Doing test 0 on CPU 1.
relay_test: Doing test 0 on CPU 1.
relay_test: Doing test 0 on CPU 1.
relay_test: Doing test 0 on CPU 1.
relay_test: Doing test 0 on CPU 1.

The ugly script used to do this (on the second run I edited the file to
remove previous dmesg stuff):

#!/bin/bash
echo "Test 0:" > /output.vanilla
echo "0" > /debug/relay_test_trigger
sleep 1
hexdump /debug/relay_test_cpu* >> /output.vanilla
sleep 1
echo "---" >> /output.vanilla

echo "Test 1:" >> /output.vanilla
echo "1" > /debug/relay_test_trigger
sleep 1
hexdump /debug/relay_test_cpu* >> /output.vanilla
sleep 1
echo "---" >> /output.vanilla

echo "Test 2:" >> /output.vanilla
echo "2" > /debug/relay_test_trigger
sleep 1
hexdump /debug/relay_test_cpu* >> /output.vanilla
sleep 1
echo "---" >> /output.vanilla

echo "Test 3:" >> /output.vanilla
echo "3" > /debug/relay_test_trigger
sleep 1
hexdump /debug/relay_test_cpu* >> /output.vanilla
sleep 1
echo "---" >> /output.vanilla

echo "Test 0, 0:" >> /output.vanilla
echo "0" > /debug/relay_test_trigger
echo "0" > /debug/relay_test_trigger
sleep 1
hexdump /debug/relay_test_cpu* >> /output.vanilla
sleep 1
echo "---" >> /output.vanilla

echo "Test 0, 0, 0, 0:" >> /output.vanilla
echo "0" > /debug/relay_test_trigger
echo "0" > /debug/relay_test_trigger
echo "0" > /debug/relay_test_trigger
echo "0" > /debug/relay_test_trigger
sleep 1
hexdump /debug/relay_test_cpu* >> /output.vanilla
sleep 1
echo "---" >> /output.vanilla

echo "What the kernel said:" >> /output.vanilla
dmesg | grep relay_test >> /output.vanilla

2008-07-24 05:09:58

by Tom Zanussi

[permalink] [raw]

Subject: Re: [PATCH 1/3] relay: Fix 4 off-by-one errors occuring when writing to a CPU buffer.

On Sat, 2008-06-21 at 05:06 +0300, Eduard - Gabriel Munteanu wrote:
> On Mon, 16 Jun 2008 00:22:27 -0500
> Tom Zanussi <[email protected]> wrote:
>
> > So apparently what you're seeing is zeroes being read when there's a
> > buffer-full condition? If so, we need to figure out exactly why
> > that's happening to see whether your fix is really what's needed; I
> > haven't seen problems in the buffer-full case before and I think your
> > fix would break it even if it fixed your read problem. So it would
> > be good to be able to reproduce it first.
> >
> > Tom
>
> Hi,
>
> Sorry for being so late, there were some exams I had to cope with.
>
> Although I couldn't reproduce zeros, I've come up with something I'd
> say is equally good. This has been done on a vanilla 2.6.26-rc6.
>
> Please look at the testcase below and tell me what you think.
>

Hi,

Yes, this is a bug - thanks for sending the nice test case. This patch
should fix it. BTW, if the output still looks a little different from
what you were expecting, it might make more sense after adding a
relay_test printk for each dropped event, something like this:

diff --git a/kernel/relay_test.c b/kernel/relay_test.c
index 38c2755..2d4ecb1 100644
--- a/kernel/relay_test.c
+++ b/kernel/relay_test.c
@@ -64,8 +64,23 @@ relay_test_create_buf_file(const char *filename, struct dentry *parent,
buf, &relay_file_operations);
}

+static int relay_test_subbuf_start(struct rchan_buf *buf,
+ void *subbuf,
+ void *prev_subbuf,
+ size_t prev_padding)
+{
+ if (relay_buf_full(buf)) {
+ printk(KERN_INFO "relay_test: Dropped event on CPU %lu.\n",
+ smp_processor_id());
+ return 0;
+ }
+
+ return 1;
+}
+
static struct rchan_callbacks relay_test_callbacks = {
.create_buf_file = relay_test_create_buf_file,
+ .subbuf_start = relay_test_subbuf_start,
};

static int relay_test_init(void)

Thanks,

Tom

---

In relay's current read implementation, if the buffer is completely full
but hasn't triggered the buffer-full condition (i.e. the last write
didn't cross the subbuffer boundary) and the last subbuffer is exactly
full, the subbuffer accounting code erroneously finds nothing available.
This patch fixes the problem.

Signed-off-by: Tom Zanussi <[email protected]>

diff --git a/kernel/relay.c b/kernel/relay.c
index 7de644c..f5a5a96 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -832,6 +832,10 @@ static void relay_file_read_consume(struct rchan_buf *buf,
size_t n_subbufs = buf->chan->n_subbufs;
size_t read_subbuf;

+ if (buf->subbufs_produced == buf->subbufs_consumed &&
+ buf->offset == buf->bytes_consumed)
+ return;
+
if (buf->bytes_consumed + bytes_consumed > subbuf_size) {
relay_subbufs_consumed(buf->chan, buf->cpu, 1);
buf->bytes_consumed = 0;
@@ -863,6 +867,8 @@ static int relay_file_read_avail(struct rchan_buf *buf, size_t read_pos)

relay_file_read_consume(buf, read_pos, 0);

+ consumed = buf->subbufs_consumed;
+
if (unlikely(buf->offset > subbuf_size)) {
if (produced == consumed)
return 0;
@@ -881,8 +887,12 @@ static int relay_file_read_avail(struct rchan_buf *buf, size_t read_pos)
if (consumed > produced)
produced += n_subbufs * subbuf_size;

- if (consumed == produced)
+ if (consumed == produced) {
+ if (buf->offset == subbuf_size &&
+ buf->subbufs_produced > buf->subbufs_consumed)
+ return 1;
return 0;
+ }

return 1;
}

2008-07-30 17:50:48

by Eduard - Gabriel Munteanu

[permalink] [raw]

Subject: Re: [PATCH 1/3] relay: Fix 4 off-by-one errors occuring when writing to a CPU buffer.

On Thu, Jul 24, 2008 at 12:09:36AM -0500, Tom Zanussi wrote:
>
> On Sat, 2008-06-21 at 05:06 +0300, Eduard - Gabriel Munteanu wrote:
> > On Mon, 16 Jun 2008 00:22:27 -0500
> > Tom Zanussi <[email protected]> wrote:
> >
> > > So apparently what you're seeing is zeroes being read when there's a
> > > buffer-full condition? If so, we need to figure out exactly why
> > > that's happening to see whether your fix is really what's needed; I
> > > haven't seen problems in the buffer-full case before and I think your
> > > fix would break it even if it fixed your read problem. So it would
> > > be good to be able to reproduce it first.
> > >
> > > Tom
> >
> > Hi,
> >
> > Sorry for being so late, there were some exams I had to cope with.
> >
> > Although I couldn't reproduce zeros, I've come up with something I'd
> > say is equally good. This has been done on a vanilla 2.6.26-rc6.
> >
> > Please look at the testcase below and tell me what you think.
> >
>
> Hi,
>
> Yes, this is a bug - thanks for sending the nice test case. This patch
> should fix it. BTW, if the output still looks a little different from
> what you were expecting, it might make more sense after adding a
> relay_test printk for each dropped event, something like this:

The all-zeros problem happened when poll() with infinite timeout is not
used before reading. I fixed the user app I was writing, but the docs should
really reflect this. Relay has a kinda strange interface.

Cheers,
Eduard

2008-08-13 06:17:17

by Tom Zanussi

[permalink] [raw]

Subject: Re: [PATCH 1/3] relay: Fix 4 off-by-one errors occuring when writing to a CPU buffer.

On Wed, 2008-07-30 at 20:48 +0300, Eduard - Gabriel Munteanu wrote:
> On Thu, Jul 24, 2008 at 12:09:36AM -0500, Tom Zanussi wrote:
> >
> > On Sat, 2008-06-21 at 05:06 +0300, Eduard - Gabriel Munteanu wrote:
> > > On Mon, 16 Jun 2008 00:22:27 -0500
> > > Tom Zanussi <[email protected]> wrote:
> > >
> > > > So apparently what you're seeing is zeroes being read when there's a
> > > > buffer-full condition? If so, we need to figure out exactly why
> > > > that's happening to see whether your fix is really what's needed; I
> > > > haven't seen problems in the buffer-full case before and I think your
> > > > fix would break it even if it fixed your read problem. So it would
> > > > be good to be able to reproduce it first.
> > > >
> > > > Tom
> > >
> > > Hi,
> > >
> > > Sorry for being so late, there were some exams I had to cope with.
> > >
> > > Although I couldn't reproduce zeros, I've come up with something I'd
> > > say is equally good. This has been done on a vanilla 2.6.26-rc6.
> > >
> > > Please look at the testcase below and tell me what you think.
> > >
> >
> > Hi,
> >
> > Yes, this is a bug - thanks for sending the nice test case. This patch
> > should fix it. BTW, if the output still looks a little different from
> > what you were expecting, it might make more sense after adding a
> > relay_test printk for each dropped event, something like this:
>
> The all-zeros problem happened when poll() with infinite timeout is not
> used before reading. I fixed the user app I was writing, but the docs should
> really reflect this. Relay has a kinda strange interface.
>
>

Hi,

You shouldn't have to use poll() with infinite timeout - see for example
the blktrace user code. I'm not sure why you're seeing these problems,
but if you're interested, even if only for comparison, you can try using
the blktrace user code for kmemtrace too - I recently turned the
blktrace userspace code into a generic tracing library called libutt,
which you can get here:

http://utt.sourceforge.net

Here are some patches to your code that take a first stab at kmemtrace
using utt:

http://utt.sourceforge.net/kmemtrace-utt-kernel.patch
http://utt.sourceforge.net/kmemtrace-utt-user.patch

Most of the kernel patch just adds tracing controls for the purpose of
controlling the trace:

root@positron:/home/trz/kmemtrace-user# ls -al /sys/kernel/debug/kmem/trace
total 0
drwxr-xr-x 2 root root 0 2008-08-12 22:48 .
drwxr-xr-x 3 root root 0 2008-08-12 22:48 ..
-r-------- 1 root root 0 2008-08-12 22:48 abi_version
---------- 1 root root 0 2008-08-12 22:48 buf_nr
---------- 1 root root 0 2008-08-12 22:48 buf_size
---------- 1 root root 0 2008-08-12 22:48 create
---------- 1 root root 0 2008-08-12 22:48 destroy
-r--r--r-- 1 root root 0 2008-08-12 22:48 dropped
---------- 1 root root 0 2008-08-12 22:48 event_mask
---------- 1 root root 0 2008-08-12 22:48 start
---------- 1 root root 0 2008-08-12 22:48 stop
-r-------- 1 root root 9175040 2008-08-12 22:48 trace0
-r-------- 1 root root 1543192 2008-08-12 22:48 trace1

You'll also notice that the directory structure is a little different
(/sys/kernel/debug/kmem/trace instead of /sys/kernel/debug/kmemtrace)
and that the per-cpu files are named traceX instead of cpuX - this is
just because it follows the blktrace conventions.

The other change to the kernel code is that it allows the buffers (and
buffer sizes) to be created on demand instead of always allocated at
boot time. The boot-time buffer and tracing still works as before, but
it actually turns tracing off in kmem_setup_late(), since there's really
no point in continuing logging and overrunning the buffers until
kmemtraced is started up to read it. Having the buffers allocated and
tracing started by kmemtraced also makes more sense to me, rather than
the current behavior of continually logging and overrunning the buffers
forever regardless of whether something is reading them. Maybe that's
not exactly what you'd want, but it made more sense to me, so that's
what I did for this.

So anyway, to use it, after you boot up, you can do:

root@positron:/home/trz/kmemtrace-user# ./kmemtraced-utt -t
^CChannel: trace
CPU 0: 0 events, 8960 KiB data
CPU 1: 0 events, 1508 KiB data
Total: 0 events (dropped 3124), 10468 KiB data
You have dropped events, consider using a larger buffer size (-b)

(yeah, I know, it shouldn't say 0 events, there were plenty!)

This will read the contents of the per-cpu relay files containing what
you logged during boot into trace.kmem.0 and trace.kmem.1 - you'll have
to rename them to cpu0.out,etc to use them with your analysis tools.
The ctrl-c does the same thing as a normal trace i.e. stops the trace,
reads the files and destroys the buffers.

You can start a new trace at any time using:

root@positron:/home/trz/kmemtrace-user# ./kmemtraced-utt
^CChannel: trace
CPU 0: 0 events, 316 KiB data
CPU 1: 0 events, 223 KiB data
Total: 0 events (dropped 0), 538 KiB data

Again, ctrl-c to stop the trace and clean up.

Also, because you're using part of the blktrace code, you get some other
capabilities for free, such as tracing over the network. This starts
the trace server, which will receive the trace data:

root@positron:/home/trz/kmemtrace-user# ./kmemtraced-utt -l
server: waiting for connections...
server: connection from 127.0.0.1

Now run the trace on the client, and ctrl-c to stop it:

root@positron:/home/trz/kmemtrace-user# ./kmemtraced-utt -h localhost
kmem: connecting to localhost
kmem: connected!
^CChannel: trace
CPU 0: 0 events, 581 KiB data
CPU 1: 0 events, 399 KiB data
Total: 0 events (dropped 0), 980 KiB data

And you should see something like this on the server as well:

server: end of run for trace
Channel: trace
CPU 0: 0 events, 581 KiB data
CPU 1: 0 events, 399 KiB data
Total: 0 events (dropped 0), 980 KiB data

The trace files will be in a directory named using the IP address and
time:

root@positron:/home/trz/kmemtrace-user# ls -al 127.0.0.1-2008-08-13-04:12:39
total 1000
drwxr-xr-x 2 root root 4096 2008-08-12 23:12 .
drwxr-xr-x 6 trz trz 4096 2008-08-12 23:12 ..
-rw-r--r-- 1 root root 594944 2008-08-12 23:13 trace.kmem.0
-rw-r--r-- 1 root root 408352 2008-08-12 23:13 trace.kmem.1

libutt isn't completely baked at this point, and the API will probably
change depending on feedback, but still it seems to work pretty well at
this point. If you decide to try it out, please let me know if you run
into problems with it too.

Tom

> Cheers,
> Eduard
>

2008-08-14 16:38:21

by Eduard - Gabriel Munteanu

[permalink] [raw]

Subject: Re: [PATCH 1/3] relay: Fix 4 off-by-one errors occuring when writing to a CPU buffer.

On Wed, Aug 13, 2008 at 01:16:59AM -0500, Tom Zanussi wrote:
>
> Hi,

Hi,

> You shouldn't have to use poll() with infinite timeout - see for example
> the blktrace user code. I'm not sure why you're seeing these problems,
> but if you're interested, even if only for comparison, you can try using
> the blktrace user code for kmemtrace too - I recently turned the
> blktrace userspace code into a generic tracing library called libutt,
> which you can get here:
>
> http://utt.sourceforge.net
>
> Here are some patches to your code that take a first stab at kmemtrace
> using utt:
>
> http://utt.sourceforge.net/kmemtrace-utt-kernel.patch
> http://utt.sourceforge.net/kmemtrace-utt-user.patch

Thanks a lot, I'll have a look at them and try this instead.

> Most of the kernel patch just adds tracing controls for the purpose of
> controlling the trace:
>
> root@positron:/home/trz/kmemtrace-user# ls -al /sys/kernel/debug/kmem/trace
> total 0
> drwxr-xr-x 2 root root 0 2008-08-12 22:48 .
> drwxr-xr-x 3 root root 0 2008-08-12 22:48 ..
> -r-------- 1 root root 0 2008-08-12 22:48 abi_version
> ---------- 1 root root 0 2008-08-12 22:48 buf_nr
> ---------- 1 root root 0 2008-08-12 22:48 buf_size
> ---------- 1 root root 0 2008-08-12 22:48 create
> ---------- 1 root root 0 2008-08-12 22:48 destroy
> -r--r--r-- 1 root root 0 2008-08-12 22:48 dropped
> ---------- 1 root root 0 2008-08-12 22:48 event_mask
> ---------- 1 root root 0 2008-08-12 22:48 start
> ---------- 1 root root 0 2008-08-12 22:48 stop
> -r-------- 1 root root 9175040 2008-08-12 22:48 trace0
> -r-------- 1 root root 1543192 2008-08-12 22:48 trace1
>
> You'll also notice that the directory structure is a little different
> (/sys/kernel/debug/kmem/trace instead of /sys/kernel/debug/kmemtrace)
> and that the per-cpu files are named traceX instead of cpuX - this is
> just because it follows the blktrace conventions.
>
> The other change to the kernel code is that it allows the buffers (and
> buffer sizes) to be created on demand instead of always allocated at
> boot time. The boot-time buffer and tracing still works as before, but
> it actually turns tracing off in kmem_setup_late(), since there's really
> no point in continuing logging and overrunning the buffers until
> kmemtraced is started up to read it. Having the buffers allocated and
> tracing started by kmemtraced also makes more sense to me, rather than
> the current behavior of continually logging and overrunning the buffers
> forever regardless of whether something is reading them. Maybe that's
> not exactly what you'd want, but it made more sense to me, so that's
> what I did for this.
>
> So anyway, to use it, after you boot up, you can do:
>
> root@positron:/home/trz/kmemtrace-user# ./kmemtraced-utt -t
> ^CChannel: trace
> CPU 0: 0 events, 8960 KiB data
> CPU 1: 0 events, 1508 KiB data
> Total: 0 events (dropped 3124), 10468 KiB data
> You have dropped events, consider using a larger buffer size (-b)
>
> (yeah, I know, it shouldn't say 0 events, there were plenty!)
>
> This will read the contents of the per-cpu relay files containing what
> you logged during boot into trace.kmem.0 and trace.kmem.1 - you'll have
> to rename them to cpu0.out,etc to use them with your analysis tools.
> The ctrl-c does the same thing as a normal trace i.e. stops the trace,
> reads the files and destroys the buffers.
>
> You can start a new trace at any time using:
>
> root@positron:/home/trz/kmemtrace-user# ./kmemtraced-utt
> ^CChannel: trace
> CPU 0: 0 events, 316 KiB data
> CPU 1: 0 events, 223 KiB data
> Total: 0 events (dropped 0), 538 KiB data
>
> Again, ctrl-c to stop the trace and clean up.
>
> Also, because you're using part of the blktrace code, you get some other
> capabilities for free, such as tracing over the network. This starts
> the trace server, which will receive the trace data:
>
> root@positron:/home/trz/kmemtrace-user# ./kmemtraced-utt -l
> server: waiting for connections...
> server: connection from 127.0.0.1
>
> Now run the trace on the client, and ctrl-c to stop it:
>
> root@positron:/home/trz/kmemtrace-user# ./kmemtraced-utt -h localhost
> kmem: connecting to localhost
> kmem: connected!
> ^CChannel: trace
> CPU 0: 0 events, 581 KiB data
> CPU 1: 0 events, 399 KiB data
> Total: 0 events (dropped 0), 980 KiB data
>
> And you should see something like this on the server as well:
>
> server: end of run for trace
> Channel: trace
> CPU 0: 0 events, 581 KiB data
> CPU 1: 0 events, 399 KiB data
> Total: 0 events (dropped 0), 980 KiB data
>
> The trace files will be in a directory named using the IP address and
> time:
>
> root@positron:/home/trz/kmemtrace-user# ls -al 127.0.0.1-2008-08-13-04:12:39
> total 1000
> drwxr-xr-x 2 root root 4096 2008-08-12 23:12 .
> drwxr-xr-x 6 trz trz 4096 2008-08-12 23:12 ..
> -rw-r--r-- 1 root root 594944 2008-08-12 23:13 trace.kmem.0
> -rw-r--r-- 1 root root 408352 2008-08-12 23:13 trace.kmem.1
>
>
> libutt isn't completely baked at this point, and the API will probably
> change depending on feedback, but still it seems to work pretty well at
> this point. If you decide to try it out, please let me know if you run
> into problems with it too.
>
> Tom

Looks nice, as I had already implemented a way to turn off kmemtrace
when logging. I'll get back to you in a few days, I'm not close to home
right now.

BTW, did the kmemtrace user app work for you? Pekka seemed to have
issues with compiling it.

Thanks,
Eduard

2008-08-15 04:32:13

by Tom Zanussi

[permalink] [raw]

Subject: Re: [PATCH 1/3] relay: Fix 4 off-by-one errors occuring when writing to a CPU buffer.

Hi,

On Thu, 2008-08-14 at 19:35 +0300, Eduard - Gabriel Munteanu wrote:

> BTW, did the kmemtrace user app work for you? Pekka seemed to have
> issues with compiling it.
>

Just a small problem compiling - there was no Makefile or configure, and
the README doesn't say anything about autogen.sh, though I later saw
something about it in the kernel documentation. Anyway, autogen.sh
failed because I didn't have autoconf/automake installed; after doing
that I didn't have a problem compiling. I did notice a couple printf
format warnings, though.

I then ran kmemtraced and ran the check and report on the output. The
check passed but the report tool just spewed a lot of error lines with
the same caller. I looked a little closer and realized there was
something really wrong with the reporting tool e.g. a tiny trace of
about 10k, probably fewer than 200 events (a hexdump of the data looked
ok to me), produced a report of about 800k. Remembering the printf
format warnings, I realized it probably hadn't been run on a 32-bit
system lately, so I tried it all again on a 64-bit system, and there it
seemed to work ok, or at least the output looked reasonable, to my
untrained eye.

Tom