Subject: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.

Suppose CPU0, as instructed by userspace, reads CPU1's data. If the
latter is logging data, it's not enough to disable IRQs or preemption
to protect the data. Add a per-buffer (thus per-CPU) spinlock to
prevent concurrent access. A spinlock was chosen because data must be
logged (and thus the buffer locked) from interrupt context. The problem
was revealed while working on kmemtrace, where some events appeared
out-of-order or all-zeros, even though the necessary precautions had
already been taken.

Signed-off-by: Eduard - Gabriel Munteanu <[email protected]>
---
 include/linux/relay.h |   10 +++++++---
 kernel/relay.c        |   11 ++++++++++-
2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/include/linux/relay.h b/include/linux/relay.h
index 8593ca1..a3a03e7 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -38,6 +38,7 @@ struct rchan_buf
 	size_t subbufs_produced;	/* count of sub-buffers produced */
 	size_t subbufs_consumed;	/* count of sub-buffers consumed */
 	struct rchan *chan;		/* associated channel */
+	spinlock_t rw_lock;		/* protects buffer during R/W */
 	wait_queue_head_t read_wait;	/* reader wait queue */
 	struct timer_list timer;	/* reader wake-up timer */
 	struct dentry *dentry;		/* channel file dentry */
@@ -200,13 +201,14 @@ static inline void relay_write(struct rchan *chan,
 	unsigned long flags;
 	struct rchan_buf *buf;
 
-	local_irq_save(flags);
-	buf = chan->buf[smp_processor_id()];
+	buf = chan->buf[get_cpu()];
+	spin_lock_irqsave(&buf->rw_lock, flags);
 	if (unlikely(buf->offset + length >= chan->subbuf_size))
 		length = relay_switch_subbuf(buf, length);
 	memcpy(buf->data + buf->offset, data, length);
 	buf->offset += length;
-	local_irq_restore(flags);
+	spin_unlock_irqrestore(&buf->rw_lock, flags);
+	put_cpu();
 }
 
 /**
@@ -228,10 +230,12 @@ static inline void __relay_write(struct rchan *chan,
 	struct rchan_buf *buf;
 
 	buf = chan->buf[get_cpu()];
+	spin_lock(&buf->rw_lock);
 	if (unlikely(buf->offset + length >= buf->chan->subbuf_size))
 		length = relay_switch_subbuf(buf, length);
 	memcpy(buf->data + buf->offset, data, length);
 	buf->offset += length;
+	spin_unlock(&buf->rw_lock);
 	put_cpu();
 }
 
diff --git a/kernel/relay.c b/kernel/relay.c
index 07f25e7..250a27a 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -430,6 +430,8 @@ static struct rchan_buf *relay_open_buf(struct rchan *chan, unsigned int cpu)
 	if (!buf)
 		goto free_name;
 
+	spin_lock_init(&buf->rw_lock);
+
 	buf->cpu = cpu;
 	__relay_reset(buf, 1);
 
@@ -1013,11 +1015,13 @@ static ssize_t relay_file_read_subbufs(struct file *filp, loff_t *ppos,
 	struct rchan_buf *buf = filp->private_data;
 	size_t read_start, avail;
 	int ret;
+	unsigned long flags;
 
 	if (!desc->count)
 		return 0;
 
 	mutex_lock(&filp->f_path.dentry->d_inode->i_mutex);
+	spin_lock_irqsave(&buf->rw_lock, flags);
 	do {
 		if (!relay_file_read_avail(buf, *ppos))
 			break;
@@ -1028,15 +1032,20 @@ static ssize_t relay_file_read_subbufs(struct file *filp, loff_t *ppos,
 			break;
 
 		avail = min(desc->count, avail);
+		/* subbuf_actor may sleep, so release the spinlock for now */
+		spin_unlock_irqrestore(&buf->rw_lock, flags);
 		ret = subbuf_actor(read_start, buf, avail, desc, actor);
 		if (desc->error < 0)
-			break;
+			goto out;
+		spin_lock_irqsave(&buf->rw_lock, flags);
 
 		if (ret) {
 			relay_file_read_consume(buf, read_start, ret);
 			*ppos = relay_file_read_end_pos(buf, read_start, ret);
 		}
 	} while (desc->count && ret);
+	spin_unlock_irqrestore(&buf->rw_lock, flags);
+out:
 	mutex_unlock(&filp->f_path.dentry->d_inode->i_mutex);
 
 	return desc->written;
--
1.5.5.4


2008-06-14 04:27:01

by Tom Zanussi

Subject: Re: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.

Hi,

On Fri, 2008-06-13 at 04:09 +0300, Eduard - Gabriel Munteanu wrote:
> Suppose CPU0, as instructed by userspace, reads CPU1's data. If the
> latter is logging data, it's not enough to disable IRQs or preemption
> to protect the data. Added a per-buffer (thus per-CPU) spinlock to
> prevent concurrent access. The choice of using a spinlock is motivated
> by the need to log data (and thus lock the buffer) from interrupt
> context. The problem was revealed when working on kmemtrace, where some
> events were seemingly out-of-order or just all-zeros, even though the
> necessary precautions had already been taken.
>

Alternatively, you could get rid of the problem by making sure CPU0
never reads CPU1's data, by having the userspace reader use per-cpu
threads and using sched_setaffinity() to pin each thread to a given cpu.
See for example, the blktrace code, which does this.

Actually, in a few days or so I'm planning on releasing the first cut of
a library that makes this and all the rest of the nice blktrace
userspace code available to other tracing applications, not just
blktrace. Hopefully it would be something that you'd be able to use for
kmemtrace as well; in that case, you'd just use the library and not have
to worry about these details.

Tom


> Signed-off-by: Eduard - Gabriel Munteanu <[email protected]>
> ---
> include/linux/relay.h | 10 +++++++---
> kernel/relay.c | 11 ++++++++++-
> 2 files changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/relay.h b/include/linux/relay.h
> index 8593ca1..a3a03e7 100644
> --- a/include/linux/relay.h
> +++ b/include/linux/relay.h
> @@ -38,6 +38,7 @@ struct rchan_buf
> size_t subbufs_produced; /* count of sub-buffers produced */
> size_t subbufs_consumed; /* count of sub-buffers consumed */
> struct rchan *chan; /* associated channel */
> + spinlock_t rw_lock; /* protects buffer during R/W */
> wait_queue_head_t read_wait; /* reader wait queue */
> struct timer_list timer; /* reader wake-up timer */
> struct dentry *dentry; /* channel file dentry */
> @@ -200,13 +201,14 @@ static inline void relay_write(struct rchan *chan,
> unsigned long flags;
> struct rchan_buf *buf;
>
> - local_irq_save(flags);
> - buf = chan->buf[smp_processor_id()];
> + buf = chan->buf[get_cpu()];
> + spin_lock_irqsave(&buf->rw_lock, flags);
> if (unlikely(buf->offset + length >= chan->subbuf_size))
> length = relay_switch_subbuf(buf, length);
> memcpy(buf->data + buf->offset, data, length);
> buf->offset += length;
> - local_irq_restore(flags);
> + spin_unlock_irqrestore(&buf->rw_lock, flags);
> + put_cpu();
> }
>
> /**
> @@ -228,10 +230,12 @@ static inline void __relay_write(struct rchan *chan,
> struct rchan_buf *buf;
>
> buf = chan->buf[get_cpu()];
> + spin_lock(&buf->rw_lock);
> if (unlikely(buf->offset + length >= buf->chan->subbuf_size))
> length = relay_switch_subbuf(buf, length);
> memcpy(buf->data + buf->offset, data, length);
> buf->offset += length;
> + spin_unlock(&buf->rw_lock);
> put_cpu();
> }
>
> diff --git a/kernel/relay.c b/kernel/relay.c
> index 07f25e7..250a27a 100644
> --- a/kernel/relay.c
> +++ b/kernel/relay.c
> @@ -430,6 +430,8 @@ static struct rchan_buf *relay_open_buf(struct rchan *chan, unsigned int cpu)
> if (!buf)
> goto free_name;
>
> + spin_lock_init(&buf->rw_lock);
> +
> buf->cpu = cpu;
> __relay_reset(buf, 1);
>
> @@ -1013,11 +1015,13 @@ static ssize_t relay_file_read_subbufs(struct file *filp, loff_t *ppos,
> struct rchan_buf *buf = filp->private_data;
> size_t read_start, avail;
> int ret;
> + unsigned long flags;
>
> if (!desc->count)
> return 0;
>
> mutex_lock(&filp->f_path.dentry->d_inode->i_mutex);
> + spin_lock_irqsave(&buf->rw_lock, flags);
> do {
> if (!relay_file_read_avail(buf, *ppos))
> break;
> @@ -1028,15 +1032,20 @@ static ssize_t relay_file_read_subbufs(struct file *filp, loff_t *ppos,
> break;
>
> avail = min(desc->count, avail);
> + /* subbuf_actor may sleep, so release the spinlock for now */
> + spin_unlock_irqrestore(&buf->rw_lock, flags);
> ret = subbuf_actor(read_start, buf, avail, desc, actor);
> if (desc->error < 0)
> - break;
> + goto out;
> + spin_lock_irqsave(&buf->rw_lock, flags);
>
> if (ret) {
> relay_file_read_consume(buf, read_start, ret);
> *ppos = relay_file_read_end_pos(buf, read_start, ret);
> }
> } while (desc->count && ret);
> + spin_unlock_irqrestore(&buf->rw_lock, flags);
> +out:
> mutex_unlock(&filp->f_path.dentry->d_inode->i_mutex);
>
> return desc->written;

Subject: Re: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.

On Fri, 13 Jun 2008 23:26:41 -0500
Tom Zanussi <[email protected]> wrote:

> Alternatively, you could get rid of the problem by making sure CPU0
> never reads CPU1's data, by having the userspace reader use per-cpu
> threads and using sched_setaffinity() to pin each thread to a given
> cpu. See for example, the blktrace code, which does this.

Yes, and performance-wise this is better. Though I'm not sure setting
affinity is 100% safe. Will the thread be migrated soon enough, so we
don't read cross-CPU? The point is I'm not sure how hard this is
enforced.

However, I suggest this patch should go in, for two reasons:
1. It provides expected behavior in any such situation.
2. It adds (almost) no overhead when used in conjunction with setting CPU
affinity. When the writer acquires an uncontended spinlock, it does not
busy-wait, so the cost reduces to disabling IRQs (as in relay_write()).

> Actually, in a few days or so I'm planning on releasing the first cut
> of a library that makes this and all the rest of the nice blktrace
> userspace code available to other tracing applications, not just
> blktrace. Hopefully it would be something that you'd be able to use
> for kmemtrace as well; in that case, you'd just use the library and
> not have to worry about these details.

Sounds good. Thanks for letting me know.


Eduard

2008-06-14 16:16:21

by Pekka Enberg

Subject: Re: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.

On Fri, 13 Jun 2008 23:26:41 -0500
Tom Zanussi <[email protected]> wrote:
>> Alternatively, you could get rid of the problem by making sure CPU0
>> never reads CPU1's data, by having the userspace reader use per-cpu
>> threads and using sched_setaffinity() to pin each thread to a given
>> cpu. See for example, the blktrace code, which does this.

On Sat, Jun 14, 2008 at 6:11 PM, Eduard - Gabriel Munteanu
<[email protected]> wrote:
> Yes, and performance-wise this is better. Though I'm not sure setting
> affinity is 100% safe. Will the thread be migrated soon enough, so we
> don't read cross-CPU? The point is I'm not sure how hard this is
> enforced.
>
> However, I suggest this patch should go in, for two reasons:
> 1. It provides expected behavior in any such situation.
> 2. It adds (almost) no overhead when used in conjunction with setting CPU
> affinity. When the writer acquires the spinlock, it does not busy-wait,
> so the spinlock just disables IRQs (relay_write()).

Agreed. Tom, any objections to merging this patch?

2008-06-16 05:38:22

by Tom Zanussi

Subject: Re: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.


On Sat, 2008-06-14 at 19:16 +0300, Pekka Enberg wrote:
> On Fri, 13 Jun 2008 23:26:41 -0500
> Tom Zanussi <[email protected]> wrote:
> >> Alternatively, you could get rid of the problem by making sure CPU0
> >> never reads CPU1's data, by having the userspace reader use per-cpu
> >> threads and using sched_setaffinity() to pin each thread to a given
> >> cpu. See for example, the blktrace code, which does this.
>
> On Sat, Jun 14, 2008 at 6:11 PM, Eduard - Gabriel Munteanu
> <[email protected]> wrote:
> > Yes, and performance-wise this is better. Though I'm not sure setting
> > affinity is 100% safe. Will the thread be migrated soon enough, so we
> > don't read cross-CPU? The point is I'm not sure how hard this is
> > enforced.
> >
> > However, I suggest this patch should go in, for two reasons:
> > 1. It provides expected behavior in any such situation.
> > 2. It adds (almost) no overhead when used in conjunction with setting CPU
> > affinity. When the writer acquires the spinlock, it does not busy-wait,
> > so the spinlock just disables IRQs (relay_write()).
>
> Agreed. Tom, any objections to merging this patch?

Well, I suppose anyone who had a problem with it would make their own
local copy of relay_write() without it, or more likely they'd be using
relay_reserve() anyway. It does add some code to read() which can't be
gotten around, but since it adds (almost) no overhead, it shouldn't be a
problem.

So I guess I don't have strong feelings about it if it's solving a real
usability problem and doesn't cause any performance (or other)
regressions.

Tom

2008-06-16 06:19:21

by Pekka Enberg

Subject: Re: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.

Hi Tom,

On Mon, Jun 16, 2008 at 8:38 AM, Tom Zanussi <[email protected]> wrote:
> So I guess I don't have strong feelings about it if it's solving a real
> usability problem and doesn't cause any performance (or other)
> regressions.

Is there some standard performance test for relayfs that we could use
to verify this?

2008-06-16 12:28:03

by Mathieu Desnoyers

Subject: Re: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.

* Eduard - Gabriel Munteanu ([email protected]) wrote:
> On Fri, 13 Jun 2008 23:26:41 -0500
> Tom Zanussi <[email protected]> wrote:
>
> > Alternatively, you could get rid of the problem by making sure CPU0
> > never reads CPU1's data, by having the userspace reader use per-cpu
> > threads and using sched_setaffinity() to pin each thread to a given
> > cpu. See for example, the blktrace code, which does this.
>
> Yes, and performance-wise this is better. Though I'm not sure setting
> affinity is 100% safe. Will the thread be migrated soon enough, so we
> don't read cross-CPU? The point is I'm not sure how hard this is
> enforced.
>
> However, I suggest this patch should go in, for two reasons:
> 1. It provides expected behavior in any such situation.
> 2. It adds (almost) no overhead when used in conjunction with setting CPU
> affinity. When the writer acquires the spinlock, it does not busy-wait,
> so the spinlock just disables IRQs (relay_write()).
>

Hi Eduard,

Two objections against this. First, taking a spinlock _is_ slow in SMP
because it involves synchronized atomic operations.

Second, adding a spinlock to the relay write side is bad since it
opens the door to a deadly embrace between a trap handler and normal
kernel code both running tracing code.

Unless really-really needed, which does not seem to be the case, I would
strongly recommend not merging this patch.

Mathieu


> > Actually, in a few days or so I'm planning on releasing the first cut
> > of a library that makes this and all the rest of the nice blktrace
> > userspace code available to other tracing applications, not just
> > blktrace. Hopefully it would be something that you'd be able to use
> > for kmemtrace as well; in that case, you'd just use the library and
> > not have to worry about these details.
>
> Sounds good. Thanks for letting me know.
>
>
> Eduard
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

Subject: Re: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.

On Mon, 16 Jun 2008 08:22:49 -0400
Mathieu Desnoyers <[email protected]> wrote:

> Hi Eduard,

Hi.

> Two objections against this. First, taking a spinlock _is_ slow in SMP
> because it involves synchronized atomic operations.

In any case, if we use relay in a hot path, we are doing debugging, so
a couple of atomic operations won't be a big deal. Along with setting
affinity, this shouldn't be a problem.

> Second, Adding a spinlock to the relay write side is bad since it
> opens the door to deadly embrace between a trap handler and normal
> kernel code both running tracing code.

We disable IRQs. However, interrupts which cannot be disabled (AFAIK
this includes SMM, traps, probably NMIs (but they can be disabled at
least on x86)) either don't use kernel code (SMM) or run through safe
paths. But this is the case with all spinlocks used in interrupt context.

The only other option would be to try the lock and let relay_write fail
sometimes.
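The try-lock option mentioned above could look roughly like this, in kernel style (a hypothetical sketch, not part of the patch; relay_try_write() and its drop-on-contention semantics are invented here for illustration):

```c
/* Hypothetical variant: instead of spinning, drop the event when the
 * buffer is locked by a cross-CPU reader. Returns 0 if the event was
 * dropped, nonzero on success. */
static inline int relay_try_write(struct rchan *chan,
				  const void *data, size_t length)
{
	unsigned long flags;
	struct rchan_buf *buf;
	int ret = 0;

	buf = chan->buf[get_cpu()];
	if (spin_trylock_irqsave(&buf->rw_lock, flags)) {
		if (unlikely(buf->offset + length >= chan->subbuf_size))
			length = relay_switch_subbuf(buf, length);
		memcpy(buf->data + buf->offset, data, length);
		buf->offset += length;
		spin_unlock_irqrestore(&buf->rw_lock, flags);
		ret = 1;
	}
	put_cpu();
	return ret;
}
```

The caller would then have to account for lost events, which is why the thread treats this only as a fallback idea.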

> Unless really-really needed, which does not seem to be the case, I
> would strongly recommend not merging this patch.

My question was if affinity setting offers any guarantees. If it does,
then maybe we could do without this patch.

BTW, I'm not saying affinity setting doesn't help. It helps
performance-wise and I'll use it, but it should also offer correctness.

> Mathieu

Cheers,
Eduard

2008-06-16 16:46:22

by Jens Axboe

Subject: Re: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.

On Mon, Jun 16 2008, Eduard - Gabriel Munteanu wrote:
> On Mon, 16 Jun 2008 08:22:49 -0400
> Mathieu Desnoyers <[email protected]> wrote:
>
> > Hi Eduard,
>
> Hi.
>
> > Two objections against this. First, taking a spinlock _is_ slow in SMP
> > because it involves synchronized atomic operations.
>
> In any case, if we use relay in a hot path, we are doing debugging, so
> a couple of atomic operations won't be a big problem. Along with
> setting affinity, this shouldn't be a problem.

Ehm no, that is a completely false claim. Relay speed does matter, a
great deal. We rely heavily on e.g. blktrace being as lightweight as
possible to capture millions of events in a very short time frame
without impacting performance too much. Relay is NOT just for debugging.

So I completely agree with Mathieu here, and I'm not a big fan of the
proposed solution.

--
Jens Axboe

2008-06-16 18:18:20

by Pekka Enberg

Subject: Re: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.

Hi Jens,

On Mon, Jun 16 2008, Eduard - Gabriel Munteanu wrote:
>> In any case, if we use relay in a hot path, we are doing debugging, so
>> a couple of atomic operations won't be a big problem. Along with
>> setting affinity, this shouldn't be a problem.

On Mon, Jun 16, 2008 at 7:46 PM, Jens Axboe <[email protected]> wrote:
> Ehm no, that is a completely false claim. Relay speed does matter, a
> great deal. We rely heavily on e.g. blktrace being as lightweight as
> possible to capture millions of events in a very short time frame
> without impacting performance too much. Relay is NOT just for debugging.
>
> So I completely agree with Mathieu here, and I'm not a big fan of the
> proposed solution.

OK, so we just document the fact that you're not supposed to read
from a different CPU and be done with that? Or do you have any
alternative fix in mind?

2008-06-16 18:23:39

by Mathieu Desnoyers

Subject: Re: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.

* Pekka Enberg ([email protected]) wrote:
> Hi Jens,
>
> On Mon, Jun 16 2008, Eduard - Gabriel Munteanu wrote:
> >> In any case, if we use relay in a hot path, we are doing debugging, so
> >> a couple of atomic operations won't be a big problem. Along with
> >> setting affinity, this shouldn't be a problem.
>
> On Mon, Jun 16, 2008 at 7:46 PM, Jens Axboe <[email protected]> wrote:
> > Ehm no, that is a completely false claim. Relay speed does matter, a
> > great deal. We rely heavily on e.g. blktrace being as lightweight as
> > possible to capture millions of events in a very short time frame
> > without impacting performance too much. Relay is NOT just for debugging.
> >
> > So I completely agree with Mathieu here, and I'm not a big fan of the
> > proposed solution.
>
> OK, so we just document the fact that you're not supposed to read
> from a different CPU and be done with that? Or do you have any
> alternative fix in mind?
>

ltt-relay.c in the LTTng project implements its own buffering scheme on
top of relay buffers. It does not have such limitation. Actually, it
does not need to disable interrupts. It uses atomic counters and cmpxchg
atomic ops to manage concurrency. You might want to have a look at it.

Mathieu


--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-06-16 18:28:55

by Jens Axboe

Subject: Re: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.

On Mon, Jun 16 2008, Pekka Enberg wrote:
> Hi Jens,
>
> On Mon, Jun 16 2008, Eduard - Gabriel Munteanu wrote:
> >> In any case, if we use relay in a hot path, we are doing debugging, so
> >> a couple of atomic operations won't be a big problem. Along with
> >> setting affinity, this shouldn't be a problem.
>
> On Mon, Jun 16, 2008 at 7:46 PM, Jens Axboe <[email protected]> wrote:
> > Ehm no, that is a completely false claim. Relay speed does matter, a
> > great deal. We rely heavily on e.g. blktrace being as lightweight as
> > possible to capture millions of events in a very short time frame
> > without impacting performance too much. Relay is NOT just for debugging.
> >
> > So I completely agree with Mathieu here, and I'm not a big fan of the
> > proposed solution.
>
> OK, so we just document the fact that you're not supposed to read
> from a different CPU and be done with that? Or do you have any
> alternative fix in mind?

Hmm dunno, that is what blktrace also did, but primarily for performance
reasons. It's tricky - Tom stated that he is working on a lib to
abstract this from applications. While that is handy for telling you
what to do, it is also an annoyance that you HAVE to do it that way (it's
supposed to just be a "normal" fs, not one with funky restrictions).

So perhaps provide both versions in-kernel and let the kernel user
decide. For blktrace, we have one app and we know we can use the faster
variant since readers are affine. For more debug-style exports, or where
you don't know your consumer, use the safer variant (which should be the
default action).

--
Jens Axboe

2008-06-17 04:53:41

by Tom Zanussi

Subject: Re: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.


Hi Pekka,

On Mon, 2008-06-16 at 09:19 +0300, Pekka Enberg wrote:
> Hi Tom,
>
> On Mon, Jun 16, 2008 at 8:38 AM, Tom Zanussi <[email protected]> wrote:
> > So I guess I don't have strong feelings about it if it's solving a real
> > usability problem and doesn't cause any performance (or other)
> > regressions.
>
> Is there some standard performance test for relayfs that we could use
> to verify this?

Not that I know of, so probably showing how it would affect/not affect
read and relay_write users with some before/after comparisons would have
to do.

Tom

Subject: Re: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.

On Mon, 16 Jun 2008 20:28:44 +0200
Jens Axboe <[email protected]> wrote:

> Hmm dunno, that is what blktrace also did but primarily for
> performance reasons. It's tricky - Tom stated that he is working on a
> lib to abstract this from applications. While that is handy for
> telling you what to do, it is also an annoyance that you HAVE to do it
> that way (it's supposed to just be a "normal" fs, not with funky
> restrictions).
>
> So perhaps provide both versions in-kernel and let the kernel user
> decide. For blktrace, we have one app and we know we can use the
> faster variant since readers are affine. For more debug style exports
> or where you don't know your consumer, use the safer variant (which
> should be the default action).

This sounds good. Though short debug info can be exported through
debugfs alone, there is another use for this patch: global channels,
which currently require kernel users to write their own locking
mechanism.

So, are you fine with me patching relay _and blktrace_ code to use
faster variants named relay_write_affine() and __relay_write_affine()?
This implies having relay_write() and __relay_write() be the slower,
safer paths. Do you agree with these names, provided the functions are
documented correctly?

kmemtrace will use the affine versions and set CPU affinity anyway, but
it would be nice to have consistent behavior on relay's part.


Cheers,
Eduard

Subject: Re: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.

On Tue, 17 Jun 2008 15:39:34 +0300
Eduard - Gabriel Munteanu <[email protected]> wrote:

> kmemtrace will use the affine versions and set CPU affinity anyway,
> but it would be nice to have a consistent behavior from relay's part.
>
>
> Cheers,
> Eduard

BTW, I wrote this stuff and tested affinity behavior. It seems like
migration always occurs fast enough, even if under pressure. My machine
has two CPUs, so there are a few hardcoded values.

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <sched.h>

unsigned long get_current_cpu(FILE *stat_fp)
{
	char buf[256], *p;
	unsigned long n_delims = 0, ret;
	size_t count;

	fseek(stat_fp, 0, SEEK_SET);
	count = fread(buf, 1, 255, stat_fp);
	buf[count] = 0;

	/* The current CPU is field 39 of /proc/PID/stat. */
	p = buf;
	while (n_delims < 38)
		if (*p++ == ' ')
			n_delims++;
	sscanf(p, "%lu", &ret);

	return ret;
}

int main(void)
{
	FILE *stat_fp;
	char stat_filename[255];
	int err;
	unsigned long old_cpu, new_cpu, cpu;
	pid_t pid;
	cpu_set_t cpuset;

	pid = getpid();
	sprintf(stat_filename, "/proc/%lu/stat", (unsigned long)pid);
	stat_fp = fopen(stat_filename, "r");
	if (!stat_fp) {
		printf("Couldn't open %s\n", stat_filename);
		return 1;
	}

	old_cpu = get_current_cpu(stat_fp);
	new_cpu = (old_cpu + 1) % 2;

	printf("Started on CPU %lu, PID %lu, setting affinity to CPU %lu...\n",
	       old_cpu, (unsigned long)pid, new_cpu);

	err = sched_getaffinity(pid, sizeof(cpu_set_t), &cpuset);
	if (err == -1) {
		printf("Could not retrieve the affinity!\n");
		return 1;
	}
	printf("Previous affinity: %d %d\n",
	       CPU_ISSET(0, &cpuset), CPU_ISSET(1, &cpuset));

	CPU_ZERO(&cpuset);
	CPU_SET(new_cpu, &cpuset);
	err = sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset);
	if (err == -1) {
		printf("Could not set affinity!\n");
		return 1;
	}
	err = sched_getaffinity(pid, sizeof(cpu_set_t), &cpuset);
	if (err == -1) {
		printf("Could not retrieve the affinity!\n");
		return 1;
	}
	printf("Current affinity: %d %d\n",
	       CPU_ISSET(0, &cpuset), CPU_ISSET(1, &cpuset));

	cpu = get_current_cpu(stat_fp);
	if (cpu != new_cpu) {
		printf("Process migration did not occur immediately!\n"
		       "--- we were supposed to be on %lu, but we're on %lu\n",
		       new_cpu, cpu);
		/* return 1; */
	}

	for (;;) {
		cpu = get_current_cpu(stat_fp);
		if (cpu != new_cpu) {
			printf("Oh, we arrived on a different CPU!\n");
			/* return 1; */
		}
	}

	return 0;
}

2008-06-17 12:55:57

by Mathieu Desnoyers

Subject: Re: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.

* Eduard - Gabriel Munteanu ([email protected]) wrote:
> On Mon, 16 Jun 2008 20:28:44 +0200
> Jens Axboe <[email protected]> wrote:
>
> > Hmm dunno, that is what blktrace also did but primarily for
> > performance reasons. It's tricky - Tom stated that he is working on a
> > lib to abstract this from applications. While that is handy for
> > telling you what to do, it is also an annoyance that you HAVE to do it
> > that way (it's supposed to just be a "normal" fs, not with funky
> > restrictions).
> >
> > So perhaps provide both versions in-kernel and let the kernel user
> > decide. For blktrace, we have one app and we know we can use the
> > faster variant since readers are affine. For more debug style exports
> > or where you don't know your consumer, use the safer variant (which
> > should be the default action).
>
> This sounds good. Though short debug info can be exported through
> debugfs alone, there is another use to this patch: global channels,
> which currently require kernel users to write their own locking
> mechanism.
>
> So, are you fine with me patching relay _and blktrace_ code to use
> faster variants named relay_write_affine() and __relay_write_affine()?
> This implies having relay_write() and __relay_write() be the slower,
> safer paths. Do you agree with this names, provided the functions are
> documented correctly?
>
> kmemtrace will use the affine versions and set CPU affinity anyway, but
> it would be nice to have a consistent behavior from relay's part.
>

How about not changing the code and not providing a safe version of
relay_write, but instead documenting its use and what kind of locking
the in-kernel user must provide?

Mathieu

>
> Cheers,
> Eduard
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-06-17 13:10:49

by Mathieu Desnoyers

Subject: Re: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.

* Eduard - Gabriel Munteanu ([email protected]) wrote:
> On Tue, 17 Jun 2008 15:39:34 +0300
> Eduard - Gabriel Munteanu <[email protected]> wrote:
>
> > kmemtrace will use the affine versions and set CPU affinity anyway,
> > but it would be nice to have a consistent behavior from relay's part.
> >
> >
> > Cheers,
> > Eduard
>
> BTW, I wrote this stuff and tested affinity behavior. It seems like
> migration always occurs fast enough, even if under pressure. My machine
> has two CPUs, so there are a few hardcoded values.
>

Do you get the same results when not using a buffered read to read
/proc/PID/stat ? Also, since the internal proc handle might buffer the
information, closing the file and re-opening it after the set affinity
would not be a bad idea.

Mathieu

> #define _GNU_SOURCE
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/types.h>
> #include <unistd.h>
> #include <string.h>
> #include <sched.h>
>
> unsigned long get_current_cpu(FILE *stat_fp)
> {
> char buf[256], *p;
> unsigned long n_delims = 0, ret;
> size_t count;
>
> fseek(stat_fp, 0, SEEK_SET);
> count = fread(buf, 1, 255, stat_fp);
> buf[count] = 0;
>
> p = buf;
> while (n_delims < 38)
> if (*p++ == ' ')
> n_delims++;
> sscanf(p, "%lu", &ret);
>
> return ret;
> }
>
> int main(void)
> {
> FILE *stat_fp;
> char stat_filename[255];
> int err;
> unsigned long old_cpu, new_cpu, cpu;
> pid_t pid;
> cpu_set_t cpuset;
>
> pid = getpid();
> sprintf(stat_filename, "/proc/%lu/stat", pid);
> stat_fp = fopen(stat_filename, "r");
> if (!stat_fp) {
> printf("Couldn't open %lu", pid);
> return 1;
> }
>
> old_cpu = get_current_cpu(stat_fp);
> new_cpu = (old_cpu + 1) % 2;
>
> printf("Started on CPU %lu, PID %lu, setting affinity to CPU %lu...\n",
> old_cpu, pid, new_cpu);
>
> err = sched_getaffinity(pid, sizeof(cpu_set_t), &cpuset);
> if (err == -1){
> printf("Could not retrieve the affinity!\n");
> return 1;
> }
> printf("Previous affinity: %d %d\n",
> CPU_ISSET(0, &cpuset), CPU_ISSET(1, &cpuset));
> CPU_ZERO(&cpuset);
> CPU_SET(new_cpu, &cpuset);
> err = sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset);
> if (err == -1) {
> printf("Could not set affinity!\n");
> return 1;
> }
> err = sched_getaffinity(pid, sizeof(cpu_set_t), &cpuset);
> if (err == -1){
> printf("Could not retrieve the affinity!\n");
> return 1;
> }
> printf("Current affinity: %d %d\n",
> CPU_ISSET(0, &cpuset), CPU_ISSET(1, &cpuset));
>
> cpu = get_current_cpu(stat_fp);
> if (cpu != new_cpu) {
> printf("Process migration did not occur immediately!\n"
> "--- we were supposed to be on %lu, but we're on %lu\n",
> new_cpu, cpu);
> /* return 1; */
> }
>
> for (;;) {
> cpu = get_current_cpu(stat_fp);
> if (cpu != new_cpu) {
> printf("Oh, we arrived on a different CPU!\n");
> /* return 1; */
> }
> }
>
> return 0;
> }
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

Subject: Re: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.

On Tue, 17 Jun 2008 08:55:44 -0400
Mathieu Desnoyers <[email protected]> wrote:

> How about not changing the code, not providing a safe version of
> relay_write, but document its use and what kind of locking the
> in-kernel user must provide ?
>
> Mathieu

I would have agreed with you if this were about some other kernel
function. But having the user provide locking means more than just
wrapping writes in spinlocks. Essentially, one also needs to duplicate
relay_file_read() and relay_file_read_subbufs(), and thus has to know
relay's internals.

When I do this, I'll split my changes into two parts, with the first
part just documenting things as you said. Anyway, the non-affine
versions are low priority for me right now.

BTW, did you receive or take a look at the buffer-only channels patch
series, the one with CPU hotplug support? Sorry for being impatient,
but that early initcall is really core stuff and I'd like a review. :-)


Eduard

Subject: Re: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.

On Tue, 17 Jun 2008 09:10:36 -0400
Mathieu Desnoyers <[email protected]> wrote:

> Do you get the same results when not using a buffered read to read
> /proc/PID/stat ? Also, since the internal proc handle might buffer the
> information, closing the file and re-opening it after the set affinity
> would not be a bad idea.
>
> Mathieu

Tested your suggestions now, looks fine. 5 instances end up on CPU 1,
one on CPU 0, no errors printed by the application.


Eduard

2008-06-17 13:50:44

by Mathieu Desnoyers

Subject: Re: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.

* Eduard - Gabriel Munteanu ([email protected]) wrote:
> On Tue, 17 Jun 2008 09:10:36 -0400
> Mathieu Desnoyers <[email protected]> wrote:
>
> > Do you get the same results when not using a buffered read to read
> > /proc/PID/stat ? Also, since the internal proc handle might buffer the
> > information, closing the file and re-opening it after the set affinity
> > would not be a bad idea.
> >
> > Mathieu
>
> Tested your suggestions now, looks fine. 5 instances end up on CPU 1,
> one on CPU 0, no errors printed by the application.
>

Yeah, although using set affinity with CPU hotplug is broken by design.

Mathieu

>
> Eduard
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

Subject: Re: [PATCH 2/3] relay: Fix race condition which occurs when reading across CPUs.

On Tue, 17 Jun 2008 09:50:31 -0400
Mathieu Desnoyers <[email protected]> wrote:

> Yeah, although using set affinity with CPU hotplug is broken by
> design.
>
> Mathieu

Never thought about this, but it does make sense. I presume this
affects relay reading as well, right? Quote from sched_setaffinity(2):
> EINVAL The affinity bit mask mask contains no processors that are
> physically on the system, or cpusetsize is smaller than the size of
> the affinity mask used by the kernel.
We have 2 situations:
1. CPUs do _not_ get reordered --- the mask is then invalid, will it be
invalidated by the kernel even if set prior to the hotplug event?
2. CPUs get reordered after hotplug events. Assume this ([CPU logical <physical>]):
CPU 0 <0>, CPU 1 <1>, CPU 2 <2> => CPU 0 <0>, CPU 1 <2>
relay filenames likely change, but the underlying fds are preserved.
If the threads get migrated, this means we end up reading cross-CPU, AFAICS.

Do we ignore this and hope no one uses relay apps while triggering CPU
hotplug events?


Eduard