2007-09-24 21:15:21

by Steven Rostedt

[permalink] [raw]
Subject: [HACK] convert i_alloc_sem for direct_io.c craziness!


Hopefully I will get some attention from those that are responsible for
fs/direct_io.c

Ingo and Thomas,

This patch converts the i_alloc_sem into a compat_rw_semaphore for the -rt
patch. Seems that the code in fs/direct_io.c does some nasty logic with
the i_alloc_sem. For DIO_LOCKING, I'm assuming that the i_alloc_sem is
used as a reference counter for pending requests. When the request is
made, the down_read is performed. When the request is handled by the block
softirq, then that softirq does an up on the request. So the owner is not
the same between down and up. When all requests are handled, the semaphore
counter should be zero. This keeps away any write access while requests
are pending.

Now this may all be well and dandy for vanilla Linux, but it breaks
miserbly when converted to -rt.

1) In RT rw_semaphores must be up'd by the same thread that down's it.

2) We can't do PI on the correct processes.

This patch converts (for now) the i_alloc_sem into a compat_rw_semaphore
to give back the old features to the sem. This fixes deadlocks that we've
been having WRT direct_io. But unfortunately, it now opens up
unbonded priority inversion with this semaphore. But really, those that
can be affected by this, shouldn't be doing disk IO anyway.

The real fix would be to get rid of the read semaphore trickery in
direct_io.c.

Signed-off-by: Steve Rostedt <[email protected]>

Index: linux-2.6.23-rc4-rt1/include/linux/fs.h
===================================================================
--- linux-2.6.23-rc4-rt1.orig/include/linux/fs.h 2007-09-24 16:58:59.000000000 -0400
+++ linux-2.6.23-rc4-rt1/include/linux/fs.h 2007-09-24 16:59:11.000000000 -0400
@@ -577,7 +577,7 @@ struct inode {
umode_t i_mode;
spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
struct mutex i_mutex;
- struct rw_semaphore i_alloc_sem;
+ struct compat_rw_semaphore i_alloc_sem;
const struct inode_operations *i_op;
const struct file_operations *i_fop; /* former ->i_op->default_file_ops */
struct super_block *i_sb;


2007-09-25 12:43:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [HACK] convert i_alloc_sem for direct_io.c craziness!

On Mon, 24 Sep 2007 17:14:26 -0400 (EDT) Steven Rostedt
<[email protected]> wrote:

>
> Hopefully I will get some attention from those that are responsible for
> fs/direct_io.c
>
> Ingo and Thomas,
>
> This patch converts the i_alloc_sem into a compat_rw_semaphore for the -rt
> patch. Seems that the code in fs/direct_io.c does some nasty logic with
> the i_alloc_sem. For DIO_LOCKING, I'm assuming that the i_alloc_sem is
> used as a reference counter for pending requests. When the request is
> made, the down_read is performed. When the request is handled by the block
> softirq, then that softirq does an up on the request. So the owner is not
> the same between down and up. When all requests are handled, the semaphore
> counter should be zero. This keeps away any write access while requests
> are pending.
>
> Now this may all be well and dandy for vanilla Linux, but it breaks
> miserbly when converted to -rt.
>
> 1) In RT rw_semaphores must be up'd by the same thread that down's it.
>
> 2) We can't do PI on the correct processes.
>
> This patch converts (for now) the i_alloc_sem into a compat_rw_semaphore
> to give back the old features to the sem. This fixes deadlocks that we've
> been having WRT direct_io. But unfortunately, it now opens up
> unbonded priority inversion with this semaphore. But really, those that
> can be affected by this, shouldn't be doing disk IO anyway.
>
> The real fix would be to get rid of the read semaphore trickery in
> direct_io.c.

How about teaching {up,down}_read_non_owner() to barf on rw_semaphore
in -rt?

> Signed-off-by: Steve Rostedt <[email protected]>
>
> Index: linux-2.6.23-rc4-rt1/include/linux/fs.h
> ===================================================================
> --- linux-2.6.23-rc4-rt1.orig/include/linux/fs.h 2007-09-24 16:58:59.000000000 -0400
> +++ linux-2.6.23-rc4-rt1/include/linux/fs.h 2007-09-24 16:59:11.000000000 -0400
> @@ -577,7 +577,7 @@ struct inode {
> umode_t i_mode;
> spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
> struct mutex i_mutex;
> - struct rw_semaphore i_alloc_sem;
> + struct compat_rw_semaphore i_alloc_sem;
> const struct inode_operations *i_op;
> const struct file_operations *i_fop; /* former ->i_op->default_file_ops */
> struct super_block *i_sb;
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2007-09-25 15:30:44

by Steven Rostedt

[permalink] [raw]
Subject: [PATCH RT] Don't let -rt rw_semaphors do _non_owner locks


--
On Tue, 25 Sep 2007, Peter Zijlstra wrote:

> How about teaching {up,down}_read_non_owner() to barf on rw_semaphore
> in -rt?
>

Sure thing!

This patch prevents rw_semaphore in PREEMPT_RT from performing
down_read_non_owner and up_read_non_owner. If this must be used, then
either convert to a completion or use compat_rw_semaphore.

Signed-off-by: Steven Rostedt <[email protected]>

Index: linux-2.6.23-rc4-rt1/include/linux/rt_lock.h
===================================================================
--- linux-2.6.23-rc4-rt1.orig/include/linux/rt_lock.h 2007-09-25 09:38:56.000000000 -0400
+++ linux-2.6.23-rc4-rt1/include/linux/rt_lock.h 2007-09-25 09:42:19.000000000 -0400
@@ -248,25 +248,20 @@ do { \
__rt_rwsem_init((sem), #sem, &__key); \
} while (0)

+extern void __dont_do_this_in_rt(struct rw_semaphore *rwsem);
+
+#define rt_down_read_non_owner(rwsem) __dont_do_this_in_rt(rwsem)
+#define rt_up_read_non_owner(rwsem) __dont_do_this_in_rt(rwsem)
+
extern void fastcall rt_down_write(struct rw_semaphore *rwsem);
extern void fastcall
rt_down_read_nested(struct rw_semaphore *rwsem, int subclass);
extern void fastcall
rt_down_write_nested(struct rw_semaphore *rwsem, int subclass);
extern void fastcall rt_down_read(struct rw_semaphore *rwsem);
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-extern void fastcall rt_down_read_non_owner(struct rw_semaphore *rwsem);
-#else
-# define rt_down_read_non_owner(rwsem) rt_down_read(rwsem)
-#endif
extern int fastcall rt_down_write_trylock(struct rw_semaphore *rwsem);
extern int fastcall rt_down_read_trylock(struct rw_semaphore *rwsem);
extern void fastcall rt_up_read(struct rw_semaphore *rwsem);
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-extern void fastcall rt_up_read_non_owner(struct rw_semaphore *rwsem);
-#else
-# define rt_up_read_non_owner(rwsem) rt_up_read(rwsem)
-#endif
extern void fastcall rt_up_write(struct rw_semaphore *rwsem);
extern void fastcall rt_downgrade_write(struct rw_semaphore *rwsem);

Index: linux-2.6.23-rc4-rt1/kernel/rt.c
===================================================================
--- linux-2.6.23-rc4-rt1.orig/kernel/rt.c 2007-09-25 09:38:56.000000000 -0400
+++ linux-2.6.23-rc4-rt1/kernel/rt.c 2007-09-25 09:41:50.000000000 -0400
@@ -324,25 +324,6 @@ void fastcall rt_up_read(struct rw_semap
}
EXPORT_SYMBOL(rt_up_read);

-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-void fastcall rt_up_read_non_owner(struct rw_semaphore *rwsem)
-{
- unsigned long flags;
- /*
- * Read locks within the self-held write lock succeed.
- */
- spin_lock_irqsave(&rwsem->lock.wait_lock, flags);
- if (rt_mutex_real_owner(&rwsem->lock) == current && rwsem->read_depth) {
- spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags);
- rwsem->read_depth--;
- return;
- }
- spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags);
- rt_mutex_unlock(&rwsem->lock);
-}
-EXPORT_SYMBOL(rt_up_read_non_owner);
-#endif
-
/*
* downgrade a write lock into a read lock
* - just wake up any readers at the front of the queue
@@ -433,32 +414,6 @@ void fastcall rt_down_read_nested(struct
}
EXPORT_SYMBOL(rt_down_read_nested);

-
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-
-/*
- * Same as rt_down_read() but no lockdep calls:
- */
-void fastcall rt_down_read_non_owner(struct rw_semaphore *rwsem)
-{
- unsigned long flags;
- /*
- * Read locks within the write lock succeed.
- */
- spin_lock_irqsave(&rwsem->lock.wait_lock, flags);
-
- if (rt_mutex_real_owner(&rwsem->lock) == current) {
- spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags);
- rwsem->read_depth++;
- return;
- }
- spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags);
- rt_mutex_lock(&rwsem->lock);
-}
-EXPORT_SYMBOL(rt_down_read_non_owner);
-
-#endif
-
void fastcall __rt_rwsem_init(struct rw_semaphore *rwsem, char *name,
struct lock_class_key *key)
{

2007-10-01 19:53:26

by Zach Brown

[permalink] [raw]
Subject: Re: [HACK] convert i_alloc_sem for direct_io.c craziness!

> Hopefully I will get some attention from those that are responsible for
> fs/direct_io.c

[ many days later, I find this amongst the lkml noise. ]

> This patch converts the i_alloc_sem into a compat_rw_semaphore for the -rt
> patch. Seems that the code in fs/direct_io.c does some nasty logic with
> the i_alloc_sem. For DIO_LOCKING, I'm assuming that the i_alloc_sem is
> used as a reference counter for pending requests. When the request is
> made, the down_read is performed. When the request is handled by the block
> softirq, then that softirq does an up on the request. So the owner is not
> the same between down and up. When all requests are handled, the semaphore
> counter should be zero. This keeps away any write access while requests
> are pending.

Yeah. This is used for performing concurrent asynchronous O_DIRECT
operations.

> This patch converts (for now) the i_alloc_sem into a compat_rw_semaphore
> to give back the old features to the sem. This fixes deadlocks that we've
> been having WRT direct_io.

*nod*

> The real fix would be to get rid of the read semaphore trickery in
> direct_io.c.

Do you have any suggestions for locking constructs that RT would prefer?

The core problem is that async IO is in flight while no process holds
the usual locks in system calls. We don't want the blocks referenced by
IOs to be freed and realocated some where else while the IO is in
flight. Hence the i_alloc_sem acquiry in the file block modification
paths: vmtruncate - free, notify_change - a proxy for allocating writes.

The agents are:

- many tasks issuing concurrent async IO and exiting from system calls
while the IO is still in flight

- operations completed in interrupt handlers from storage devices

- tasks changing file block mapping via system calls

There's some long-term work to integrate the locking between the
buffered and O_DIRECT paths, but it's not close to ready.

- z

2007-10-01 20:44:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [HACK] convert i_alloc_sem for direct_io.c craziness!


On Mon, 2007-10-01 at 12:52 -0700, Zach Brown wrote:

> Do you have any suggestions for locking constructs that RT would prefer?

Basically, anything that maps to a simple mutex. Anything more complex
gets real messy real quick.

Locks that have non-exclusive states become non-deterministic because an
unbounded number of contexts can be in this state. Hence acquisition of
the exclusive state has unbounded time. Even when limited to a bounded
number, the ramifications to the PI graph will get you a head-ache.

Also, non-owner locks, ie. semaphores (asymetric acquisition vs release
contexts) are unusable because the lack of ownership undermines PI - who
to boost?


2007-10-02 18:03:56

by Zach Brown

[permalink] [raw]
Subject: Re: [HACK] convert i_alloc_sem for direct_io.c craziness!


On Oct 1, 2007, at 1:39 PM, Peter Zijlstra wrote:

>
> On Mon, 2007-10-01 at 12:52 -0700, Zach Brown wrote:
>
>> Do you have any suggestions for locking constructs that RT would
>> prefer?
>
> Basically, anything that maps to a simple mutex. Anything more complex
> gets real messy real quick.

I'm worried that the aio+dio implementation of concurrent pending IOs
just doesn't map well to PI.

Would a hack with a mutex and counts help at all? It seems like it
would still have the same problem. The count increments don't
transfer ownership to the count decrements and the wake up.

io submission from tasks:
down(&inode->i_mutex);
atomic_inc(&inode->in_flight);
up(&inode->i_mutex);

io completion from interrupts:
if(atomic_dec_and_test(&inode->in_flight))
wake_up(&inode->waiting);

file allocation in tasks:
down(&inode->i_mutex);
wait_event(inode->waiting, atomic_read(&inode->in_flight) == 0);
up(&inode->i_mutex);

(yeah, yeah, starvation -- it's just a demonstration)

In any case, this seems like it's not a very high priority now that
RT has Steven's work-around. If it does become a priority can you
guys let linux-fsdevel know?

- z

2007-10-02 18:12:56

by Steven Rostedt

[permalink] [raw]
Subject: Re: [HACK] convert i_alloc_sem for direct_io.c craziness!


Hi Zach!

Thanks for the responses.

--
On Tue, 2 Oct 2007, Zach Brown wrote:
>
> On Oct 1, 2007, at 1:39 PM, Peter Zijlstra wrote:
>
> >
> > On Mon, 2007-10-01 at 12:52 -0700, Zach Brown wrote:
> >
> >> Do you have any suggestions for locking constructs that RT would
> >> prefer?
> >
> > Basically, anything that maps to a simple mutex. Anything more complex
> > gets real messy real quick.
>
> I'm worried that the aio+dio implementation of concurrent pending IOs
> just doesn't map well to PI.
>
> Would a hack with a mutex and counts help at all? It seems like it
> would still have the same problem. The count increments don't
> transfer ownership to the count decrements and the wake up.
>
> io submission from tasks:
> down(&inode->i_mutex);
> atomic_inc(&inode->in_flight);
> up(&inode->i_mutex);
>
> io completion from interrupts:
> if(atomic_dec_and_test(&inode->in_flight))
> wake_up(&inode->waiting);
>
> file allocation in tasks:
> down(&inode->i_mutex);
> wait_event(inode->waiting, atomic_read(&inode->in_flight) == 0);
> up(&inode->i_mutex);
>
> (yeah, yeah, starvation -- it's just a demonstration)

This looks more like a completion. Actually, completions don't have PI
either, but they are usually OK.

>
> In any case, this seems like it's not a very high priority now that
> RT has Steven's work-around. If it does become a priority can you
> guys let linux-fsdevel know?

Actually, it may still be a high priority. We have one BUG currently that
seems to trigger starvation. But we don't know yet if it's a bug in the
test or in the FS code yet. I don't know the full details yet, but we do
have someone currently looking at it. These tests are what found this
issue in the first place.

When more info becomes available, I will definitely CC the linux-fsdevel
list. Thanks for letting me know about it. (I usually get confused by all
the different lists that are out there).

-- Steve

2007-10-02 18:19:04

by Zach Brown

[permalink] [raw]
Subject: Re: [HACK] convert i_alloc_sem for direct_io.c craziness!

> This looks more like a completion. Actually, completions don't have PI
> either, but they are usually OK.

Yeah, true.

> When more info becomes available, I will definitely CC the linux-
> fsdevel
> list. Thanks for letting me know about it. (I usually get confused
> by all
> the different lists that are out there).

OK, thanks. (Yeah, it's not a great situation.)

- z