LinuxLists.cc - locking rules for ->dirty

2002-09-20 14:55:12

Subject: locking rules for ->dirty_inode()

Hello,

Documentation/filesystems/Locking states that all super operations may
block, but __set_page_dirty_buffers() calls

__mark_inode_dirty()->s_op->dirty_inode()

under mapping->private_lock spin lock. This seems strange, because file
systems' ->dirty_inode() assume that they are allowed to block. For
example, ext3_dirty_inode() allocates memory in

ext3_journal_start()->journal_start()->new_handle()->...

Nikita.

2002-09-20 15:49:53

by Andrew Morton

[permalink] [raw]

Subject: Re: locking rules for ->dirty_inode()

Nikita Danilov wrote:
>
> Hello,
>
> Documentation/filesystems/Locking states that all super operations may
> block, but __set_page_dirty_buffers() calls
>
> __mark_inode_dirty()->s_op->dirty_inode()
>
> under mapping->private_lock spin lock. This seems strange, because file
> systems' ->dirty_inode() assume that they are allowed to block. For
> example, ext3_dirty_inode() allocates memory in
>
> ext3_journal_start()->journal_start()->new_handle()->...
>

OK, thanks.

mapping->private_lock is taken there to pin page->buffers()
(Can't lock the page because set_page_dirty is called under
page_table_lock, and other locks).

I'm sure we can just move the spin_unlock up to above the
TestSetPageDirty(), but I need to zenuflect for a while over
why I did it that way.

It's necessary to expose buffer-dirtiness and page-dirtiness
to the rest of the world in the correct order. If we set the
page dirty and then the buffers, there is a window in which writeback
could find the dirty page, try to write it, discover clean buffers
and mark the page clean. We would end up with a !PageDirty page,
on mapping->clean_pages, with dirty buffers. It would never be
written.

Yup. We can move that spin_unlock up ten lines.

2002-09-20 16:27:52

by Nikita Danilov

[permalink] [raw]

Subject: Re: locking rules for ->dirty_inode()

Andrew Morton writes:
> Nikita Danilov wrote:
> >
> > Hello,
> >
> > Documentation/filesystems/Locking states that all super operations may
> > block, but __set_page_dirty_buffers() calls
> >
> > __mark_inode_dirty()->s_op->dirty_inode()
> >
> > under mapping->private_lock spin lock. This seems strange, because file
> > systems' ->dirty_inode() assume that they are allowed to block. For
> > example, ext3_dirty_inode() allocates memory in
> >
> > ext3_journal_start()->journal_start()->new_handle()->...
> >
>
> OK, thanks.
>
> mapping->private_lock is taken there to pin page->buffers()
> (Can't lock the page because set_page_dirty is called under
> page_table_lock, and other locks).
>
> I'm sure we can just move the spin_unlock up to above the
> TestSetPageDirty(), but I need to zenuflect for a while over
> why I did it that way.
>
> It's necessary to expose buffer-dirtiness and page-dirtiness
> to the rest of the world in the correct order. If we set the
> page dirty and then the buffers, there is a window in which writeback
> could find the dirty page, try to write it, discover clean buffers
> and mark the page clean. We would end up with a !PageDirty page,
> on mapping->clean_pages, with dirty buffers. It would never be
> written.
>
> Yup. We can move that spin_unlock up ten lines.

Actually, I came over this while trying to describe lock ordering in
reiser4 after I just started integrating other kernel locks there. I
wonder, has somebody already done this, writing up kernel lock
hierarchy, that is?

Nikita.

2002-09-20 16:42:16

by Andrew Morton

[permalink] [raw]

Subject: Re: locking rules for ->dirty_inode()

Nikita Danilov wrote:
>
> ...
> Actually, I came over this while trying to describe lock ordering in
> reiser4 after I just started integrating other kernel locks there. I
> wonder, has somebody already done this, writing up kernel lock
> hierarchy, that is?
>

I've been keeping the comment at the top of filemap.c uptodate when
I discover things. It got smaller a while ago when certain rude
locks were spoken to.

Really, this form of representation isn't rich enough, but the
format certainly provides enough info to know when you might be
taking locks in the wrong order, and it tells you where to look
to see them being taken.

The problem with the format is that locks are only mentioned once,
and it can't describe the whole graph. Maybe it needs something
like:

* ->i_shared_lock (vmtruncate)
* ->private_lock (__free_pte->__set_page_dirty_buffers)
* ->swap_list_lock
* ->swap_device_lock (exclusive_swap_page, others)
* ->mapping->page_lock
* ->inode_lock (__mark_inode_dirty)
* ->sb_lock (fs/fs-writeback.c)
+* ->i_shared_lock
+* ->page_table_lock (lots of places)
*/

Don't know. Maybe someone somewhere has developed a notation
for this? How are you doing it?

2002-09-20 17:28:00

by Nikita Danilov

[permalink] [raw]

Subject: Re: locking rules for ->dirty_inode()

Andrew Morton writes:
> Nikita Danilov wrote:
> >
> > ...
> > Actually, I came over this while trying to describe lock ordering in
> > reiser4 after I just started integrating other kernel locks there. I
> > wonder, has somebody already done this, writing up kernel lock
> > hierarchy, that is?
> >
>
> I've been keeping the comment at the top of filemap.c uptodate when
> I discover things. It got smaller a while ago when certain rude
> locks were spoken to.
>
> Really, this form of representation isn't rich enough, but the
> format certainly provides enough info to know when you might be
> taking locks in the wrong order, and it tells you where to look
> to see them being taken.
>
> The problem with the format is that locks are only mentioned once,
> and it can't describe the whole graph. Maybe it needs something
> like:
>
>
> * ->i_shared_lock (vmtruncate)
> * ->private_lock (__free_pte->__set_page_dirty_buffers)
> * ->swap_list_lock
> * ->swap_device_lock (exclusive_swap_page, others)
> * ->mapping->page_lock
> * ->inode_lock (__mark_inode_dirty)
> * ->sb_lock (fs/fs-writeback.c)
> +* ->i_shared_lock
> +* ->page_table_lock (lots of places)
> */
>
> Don't know. Maybe someone somewhere has developed a notation
> for this? How are you doing it?

I am doing it roughly in the same way: in a similar diagram where each lock is
mentioned exactly once, locks can be acquired from the left to the
right. Locks on the same indentation level are unordered and cannot be held at
the same time. This is enough to express lock *ordering*, whenever it
exists. For example, you diagram above gives:

->i_shared_lock (vmtruncate)
->private_lock (__free_pte->__set_page_dirty_buffers)
->swap_list_lock
->swap_device_lock (exclusive_swap_page, others)
->mapping->page_lock
->inode_lock (__mark_inode_dirty)
->sb_lock (fs/fs-writeback.c)
->page_table_lock (lots of places)

And it means that neither ->private_lock and ->page_table_lock nor
->swap_list_lock and ->inode_lock cannot be held at the same time.

As you mentioned (if I understood correctly) this is insufficient to
express where locks are taken and, more generally, how the lock graph is
embedded into the call graph. But lock ordering itself is quite useful.

For example, in reiser4 we have SPIN_LOCK_FUNCTIONS() macro. If one has

typedef struct txn_handle {
...
spinlock_t guard;
...
} txn_handle;

then SPIN_LOCK_FUNCTIONS(txn_handle, guard) will generate functions like

spin_lock_txn_handle();
spin_unlock_txn_handle();
spin_trylock_txn_handle(); etc.

In debugging mode, these functions (aside from manipulating locks)
modify special per-thread counters. This allows one to specify lock
ordering constraints. Each spin_lock_foo() function checks
spin_ordering_pred_foo() macro before taking lock. This was really
helpful during early debugging.

Nikita.

2002-09-20 18:16:18

by Hans Reiser

[permalink] [raw]

Subject: Re: locking rules for ->dirty_inode()

You might ask the SGI guys what they do. I know that they have a
systematic approach to documenting lock ordering.

Hans

2002-09-20 22:36:16

by Andrew Morton

[permalink] [raw]

Subject: Re: locking rules for ->dirty_inode()

Nikita Danilov wrote:
>
> Hello,
>
> Documentation/filesystems/Locking states that all super operations may
> block, but __set_page_dirty_buffers() calls
>
> __mark_inode_dirty()->s_op->dirty_inode()
>
> under mapping->private_lock spin lock.

Actually it doesn't. We do not call down into the filesystem
for I_DIRTY_PAGES.

set_page_dirty() is already called under locks, via __free_pte (pagetable
teardown). 2.4 does this as well.

But I'll make the change anyway. I think it removes any
ranking requirements between mapping->page_lock and
mapping->private_lock, which is always a nice thing.

2002-09-23 19:11:07

by Andrew Morton

[permalink] [raw]

Subject: Re: locking rules for ->dirty_inode()

Nikita Danilov wrote:
>
> Andrew Morton writes:
> > Nikita Danilov wrote:
> > >
> > > Hello,
> > >
> > > Documentation/filesystems/Locking states that all super operations may
> > > block, but __set_page_dirty_buffers() calls
> > >
> > > __mark_inode_dirty()->s_op->dirty_inode()
> > >
> > > under mapping->private_lock spin lock.
> >
> > Actually it doesn't. We do not call down into the filesystem
> > for I_DIRTY_PAGES.
> >
> > set_page_dirty() is already called under locks, via __free_pte (pagetable
> > teardown). 2.4 does this as well.
>
> Cannot find __free_pte, it is only mentioned in comments in mm/filemap.c
> and include/asm-generic/tlb.h.
>

It got moved around. 2.4: __free_pte(), 2.5: zap_pte_range().

2002-09-23 19:11:06

by Nikita Danilov

[permalink] [raw]