2006-10-25 20:56:36

by lkml

Subject: rename() contention (BUG?)

Hello,

I have seen some scalability problems with Maildir-based mail systems running on
Linux (debian sarge, 2.6.8), and after much investigation I found a large part
of the problem was rename() contention.

Under periods of high concurrency (multi-process or multi-threaded parallel POP
or IMAP servers), the server load would begin to skyrocket with most of the
processes in D state.

Though the WCHAN of the processes was always "-" during these delays, I found via
strace that the delays were occurring in the rename() system call.
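
The time spent in rename() can also be measured directly from a small test
program. The following is a minimal sketch, not code from the servers discussed
here: timed_rename() is a hypothetical helper name, and it assumes a POSIX
system that provides clock_gettime() with CLOCK_MONOTONIC.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper: time a single rename() call.
 * Returns the elapsed wall-clock time in seconds, or -1.0 if the
 * clock could not be read or rename() failed.
 */
double timed_rename(const char *from, const char *to)
{
	struct timespec start, end;

	if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
		return -1.0;
	if (rename(from, to) != 0)
		return -1.0;
	if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
		return -1.0;

	return (double)(end.tv_sec - start.tv_sec) +
	       (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Running this from many processes at once, first with renames inside a single
directory and then with renames between two directories on the same filesystem,
should show whether the elapsed time grows with concurrency.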

However, the storage was not being stressed: it was an EMC NAS accessed over
NFS with plenty of spindles to go around, and the NAS head had little load on
it.

After looking through the kernel sources I found in the VFS layer (fs/namei.c):

/*
 * p1 and p2 should be directories on the same fs.
 */
struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
{
	struct dentry *p;

	if (p1 == p2) {
		down(&p1->d_inode->i_sem);
		return NULL;
	}

	down(&p1->d_inode->i_sb->s_vfs_rename_sem);

	for (p = p1; p->d_parent != p; p = p->d_parent) {
		if (p->d_parent == p2) {
			down(&p2->d_inode->i_sem);
			down(&p1->d_inode->i_sem);
			return p;
		}
	}

	for (p = p2; p->d_parent != p; p = p->d_parent) {
		if (p->d_parent == p1) {
			down(&p1->d_inode->i_sem);
			down(&p2->d_inode->i_sem);
			return p;
		}
	}

	down(&p1->d_inode->i_sem);
	down(&p2->d_inode->i_sem);
	return NULL;
}


void unlock_rename(struct dentry *p1, struct dentry *p2)
{
	up(&p1->d_inode->i_sem);
	if (p1 != p2) {
		up(&p2->d_inode->i_sem);
		up(&p1->d_inode->i_sb->s_vfs_rename_sem);
	}
}


I also found this documented in Documentation/filesystems/directory-locking, and
the problem became clear. Though these servers were implemented using parallel
programming techniques, they were being serialized at the cross-directory
renames occurring on the same filesystem.

I was able to eliminate the symptoms by splitting what was one large
"filesystem" into 256 filesystems, which worked well enough for me not to care
so much about it. This was done by simply mounting the subdirectories
explicitly; no changes were made to the NAS. (Mail was laid out as something
like /mail/[0-9a-f]/[0-9a-f]/[0-9a-f]/Maildir already, so it was trivial to add
more mounts.)

However, after thinking more about this rename locking scheme, I must ask you
guys & gals:

What exactly is the purpose of the s_vfs_rename_sem?
Must that lock be held for the duration of the rename operation?
Couldn't we release it after acquiring the locks on the relevant
directories?

To me, it looks like the s_vfs_rename_sem is for making the acquisition
of the multiple directory locks atomic to prevent deadlock. But once you hold
the locks, couldn't you release the s_vfs_rename_sem?

For example:


/*
 * p1 and p2 should be directories on the same fs.
 */
struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
{
	struct dentry *p;

	if (p1 == p2) {
		down(&p1->d_inode->i_sem);
		return NULL;
	}

	down(&p1->d_inode->i_sb->s_vfs_rename_sem);

	for (p = p1; p->d_parent != p; p = p->d_parent) {
		if (p->d_parent == p2) {
			down(&p2->d_inode->i_sem);
			down(&p1->d_inode->i_sem);
			up(&p1->d_inode->i_sb->s_vfs_rename_sem);
			return p;
		}
	}

	for (p = p2; p->d_parent != p; p = p->d_parent) {
		if (p->d_parent == p1) {
			down(&p1->d_inode->i_sem);
			down(&p2->d_inode->i_sem);
			up(&p1->d_inode->i_sb->s_vfs_rename_sem);
			return p;
		}
	}

	down(&p1->d_inode->i_sem);
	down(&p2->d_inode->i_sem);
	up(&p1->d_inode->i_sb->s_vfs_rename_sem);

	return NULL;
}


void unlock_rename(struct dentry *p1, struct dentry *p2)
{
	up(&p1->d_inode->i_sem);
	if (p1 != p2) {
		up(&p2->d_inode->i_sem);
	}
}



If we could avoid holding the lock during the (potentially lengthy, depending on
the filesystem & storage) rename operation, it would especially help these
Maildir-based mail servers.

So, am I totally off my rocker here?

Regards,
Vito Caputo


2006-10-25 21:13:52

by Josef Sipek

Subject: Re: rename() contention (BUG?)

On Wed, Oct 25, 2006 at 03:56:34PM -0500, [email protected] wrote:
> Hello,
>
> I have seen some scalability problems with Maildir-based mail systems running on
> Linux (debian sarge, 2.6.8), and after much investigation I found a large part
> of the problem was rename() contention.

Just FYI, around 2.6.15 the i_sem was replaced with i_mutex, which is much
better/faster. (And I would strongly suggest you upgrade.) I would actually
like to know how much the lock contention decreased for your case.

(I'm going to read the rest when I get back.)

Josef "Jeff" Sipek.

--
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like
that.
- Linus Torvalds

2006-10-26 18:43:40

by Avi Kivity

Subject: Re: rename() contention (BUG?)

Josef Sipek wrote:
> On Wed, Oct 25, 2006 at 03:56:34PM -0500, [email protected] wrote:
>
>> Hello,
>>
>> I have seen some scalability problems with Maildir-based mail systems running on
>> Linux (debian sarge, 2.6.8), and after much investigation I found a large part
>> of the problem was rename() contention.
>>
>
> Just FYI, around 2.6.15 the i_sem was replaced with i_mutex, which is much
> better/faster. (And I would strongly suggest you upgrade.) I would actually
> like to know how much the lock contention decreased for your case.
>

The changes make the mutex more efficient, but won't decrease the
contention. It seems that all renames in one filesystem are serialized,
and if the renames require I/O (which is certainly the case with nfs),
rename throughput is severely limited.
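
The limit can be put in rough numbers. The following is a back-of-the-envelope
sketch under stated assumptions, not a measurement: max_renames_per_sec() is a
hypothetical helper, the per-filesystem rename lock is assumed to be the only
bottleneck, and renames are assumed to be spread evenly across filesystems.

```c
/*
 * Hypothetical helper: upper bound on cross-directory renames per second
 * when every rename holds one per-filesystem lock for hold_seconds and the
 * load is spread evenly over `filesystems` independent filesystems.
 * Returns 0.0 for a non-positive hold time.
 */
double max_renames_per_sec(double hold_seconds, unsigned int filesystems)
{
	if (hold_seconds <= 0.0)
		return 0.0;

	return (double)filesystems / hold_seconds;
}
```

With an assumed hold time of 5 ms per rename (an illustrative figure, not one
reported in this thread), the bound is about 200 renames per second on a single
filesystem no matter how many processes are running, and about 51,200 per
second when the same tree is mounted as 256 filesystems.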


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-26 19:22:28

by Al Viro

Subject: Re: rename() contention (BUG?)

On Thu, Oct 26, 2006 at 08:43:34PM +0200, Avi Kivity wrote:
> The changes make the mutex more efficient, but won't decrease the
> contention. It seems that all renames in one filesystem are serialized,
> and if the renames require I/O (which is certainly the case with nfs),
> rename throughput is severely limited.

They are, and for a good reason. For details see
Documentation/filesystems/directory-locking.

2006-10-26 21:20:06

by Avi Kivity

Subject: Re: rename() contention (BUG?)

Al Viro wrote:
> On Thu, Oct 26, 2006 at 08:43:34PM +0200, Avi Kivity wrote:
>
>> The changes make the mutex more efficient, but won't decrease the
>> contention. It seems that all renames in one filesystem are serialized,
>> and if the renames require I/O (which is certainly the case with nfs),
>> rename throughput is severely limited.
>>
>
> They are, and for a good reason. For details see
> Documentation/filesystems/directory-locking.
>

Is it possible to lock only the common subtree of the two paths?

Perhaps walk towards the root of the tree, starting with the deeper
path, locking one component at a time. Then walk both paths together
locking components ordered by something to avoid deadlock.
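
The "ordered by something" part can be illustrated in userspace. The following
is a minimal sketch using POSIX threads, not kernel code: lock_pair() and
unlock_pair() are hypothetical names, and ordering by address is just one
possible choice of order. It shows only the ordering idea; it does not address
whether the directory tree can change shape while the locks are held.

```c
#include <pthread.h>
#include <stdint.h>

/*
 * Hypothetical helper: take two mutexes in a fixed global order (lowest
 * address first), so that two threads locking the same pair in opposite
 * argument order cannot deadlock against each other.
 */
void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
	if (a == b) {
		pthread_mutex_lock(a);
		return;
	}

	if ((uintptr_t)a > (uintptr_t)b) {
		pthread_mutex_t *tmp = a;

		a = b;
		b = tmp;
	}

	pthread_mutex_lock(a);
	pthread_mutex_lock(b);
}

/* Release what lock_pair() took; unlock order does not matter. */
void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
	pthread_mutex_unlock(a);
	if (a != b)
		pthread_mutex_unlock(b);
}
```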

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-10-26 21:37:52

by Al Viro

Subject: Re: rename() contention (BUG?)

On Thu, Oct 26, 2006 at 11:19:56PM +0200, Avi Kivity wrote:
> Al Viro wrote:
> >On Thu, Oct 26, 2006 at 08:43:34PM +0200, Avi Kivity wrote:
> >
> >>The changes make the mutex more efficient, but won't decrease the
> >>contention. It seems that all renames in one filesystem are serialized,
> >>and if the renames require I/O (which is certainly the case with nfs),
> >>rename throughput is severely limited.
> >>
> >
> > They are, and for a good reason. For details see
> >Documentation/filesystems/directory-locking.
> >
>
> Is it possible to lock only the common subtree of the two paths?
>
> Perhaps walk towards the root of the tree, starting with the deeper
> path, locking one component at a time. Then walk both paths together
> locking components ordered by something to avoid deadlock.

Please, read the file mentioned above.