Date: Wed, 25 Oct 2006 15:56:34 -0500
From: lkml@pengaru.com
To: linux-kernel@vger.kernel.org
Subject: rename() contention (BUG?)
Message-ID: <20061025205634.GB9100@shells.gnugeneration.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.9i
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4851
Lines: 155

Hello,

I have seen some scalability problems with Maildir-based mail systems running on
Linux (debian sarge, 2.6.8), and after much investigation I found a large part
of the problem was rename() contention.

Under periods of high concurrency (multi-process or multi-threaded parallel POP
or IMAP servers), the server load would begin to skyrocket with most of the
processes in D state.  

Though the WCHAN of the processes was always "-" in these delays, I found via
strace that the delays were occuring in the rename() system call.

However, the storage was not being stressed, it was an EMC NAS being accessed
over NFS with plenty of spindles to go around and the NAS head had little load
on it.

After looking through the kernel sources I found in the VFS layer (fs/namei.c):

/*
 * p1 and p2 should be directories on the same fs.
 */
struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
{
        struct dentry *p;

        if (p1 == p2) {
                down(&p1->d_inode->i_sem);
                return NULL;
        }

        down(&p1->d_inode->i_sb->s_vfs_rename_sem);

        for (p = p1; p->d_parent != p; p = p->d_parent) {
                if (p->d_parent == p2) {
                        down(&p2->d_inode->i_sem);
                        down(&p1->d_inode->i_sem);
                        return p;
                }
        }

        for (p = p2; p->d_parent != p; p = p->d_parent) {
                if (p->d_parent == p1) {
                        down(&p1->d_inode->i_sem);
                        down(&p2->d_inode->i_sem);
                        return p;
                }
        }

        down(&p1->d_inode->i_sem);
        down(&p2->d_inode->i_sem);
        return NULL;
}


void unlock_rename(struct dentry *p1, struct dentry *p2)
{
        up(&p1->d_inode->i_sem);
        if (p1 != p2) {
                up(&p2->d_inode->i_sem);
                up(&p1->d_inode->i_sb->s_vfs_rename_sem);
        }
}
 

I also found this documented in Documentation/filesystems/directory-locking, the
problem became clear.  Though these servers were implemented using parallel
programming techniques, they were being serialized at the cross-directory
renames occuring on the same filesystem.

I was able to eliminate the symptoms by splitting what was one large
"filesystem" into 256 filesystems, which worked well enough for me not to care
so much about it.  This was done by simply mounting the subdirectories
explicitly, no changes were done to the NAS.  (mail was layed out as something
like /mail/[0-9a-f]/[0-9a-f]/[0-9a-f]/Maildir already, so it was trivial to add
more mounts).

However, after thinking more about this rename locking scheme, I must ask you
guys & gals:

What exactly is the purpose of the s_vfs_rename_sem?
Must that lock be held for the duration of the rename operation?  
Couldnt we release it after acquiring the locks on the relevant
directories?

To me, it looks like the s_vfs_rename_sem is for making the acquisition
of the multiple directory locks atomic to prevent deadlock.  But once you hold
the locks, couldnt you release the s_vfs_rename_sem?

For example:


/*
 * p1 and p2 should be directories on the same fs.
 */
struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
{
        struct dentry *p;

        if (p1 == p2) {
                down(&p1->d_inode->i_sem);
                return NULL;
        }

        down(&p1->d_inode->i_sb->s_vfs_rename_sem);

        for (p = p1; p->d_parent != p; p = p->d_parent) {
                if (p->d_parent == p2) {
                        down(&p2->d_inode->i_sem);
                        down(&p1->d_inode->i_sem);
			up(&p1->d_inode->i_sb->s_vfs_rename_sem);
                        return p;
                }
        }

        for (p = p2; p->d_parent != p; p = p->d_parent) {
                if (p->d_parent == p1) {
                        down(&p1->d_inode->i_sem);
                        down(&p2->d_inode->i_sem);
			up(&p1->d_inode->i_sb->s_vfs_rename_sem);
                        return p;
                }
        }

        down(&p1->d_inode->i_sem);
        down(&p2->d_inode->i_sem);
	up(&p1->d_inode->i_sb->s_vfs_rename_sem);

        return NULL;
}


void unlock_rename(struct dentry *p1, struct dentry *p2)
{
        up(&p1->d_inode->i_sem);
        if (p1 != p2) {
                up(&p2->d_inode->i_sem);
        }
}


If we could avoid holding the lock during the (potentially lengthy, depending on
the filesystem & storage) rename operation, it would help especially these
Maildir-based mail servers.

So, am I totally off my rocker here?

Regards,
Vito Caputo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/