2013-10-02 15:36:58

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH 1/2] fs/ext4: adding and initalizing new members of ext4_inode_info and ext4_sb_info

Adding new members, i_prev_oprhan to help decoupling the ondisk from the
in memory orphan list and i_mutex_orphan_mutex to serialize orphan list
updates on a single inode, to the ext4_inode_info structure.

Adding a new member, s_ondisk_oprhan_lock to protect the ondisk orphan list
and change s_orphan_lock from mutex to spinlock in the ext4_sb_info
structure in fs/ext4/inode.c.

Adding code to initialize newly added spinlock and mutex members in
fs/ext4/super.c.

Adding code to initialize the newly added i_orphan_mutex member of an
ext4_inode_info

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
fs/ext4/ext4.h | 5 ++++-
fs/ext4/inode.c | 1 +
fs/ext4/super.c | 4 +++-
3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b577e45..1b1232f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -817,6 +817,7 @@ struct ext4_inode_info {
struct rw_semaphore xattr_sem;

struct list_head i_orphan; /* unlinked but open inodes */
+ struct mutex i_orphan_mutex;

/*
* i_disksize keeps track of what the inode size is ON DISK, not
@@ -864,6 +865,7 @@ struct ext4_inode_info {
rwlock_t i_es_lock;
struct list_head i_es_lru;
unsigned int i_es_lru_nr; /* protected by i_es_lock */
+ unsigned int i_prev_orphan; /* protected by s_ondisk_orphan_lock */
unsigned long i_touch_when; /* jiffies of last accessing */

/* ialloc */
@@ -1201,7 +1203,8 @@ struct ext4_sb_info {
/* Journaling */
struct journal_s *s_journal;
struct list_head s_orphan;
- struct mutex s_orphan_lock;
+ spinlock_t s_orphan_lock;
+ struct mutex s_ondisk_orphan_lock;
unsigned long s_resize_flags; /* Flags indicating if there
is a resizer */
unsigned long s_commit_interval;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index dd32a2e..3edb108 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4061,6 +4061,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
for (block = 0; block < EXT4_N_BLOCKS; block++)
ei->i_data[block] = raw_inode->i_block[block];
INIT_LIST_HEAD(&ei->i_orphan);
+ mutex_init(&ei->i_orphan_mutex);

/*
* Set transaction id's of transactions that have to be committed
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 36b141e..2fecd19 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -920,6 +920,7 @@ static void init_once(void *foo)
struct ext4_inode_info *ei = (struct ext4_inode_info *) foo;

INIT_LIST_HEAD(&ei->i_orphan);
+ mutex_init(&ei->i_orphan_mutex);
init_rwsem(&ei->xattr_sem);
init_rwsem(&ei->i_data_sem);
inode_init_once(&ei->vfs_inode);
@@ -3841,7 +3842,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
memcpy(sb->s_uuid, es->s_uuid, sizeof(es->s_uuid));

INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
- mutex_init(&sbi->s_orphan_lock);
+ spin_lock_init(&sbi->s_orphan_lock);
+ mutex_init(&sbi->s_ondisk_orphan_lock);

sb->s_root = NULL;

--
1.7.11.3


2013-10-02 15:36:59

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH 2/2] fs/ext4/namei.c: reducing contention on s_orphan_lock mmutex

Instead of using a single per super block mutex, s_orphan_lock, to serialize
all orphan list updates, a separate mutex and spinlock are used to
protect the on disk and in memory orphan lists respecvitely.

At the same time, a per inode mutex is used to serialize orphan list
updates of a single inode.

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
fs/ext4/namei.c | 139 ++++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 100 insertions(+), 39 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 35f55a0..d7d3d0f 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2554,12 +2554,24 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
struct super_block *sb = inode->i_sb;
struct ext4_iloc iloc;
int err = 0, rc;
+ struct ext4_inode_info *ext4_inode = EXT4_I(inode);
+ struct ext4_sb_info *ext4_sb = EXT4_SB(sb);
+ struct mutex *inode_orphan_mutex;
+ spinlock_t *orphan_lock;
+ struct mutex *ondisk_orphan_mutex;
+ struct list_head *prev;
+ unsigned int next_inum;
+ struct inode *next_inode;

- if (!EXT4_SB(sb)->s_journal)
+ if (ext4_sb->s_journal)
return 0;

- mutex_lock(&EXT4_SB(sb)->s_orphan_lock);
- if (!list_empty(&EXT4_I(inode)->i_orphan))
+ inode_orphan_mutex = &ext4_inode->i_orphan_mutex;
+ orphan_lock = &ext4_sb->s_orphan_lock;
+ ondisk_orphan_mutex = &ext4_sb->s_ondisk_orphan_lock;
+ mutex_lock(inode_orphan_mutex);
+ prev = ext4_inode->i_orphan.prev;
+ if (prev != &ext4_inode->i_orphan)
goto out_unlock;

/*
@@ -2579,17 +2591,28 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
err = ext4_reserve_inode_write(handle, inode, &iloc);
if (err)
goto out_unlock;
+ mutex_lock(ondisk_orphan_mutex);
/*
* Due to previous errors inode may be already a part of on-disk
* orphan list. If so skip on-disk list modification.
*/
if (NEXT_ORPHAN(inode) && NEXT_ORPHAN(inode) <=
- (le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count)))
+ (le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))) {
+ mutex_unlock(ondisk_orphan_mutex);
goto mem_insert;
+ }

/* Insert this inode at the head of the on-disk orphan list... */
- NEXT_ORPHAN(inode) = le32_to_cpu(EXT4_SB(sb)->s_es->s_last_orphan);
- EXT4_SB(sb)->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
+ next_inum = le32_to_cpu(EXT4_SB(sb)->s_es->s_last_orphan);
+ NEXT_ORPHAN(inode) = next_inum;
+ next_inode = ilookup(sb, next_inum);
+ if (next_inode)
+ EXT4_I(next_inode)->i_prev_orphan = inode->i_ino;
+ ext4_sb->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
+ ext4_inode->i_prev_orphan = 0;
+ mutex_unlock(ondisk_orphan_mutex);
+ if (next_inode)
+ iput(next_inode);
err = ext4_handle_dirty_super(handle, sb);
rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
if (!err)
@@ -2604,14 +2627,17 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
* This is safe: on error we're going to ignore the orphan list
* anyway on the next recovery. */
mem_insert:
- if (!err)
- list_add(&EXT4_I(inode)->i_orphan, &EXT4_SB(sb)->s_orphan);
+ if (!err) {
+ spin_lock(orphan_lock);
+ list_add(&ext4_inode->i_orphan, &ext4_sb->s_orphan);
+ spin_unlock(orphan_lock);
+ }

jbd_debug(4, "superblock will point to %lu\n", inode->i_ino);
jbd_debug(4, "orphan inode %lu will point to %d\n",
inode->i_ino, NEXT_ORPHAN(inode));
out_unlock:
- mutex_unlock(&EXT4_SB(sb)->s_orphan_lock);
+ mutex_unlock(inode_orphan_mutex);
ext4_std_error(inode->i_sb, err);
return err;
}
@@ -2624,71 +2650,106 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
{
struct list_head *prev;
struct ext4_inode_info *ei = EXT4_I(inode);
- struct ext4_sb_info *sbi;
+ struct inode *i_next = NULL;
+ struct inode *i_prev = NULL;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
__u32 ino_next;
+ unsigned int ino_prev;
struct ext4_iloc iloc;
int err = 0;
+ struct mutex *inode_orphan_mutex;
+ spinlock_t *orphan_lock;
+ struct mutex *ondisk_orphan_mutex;

- if ((!EXT4_SB(inode->i_sb)->s_journal) &&
- !(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ORPHAN_FS))
+ if ((!sbi->s_journal) &&
+ !(sbi->s_mount_state & EXT4_ORPHAN_FS))
return 0;

- mutex_lock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
- if (list_empty(&ei->i_orphan))
- goto out;
-
- ino_next = NEXT_ORPHAN(inode);
+ inode_orphan_mutex = &ei->i_orphan_mutex;
+ orphan_lock = &sbi->s_orphan_lock;
+ ondisk_orphan_mutex = &sbi->s_ondisk_orphan_lock;
+ mutex_lock(inode_orphan_mutex);
prev = ei->i_orphan.prev;
- sbi = EXT4_SB(inode->i_sb);
+ if (prev == &ei->i_orphan) {
+ mutex_unlock(inode_orphan_mutex);
+ return err;
+ }

jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino);

+ spin_lock(orphan_lock);
list_del_init(&ei->i_orphan);
+ spin_unlock(orphan_lock);

/* If we're on an error path, we may not have a valid
* transaction handle with which to update the orphan list on
* disk, but we still need to remove the inode from the linked
* list in memory. */
- if (!handle)
- goto out;
+ if (!handle) {
+ mutex_unlock(inode_orphan_mutex);
+ return err;
+ }

err = ext4_reserve_inode_write(handle, inode, &iloc);
- if (err)
+ if (err) {
+ mutex_unlock(inode_orphan_mutex);
goto out_err;
+ }

- if (prev == &sbi->s_orphan) {
+ mutex_lock(ondisk_orphan_mutex);
+ ino_next = NEXT_ORPHAN(inode);
+ if (ino_next > 0) {
+ i_next = ilookup(inode->i_sb, ino_next);
+ assert(i_next);
+ }
+ ino_prev = ei->i_prev_orphan;
+ if (!ino_prev) {
jbd_debug(4, "superblock will point to %u\n", ino_next);
BUFFER_TRACE(sbi->s_sbh, "get_write_access");
err = ext4_journal_get_write_access(handle, sbi->s_sbh);
- if (err)
- goto out_brelse;
- sbi->s_es->s_last_orphan = cpu_to_le32(ino_next);
- err = ext4_handle_dirty_super(handle, inode->i_sb);
+ if (!err) {
+ sbi->s_es->s_last_orphan = cpu_to_le32(ino_next);
+ if (i_next)
+ EXT4_I(i_next)->i_prev_orphan = 0;
+ }
+ mutex_unlock(ondisk_orphan_mutex);
+ if (!err)
+ err = ext4_handle_dirty_super(handle, inode->i_sb);
} else {
struct ext4_iloc iloc2;
- struct inode *i_prev =
- &list_entry(prev, struct ext4_inode_info, i_orphan)->vfs_inode;
-
- jbd_debug(4, "orphan inode %lu will point to %u\n",
- i_prev->i_ino, ino_next);
- err = ext4_reserve_inode_write(handle, i_prev, &iloc2);
- if (err)
- goto out_brelse;
- NEXT_ORPHAN(i_prev) = ino_next;
- err = ext4_mark_iloc_dirty(handle, i_prev, &iloc2);
+ i_prev = ilookup(inode->i_sb, ino_prev);
+
+ assert(i_prev);
+ if (i_prev) {
+ jbd_debug(4, "orphan inode %lu will point to %u\n",
+ i_prev->i_ino, ino_next);
+ err = ext4_reserve_inode_write(handle, i_prev, &iloc2);
+ } else
+ err = -ENOENT;
+ if (!err) {
+ NEXT_ORPHAN(i_prev) = ino_next;
+ if (i_next)
+ EXT4_I(i_next)->i_prev_orphan = ino_prev;
+ }
+ mutex_unlock(ondisk_orphan_mutex);
+ if (i_prev && !err)
+ err = ext4_mark_iloc_dirty(handle, i_prev, &iloc2);
}
if (err)
goto out_brelse;
NEXT_ORPHAN(inode) = 0;
+ mutex_unlock(inode_orphan_mutex);
err = ext4_mark_iloc_dirty(handle, inode, &iloc);
-
out_err:
+ if (i_next)
+ iput(i_next);
+ if (i_prev)
+ iput(i_prev);
ext4_std_error(inode->i_sb, err);
-out:
- mutex_unlock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
return err;

out_brelse:
+ mutex_unlock(inode_orphan_mutex);
brelse(iloc.bh);
goto out_err;
}
--
1.7.11.3

2013-10-03 02:05:26

by Zheng Liu

[permalink] [raw]
Subject: Re: [PATCH 2/2] fs/ext4/namei.c: reducing contention on s_orphan_lock mmutex

Hello,

On Wed, Oct 02, 2013 at 09:36:59AM -0600, T Makphaibulchoke wrote:
> Instead of using a single per super block mutex, s_orphan_lock, to serialize
> all orphan list updates, a separate mutex and spinlock are used to
> protect the on disk and in memory orphan lists respecvitely.
>
> At the same time, a per inode mutex is used to serialize orphan list
> updates of a single inode.
>
> Signed-off-by: T. Makphaibulchoke <[email protected]>
> ---
> fs/ext4/namei.c | 139 ++++++++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 100 insertions(+), 39 deletions(-)
>
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 35f55a0..d7d3d0f 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -2554,12 +2554,24 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
> struct super_block *sb = inode->i_sb;
> struct ext4_iloc iloc;
> int err = 0, rc;
> + struct ext4_inode_info *ext4_inode = EXT4_I(inode);
> + struct ext4_sb_info *ext4_sb = EXT4_SB(sb);
> + struct mutex *inode_orphan_mutex;
> + spinlock_t *orphan_lock;
> + struct mutex *ondisk_orphan_mutex;
> + struct list_head *prev;
> + unsigned int next_inum;
> + struct inode *next_inode;
>
> - if (!EXT4_SB(sb)->s_journal)
> + if (ext4_sb->s_journal)
^^^^^^^^
typo: !ext4_sb->s_journal
I am not sure whether or not this will impact the result because when
journal is enabled the inode will not be added into orphan list.

Regards,
- Zheng

> return 0;
>
> - mutex_lock(&EXT4_SB(sb)->s_orphan_lock);
> - if (!list_empty(&EXT4_I(inode)->i_orphan))
> + inode_orphan_mutex = &ext4_inode->i_orphan_mutex;
> + orphan_lock = &ext4_sb->s_orphan_lock;
> + ondisk_orphan_mutex = &ext4_sb->s_ondisk_orphan_lock;
> + mutex_lock(inode_orphan_mutex);
> + prev = ext4_inode->i_orphan.prev;
> + if (prev != &ext4_inode->i_orphan)
> goto out_unlock;
>
> /*
> @@ -2579,17 +2591,28 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
> err = ext4_reserve_inode_write(handle, inode, &iloc);
> if (err)
> goto out_unlock;
> + mutex_lock(ondisk_orphan_mutex);
> /*
> * Due to previous errors inode may be already a part of on-disk
> * orphan list. If so skip on-disk list modification.
> */
> if (NEXT_ORPHAN(inode) && NEXT_ORPHAN(inode) <=
> - (le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count)))
> + (le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))) {
> + mutex_unlock(ondisk_orphan_mutex);
> goto mem_insert;
> + }
>
> /* Insert this inode at the head of the on-disk orphan list... */
> - NEXT_ORPHAN(inode) = le32_to_cpu(EXT4_SB(sb)->s_es->s_last_orphan);
> - EXT4_SB(sb)->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
> + next_inum = le32_to_cpu(EXT4_SB(sb)->s_es->s_last_orphan);
> + NEXT_ORPHAN(inode) = next_inum;
> + next_inode = ilookup(sb, next_inum);
> + if (next_inode)
> + EXT4_I(next_inode)->i_prev_orphan = inode->i_ino;
> + ext4_sb->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
> + ext4_inode->i_prev_orphan = 0;
> + mutex_unlock(ondisk_orphan_mutex);
> + if (next_inode)
> + iput(next_inode);
> err = ext4_handle_dirty_super(handle, sb);
> rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
> if (!err)
> @@ -2604,14 +2627,17 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
> * This is safe: on error we're going to ignore the orphan list
> * anyway on the next recovery. */
> mem_insert:
> - if (!err)
> - list_add(&EXT4_I(inode)->i_orphan, &EXT4_SB(sb)->s_orphan);
> + if (!err) {
> + spin_lock(orphan_lock);
> + list_add(&ext4_inode->i_orphan, &ext4_sb->s_orphan);
> + spin_unlock(orphan_lock);
> + }
>
> jbd_debug(4, "superblock will point to %lu\n", inode->i_ino);
> jbd_debug(4, "orphan inode %lu will point to %d\n",
> inode->i_ino, NEXT_ORPHAN(inode));
> out_unlock:
> - mutex_unlock(&EXT4_SB(sb)->s_orphan_lock);
> + mutex_unlock(inode_orphan_mutex);
> ext4_std_error(inode->i_sb, err);
> return err;
> }
> @@ -2624,71 +2650,106 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
> {
> struct list_head *prev;
> struct ext4_inode_info *ei = EXT4_I(inode);
> - struct ext4_sb_info *sbi;
> + struct inode *i_next = NULL;
> + struct inode *i_prev = NULL;
> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> __u32 ino_next;
> + unsigned int ino_prev;
> struct ext4_iloc iloc;
> int err = 0;
> + struct mutex *inode_orphan_mutex;
> + spinlock_t *orphan_lock;
> + struct mutex *ondisk_orphan_mutex;
>
> - if ((!EXT4_SB(inode->i_sb)->s_journal) &&
> - !(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ORPHAN_FS))
> + if ((!sbi->s_journal) &&
> + !(sbi->s_mount_state & EXT4_ORPHAN_FS))
> return 0;
>
> - mutex_lock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
> - if (list_empty(&ei->i_orphan))
> - goto out;
> -
> - ino_next = NEXT_ORPHAN(inode);
> + inode_orphan_mutex = &ei->i_orphan_mutex;
> + orphan_lock = &sbi->s_orphan_lock;
> + ondisk_orphan_mutex = &sbi->s_ondisk_orphan_lock;
> + mutex_lock(inode_orphan_mutex);
> prev = ei->i_orphan.prev;
> - sbi = EXT4_SB(inode->i_sb);
> + if (prev == &ei->i_orphan) {
> + mutex_unlock(inode_orphan_mutex);
> + return err;
> + }
>
> jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino);
>
> + spin_lock(orphan_lock);
> list_del_init(&ei->i_orphan);
> + spin_unlock(orphan_lock);
>
> /* If we're on an error path, we may not have a valid
> * transaction handle with which to update the orphan list on
> * disk, but we still need to remove the inode from the linked
> * list in memory. */
> - if (!handle)
> - goto out;
> + if (!handle) {
> + mutex_unlock(inode_orphan_mutex);
> + return err;
> + }
>
> err = ext4_reserve_inode_write(handle, inode, &iloc);
> - if (err)
> + if (err) {
> + mutex_unlock(inode_orphan_mutex);
> goto out_err;
> + }
>
> - if (prev == &sbi->s_orphan) {
> + mutex_lock(ondisk_orphan_mutex);
> + ino_next = NEXT_ORPHAN(inode);
> + if (ino_next > 0) {
> + i_next = ilookup(inode->i_sb, ino_next);
> + assert(i_next);
> + }
> + ino_prev = ei->i_prev_orphan;
> + if (!ino_prev) {
> jbd_debug(4, "superblock will point to %u\n", ino_next);
> BUFFER_TRACE(sbi->s_sbh, "get_write_access");
> err = ext4_journal_get_write_access(handle, sbi->s_sbh);
> - if (err)
> - goto out_brelse;
> - sbi->s_es->s_last_orphan = cpu_to_le32(ino_next);
> - err = ext4_handle_dirty_super(handle, inode->i_sb);
> + if (!err) {
> + sbi->s_es->s_last_orphan = cpu_to_le32(ino_next);
> + if (i_next)
> + EXT4_I(i_next)->i_prev_orphan = 0;
> + }
> + mutex_unlock(ondisk_orphan_mutex);
> + if (!err)
> + err = ext4_handle_dirty_super(handle, inode->i_sb);
> } else {
> struct ext4_iloc iloc2;
> - struct inode *i_prev =
> - &list_entry(prev, struct ext4_inode_info, i_orphan)->vfs_inode;
> -
> - jbd_debug(4, "orphan inode %lu will point to %u\n",
> - i_prev->i_ino, ino_next);
> - err = ext4_reserve_inode_write(handle, i_prev, &iloc2);
> - if (err)
> - goto out_brelse;
> - NEXT_ORPHAN(i_prev) = ino_next;
> - err = ext4_mark_iloc_dirty(handle, i_prev, &iloc2);
> + i_prev = ilookup(inode->i_sb, ino_prev);
> +
> + assert(i_prev);
> + if (i_prev) {
> + jbd_debug(4, "orphan inode %lu will point to %u\n",
> + i_prev->i_ino, ino_next);
> + err = ext4_reserve_inode_write(handle, i_prev, &iloc2);
> + } else
> + err = -ENOENT;
> + if (!err) {
> + NEXT_ORPHAN(i_prev) = ino_next;
> + if (i_next)
> + EXT4_I(i_next)->i_prev_orphan = ino_prev;
> + }
> + mutex_unlock(ondisk_orphan_mutex);
> + if (i_prev && !err)
> + err = ext4_mark_iloc_dirty(handle, i_prev, &iloc2);
> }
> if (err)
> goto out_brelse;
> NEXT_ORPHAN(inode) = 0;
> + mutex_unlock(inode_orphan_mutex);
> err = ext4_mark_iloc_dirty(handle, inode, &iloc);
> -
> out_err:
> + if (i_next)
> + iput(i_next);
> + if (i_prev)
> + iput(i_prev);
> ext4_std_error(inode->i_sb, err);
> -out:
> - mutex_unlock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
> return err;
>
> out_brelse:
> + mutex_unlock(inode_orphan_mutex);
> brelse(iloc.bh);
> goto out_err;
> }
> --
> 1.7.11.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Subject: Re: [PATCH 2/2] fs/ext4/namei.c: reducing contention on s_orphan_lock mmutex

On 10/02/2013 08:05 PM, Zheng Liu wrote:
> Hello,
>

>> - if (!EXT4_SB(sb)->s_journal)
>> + if (ext4_sb->s_journal)
> ^^^^^^^^
> typo: !ext4_sb->s_journal
> I am not sure whether or not this will impact the result because when
> journal is enabled the inode will not be added into orphan list.
>
> Regards,
> - Zheng
>
Thanks Zheng for pointing out the problem. Will fix the problem and rerun the test.

Thanks,
Mak.

2013-10-04 00:37:14

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 1/2] fs/ext4: adding and initalizing new members of ext4_inode_info and ext4_sb_info

On 2013-10-02, at 9:36 AM, T Makphaibulchoke wrote:
> Adding new members, i_prev_oprhan to help decoupling the ondisk from the
> in memory orphan list and i_mutex_orphan_mutex to serialize orphan list
> updates on a single inode, to the ext4_inode_info structure.

What do these additional fields do to the size of struct ext4_inode_info?
I recall that Ted did a bunch of work to shrink this enough to fit nicely
into a slab, and it would be a shame to increase the inode size to overflow
the current packing and increase per-inode memory usage by 25-33%, for an
improvement that is only noticeable on a 90-core machine.

Is there another lock that could be shared for this that is unlikely to
cause much contention?

Also, it isn't clear to me why this patch is separate from 2/2, because
all it does is add fields that are not used for anything. I don't think
the 8 lines of code here are so complex that they can't be part of the
same patch that is actually using them.

Cheers, Andreas

> Adding a new member, s_ondisk_oprhan_lock to protect the ondisk orphan list
> and change s_orphan_lock from mutex to spinlock in the ext4_sb_info
> structure in fs/ext4/inode.c.
>
> Adding code to initialize newly added spinlock and mutex members in
> fs/ext4/super.c.
>
> Adding code to initialize the newly added i_orphan_mutex member of an
> ext4_inode_info
>
> Signed-off-by: T. Makphaibulchoke <[email protected]>
> ---
> fs/ext4/ext4.h | 5 ++++-
> fs/ext4/inode.c | 1 +
> fs/ext4/super.c | 4 +++-
> 3 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index b577e45..1b1232f 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -817,6 +817,7 @@ struct ext4_inode_info {
> struct rw_semaphore xattr_sem;
>
> struct list_head i_orphan; /* unlinked but open inodes */
> + struct mutex i_orphan_mutex;
>
> /*
> * i_disksize keeps track of what the inode size is ON DISK, not
> @@ -864,6 +865,7 @@ struct ext4_inode_info {
> rwlock_t i_es_lock;
> struct list_head i_es_lru;
> unsigned int i_es_lru_nr; /* protected by i_es_lock */
> + unsigned int i_prev_orphan; /* protected by s_ondisk_orphan_lock */
> unsigned long i_touch_when; /* jiffies of last accessing */
>
> /* ialloc */
> @@ -1201,7 +1203,8 @@ struct ext4_sb_info {
> /* Journaling */
> struct journal_s *s_journal;
> struct list_head s_orphan;
> - struct mutex s_orphan_lock;
> + spinlock_t s_orphan_lock;
> + struct mutex s_ondisk_orphan_lock;
> unsigned long s_resize_flags; /* Flags indicating if there
> is a resizer */
> unsigned long s_commit_interval;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index dd32a2e..3edb108 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4061,6 +4061,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
> for (block = 0; block < EXT4_N_BLOCKS; block++)
> ei->i_data[block] = raw_inode->i_block[block];
> INIT_LIST_HEAD(&ei->i_orphan);
> + mutex_init(&ei->i_orphan_mutex);
>
> /*
> * Set transaction id's of transactions that have to be committed
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 36b141e..2fecd19 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -920,6 +920,7 @@ static void init_once(void *foo)
> struct ext4_inode_info *ei = (struct ext4_inode_info *) foo;
>
> INIT_LIST_HEAD(&ei->i_orphan);
> + mutex_init(&ei->i_orphan_mutex);
> init_rwsem(&ei->xattr_sem);
> init_rwsem(&ei->i_data_sem);
> inode_init_once(&ei->vfs_inode);
> @@ -3841,7 +3842,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> memcpy(sb->s_uuid, es->s_uuid, sizeof(es->s_uuid));
>
> INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
> - mutex_init(&sbi->s_orphan_lock);
> + spin_lock_init(&sbi->s_orphan_lock);
> + mutex_init(&sbi->s_ondisk_orphan_lock);
>
> sb->s_root = NULL;
>
> --
> 1.7.11.3
>


Cheers, Andreas

2013-10-04 00:41:51

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 2/2] fs/ext4/namei.c: reducing contention on s_orphan_lock mmutex


On 2013-10-02, at 9:36 AM, T Makphaibulchoke wrote:

> Instead of using a single per super block mutex, s_orphan_lock, to serialize
> all orphan list updates, a separate mutex and spinlock are used to
> protect the on disk and in memory orphan lists respecvitely.
>
> At the same time, a per inode mutex is used to serialize orphan list
> updates of a single inode.
>
> Signed-off-by: T. Makphaibulchoke <[email protected]>
> ---
> fs/ext4/namei.c | 139 ++++++++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 100 insertions(+), 39 deletions(-)
>
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 35f55a0..d7d3d0f 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -2554,12 +2554,24 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
> struct super_block *sb = inode->i_sb;
> struct ext4_iloc iloc;
> int err = 0, rc;
> + struct ext4_inode_info *ext4_inode = EXT4_I(inode);
> + struct ext4_sb_info *ext4_sb = EXT4_SB(sb);
> + struct mutex *inode_orphan_mutex;
> + spinlock_t *orphan_lock;
> + struct mutex *ondisk_orphan_mutex;
> + struct list_head *prev;
> + unsigned int next_inum;
> + struct inode *next_inode;

Stack space in the kernel is not so abundant that all (or any?) of these
should get their own local variable.

>
> - if (!EXT4_SB(sb)->s_journal)
> + if (ext4_sb->s_journal)
> return 0;
>
> - mutex_lock(&EXT4_SB(sb)->s_orphan_lock);
> - if (!list_empty(&EXT4_I(inode)->i_orphan))
> + inode_orphan_mutex = &ext4_inode->i_orphan_mutex;
> + orphan_lock = &ext4_sb->s_orphan_lock;
> + ondisk_orphan_mutex = &ext4_sb->s_ondisk_orphan_lock;
> + mutex_lock(inode_orphan_mutex);
> + prev = ext4_inode->i_orphan.prev;
> + if (prev != &ext4_inode->i_orphan)
> goto out_unlock;
>
> /*
> @@ -2579,17 +2591,28 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
> err = ext4_reserve_inode_write(handle, inode, &iloc);
> if (err)
> goto out_unlock;
> + mutex_lock(ondisk_orphan_mutex);
> /*
> * Due to previous errors inode may be already a part of on-disk
> * orphan list. If so skip on-disk list modification.
> */
> if (NEXT_ORPHAN(inode) && NEXT_ORPHAN(inode) <=
> - (le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count)))
> + (le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))) {
> + mutex_unlock(ondisk_orphan_mutex);
> goto mem_insert;
> + }
>
> /* Insert this inode at the head of the on-disk orphan list... */
> - NEXT_ORPHAN(inode) = le32_to_cpu(EXT4_SB(sb)->s_es->s_last_orphan);
> - EXT4_SB(sb)->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
> + next_inum = le32_to_cpu(EXT4_SB(sb)->s_es->s_last_orphan);
> + NEXT_ORPHAN(inode) = next_inum;
> + next_inode = ilookup(sb, next_inum);
> + if (next_inode)
> + EXT4_I(next_inode)->i_prev_orphan = inode->i_ino;
> + ext4_sb->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
> + ext4_inode->i_prev_orphan = 0;
> + mutex_unlock(ondisk_orphan_mutex);
> + if (next_inode)
> + iput(next_inode);
> err = ext4_handle_dirty_super(handle, sb);
> rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
> if (!err)
> @@ -2604,14 +2627,17 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
> * This is safe: on error we're going to ignore the orphan list
> * anyway on the next recovery. */
> mem_insert:
> - if (!err)
> - list_add(&EXT4_I(inode)->i_orphan, &EXT4_SB(sb)->s_orphan);
> + if (!err) {
> + spin_lock(orphan_lock);
> + list_add(&ext4_inode->i_orphan, &ext4_sb->s_orphan);
> + spin_unlock(orphan_lock);
> + }
>
> jbd_debug(4, "superblock will point to %lu\n", inode->i_ino);
> jbd_debug(4, "orphan inode %lu will point to %d\n",
> inode->i_ino, NEXT_ORPHAN(inode));
> out_unlock:
> - mutex_unlock(&EXT4_SB(sb)->s_orphan_lock);
> + mutex_unlock(inode_orphan_mutex);
> ext4_std_error(inode->i_sb, err);
> return err;
> }
> @@ -2624,71 +2650,106 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
> {
> struct list_head *prev;
> struct ext4_inode_info *ei = EXT4_I(inode);
> - struct ext4_sb_info *sbi;
> + struct inode *i_next = NULL;
> + struct inode *i_prev = NULL;
> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> __u32 ino_next;
> + unsigned int ino_prev;
> struct ext4_iloc iloc;
> int err = 0;
> + struct mutex *inode_orphan_mutex;
> + spinlock_t *orphan_lock;
> + struct mutex *ondisk_orphan_mutex;

Same here.

> - if ((!EXT4_SB(inode->i_sb)->s_journal) &&
> - !(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ORPHAN_FS))
> + if ((!sbi->s_journal) &&
> + !(sbi->s_mount_state & EXT4_ORPHAN_FS))
> return 0;
>
> - mutex_lock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
> - if (list_empty(&ei->i_orphan))
> - goto out;
> -
> - ino_next = NEXT_ORPHAN(inode);
> + inode_orphan_mutex = &ei->i_orphan_mutex;
> + orphan_lock = &sbi->s_orphan_lock;
> + ondisk_orphan_mutex = &sbi->s_ondisk_orphan_lock;
> + mutex_lock(inode_orphan_mutex);
> prev = ei->i_orphan.prev;
> - sbi = EXT4_SB(inode->i_sb);
> + if (prev == &ei->i_orphan) {
> + mutex_unlock(inode_orphan_mutex);
> + return err;
> + }
>
> jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino);
>
> + spin_lock(orphan_lock);
> list_del_init(&ei->i_orphan);
> + spin_unlock(orphan_lock);
>
> /* If we're on an error path, we may not have a valid
> * transaction handle with which to update the orphan list on
> * disk, but we still need to remove the inode from the linked
> * list in memory. */
> - if (!handle)
> - goto out;
> + if (!handle) {
> + mutex_unlock(inode_orphan_mutex);
> + return err;
> + }
>
> err = ext4_reserve_inode_write(handle, inode, &iloc);
> - if (err)
> + if (err) {
> + mutex_unlock(inode_orphan_mutex);
> goto out_err;
> + }
>
> - if (prev == &sbi->s_orphan) {
> + mutex_lock(ondisk_orphan_mutex);
> + ino_next = NEXT_ORPHAN(inode);
> + if (ino_next > 0) {
> + i_next = ilookup(inode->i_sb, ino_next);
> + assert(i_next);
> + }
> + ino_prev = ei->i_prev_orphan;
> + if (!ino_prev) {
> jbd_debug(4, "superblock will point to %u\n", ino_next);
> BUFFER_TRACE(sbi->s_sbh, "get_write_access");
> err = ext4_journal_get_write_access(handle, sbi->s_sbh);
> - if (err)
> - goto out_brelse;
> - sbi->s_es->s_last_orphan = cpu_to_le32(ino_next);
> - err = ext4_handle_dirty_super(handle, inode->i_sb);
> + if (!err) {
> + sbi->s_es->s_last_orphan = cpu_to_le32(ino_next);
> + if (i_next)
> + EXT4_I(i_next)->i_prev_orphan = 0;
> + }
> + mutex_unlock(ondisk_orphan_mutex);
> + if (!err)
> + err = ext4_handle_dirty_super(handle, inode->i_sb);
> } else {
> struct ext4_iloc iloc2;
> - struct inode *i_prev =
> - &list_entry(prev, struct ext4_inode_info, i_orphan)->vfs_inode;
> -
> - jbd_debug(4, "orphan inode %lu will point to %u\n",
> - i_prev->i_ino, ino_next);
> - err = ext4_reserve_inode_write(handle, i_prev, &iloc2);
> - if (err)
> - goto out_brelse;
> - NEXT_ORPHAN(i_prev) = ino_next;
> - err = ext4_mark_iloc_dirty(handle, i_prev, &iloc2);
> + i_prev = ilookup(inode->i_sb, ino_prev);
> +
> + assert(i_prev);
> + if (i_prev) {
> + jbd_debug(4, "orphan inode %lu will point to %u\n",
> + i_prev->i_ino, ino_next);
> + err = ext4_reserve_inode_write(handle, i_prev, &iloc2);
> + } else
> + err = -ENOENT;
> + if (!err) {
> + NEXT_ORPHAN(i_prev) = ino_next;
> + if (i_next)
> + EXT4_I(i_next)->i_prev_orphan = ino_prev;
> + }
> + mutex_unlock(ondisk_orphan_mutex);
> + if (i_prev && !err)
> + err = ext4_mark_iloc_dirty(handle, i_prev, &iloc2);
> }
> if (err)
> goto out_brelse;
> NEXT_ORPHAN(inode) = 0;
> + mutex_unlock(inode_orphan_mutex);
> err = ext4_mark_iloc_dirty(handle, inode, &iloc);
> -
> out_err:
> + if (i_next)
> + iput(i_next);
> + if (i_prev)
> + iput(i_prev);
> ext4_std_error(inode->i_sb, err);
> -out:
> - mutex_unlock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
> return err;
>
> out_brelse:
> + mutex_unlock(inode_orphan_mutex);
> brelse(iloc.bh);
> goto out_err;
> }
> --
> 1.7.11.3
>


Cheers, Andreas






Subject: Re: [PATCH 2/2] fs/ext4/namei.c: reducing contention on s_orphan_lock mmutex

On 10/03/2013 06:41 PM, Andreas Dilger wrote:

>> + struct inode *next_inode;
>
> Stack space in the kernel is not so abundant that all (or any?) of these
> should get their own local variable.
>
>>
>> - if (!EXT4_SB(sb)->s_journal)
>
> Same here.
>
>
> Cheers, Andreas

Thanks Andreas for the comments. On larger machines with processors with lots of register, with the compiler optimization I don't think it matters whether stack variables or repeated common expressions are used. On smaller machines with processors with limited number of registers, this will be a problem. I'll fix these on my rework.

Thanks,
Mak.

Subject: Re: [PATCH 1/2] fs/ext4: adding and initalizing new members of ext4_inode_info and ext4_sb_info

On 10/03/2013 06:37 PM, Andreas Dilger wrote:
> On 2013-10-02, at 9:36 AM, T Makphaibulchoke wrote:
>
> What do these additional fields do to the size of struct ext4_inode_info?
> I recall that Ted did a bunch of work to shrink this enough to fit nicely
> into a slab, and it would be a shame to increase the inode size to overflow
> the current packing and increase per-inode memory usage by 25-33%, for an
> improvement that is only noticeable on a 90-core machine.
>
> Is there another lock that could be shared for this that is unlikely to
> cause much contention?

Thanks for the suggestion. I was also thinking about this earlier, not sure if it's a good practice. Looks like it is way better than increasing the inode size. Will look into this in my rework.

>
> Also, it isn't clear to me why this patch is separate from 2/2, because
> all it does is add fields that are not used for anything. I don't think
> the 8 lines of code here are so complex that they can't be part of the
> same patch that is actually using them.
>
> Cheers, Andreas
>

I was debating whether to combine them into 1 or make them 2 patches. I'll combine them into one patch in my next submittal.

Thanks,
Mak.