2013-10-02 15:38:26

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH 0/2] fs/ext4: increase parallelism in updating ext4 orphan list

Instead of allowing only a single atomic update (of both the in-memory and
on-disk orphan lists) of an ext4 filesystem's orphan list via the
s_orphan_lock mutex, this patch allows multiple concurrent updates of the
orphan list, while still maintaining the integrity of both the in-memory and
on-disk orphan lists across each update.

This is accomplished by using a per-inode mutex to serialize the orphan
list updates of a single inode, and a mutex and a spinlock to serialize
the on-disk and in-memory orphan lists respectively.

Here are some of the benchmark results with the changes.

On a 90 core machine:

Here are the performance improvements in some of the aim7 workloads,

---------------------------
| | % increase |
---------------------------
| alltests | 9.56 |
---------------------------
| custom | 12.20 |
---------------------------
| fserver | 15.99 |
---------------------------
| new_dbase | 1.73 |
---------------------------
| new_fserver | 17.56 |
---------------------------
| shared | 6.24 |
---------------------------
For the Swingbench DSS workload,

-------------------------------------------------------------------------
| Users | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 |
-------------------------------------------------------------------------
| % improvement| 7.67 | 9.43 | 7.30 | 0.58 | 0.53 |-2.62 |-3.72 | 3.77 |
| without using | | | | | | | | |
| shared memory | | | | | | | | |
-------------------------------------------------------------------------

On an 8 core machine:

Here are the performance data from some of the aim7 workloads,

---------------------------
| | % increase |
---------------------------
| alltests | 3.90 |
---------------------------
| custom | 1.66 |
---------------------------
| dbase | -2.00 |
---------------------------
| fserver | 1.80 |
---------------------------
| new_dbase | -1.90 |
---------------------------
| new_fserver | 2.18 |
---------------------------
| shared | 7.46 |
---------------------------
For the Swingbench DSS workload,

-------------------------------------------------------------------------
| Users | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 |
-------------------------------------------------------------------------
| % improvement|-1.32 | 6.45 | 1.18 |-3.13 |-1.13 | 4.68 | 5.75 |-0.37 |
| without using | | | | | | | | |
| shared memory | | | | | | | | |
-------------------------------------------------------------------------

T Makphaibulchoke (2):
fs/ext4: adding and initializing new members of ext4_inode_info and
ext4_sb_info
fs/ext4/namei.c: reducing contention on s_orphan_lock mutex

fs/ext4/ext4.h | 5 +-
fs/ext4/inode.c | 1 +
fs/ext4/namei.c | 139 ++++++++++++++++++++++++++++++++++++++++----------------
fs/ext4/super.c | 4 +-
4 files changed, 108 insertions(+), 41 deletions(-)

--
1.7.11.3



2013-10-04 00:28:13

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 0/2] fs/ext4: increase parallelism in updating ext4 orphan list

On 2013-10-02, at 9:38 AM, T Makphaibulchoke wrote:
> Instead of allowing only a single atomic update (of both the in-memory and
> on-disk orphan lists) of an ext4 filesystem's orphan list via the
> s_orphan_lock mutex, this patch allows multiple concurrent updates of the
> orphan list, while still maintaining the integrity of both the in-memory and
> on-disk orphan lists across each update.
>
> This is accomplished by using a per-inode mutex to serialize the orphan
> list updates of a single inode, and a mutex and a spinlock to serialize
> the on-disk and in-memory orphan lists respectively.

It would also be possible to have a completely contention-free orphan
inode list by only generating the on-disk orphan linked list in a
pre-commit callback hook from an efficient in-memory list. That would
allow the common "add to orphan list; do something; remove from list"
operations within a single transaction to run with minimal contention,
and only the few rare cases of operations that exceed the lifetime of
a single transaction would need to modify the on-disk list.

For example, a per-cpu list would be quite efficient, or a hash table.
Then, a jbd2 callback run before the transaction commits could modify
the requisite inodes and superblock. All of those inodes are already
(by definition) part of the transaction, so it won't add new buffers
to the transaction.

I'm not necessarily against the current patch, just thinking aloud about
how it might be improved further.

Cheers, Andreas







Subject: Re: [PATCH 0/2] fs/ext4: increase parallelism in updating ext4 orphan list

On 10/03/2013 06:28 PM, Andreas Dilger wrote:
>
> It would also be possible to have a completely contention-free orphan
> inode list by only generating the on-disk orphan linked list in a
> pre-commit callback hook from an efficient in-memory list. That would
> allow the common "add to orphan list; do something; remove from list"
> operations within a single transaction to run with minimal contention,
> and only the few rare cases of operations that exceed the lifetime of
> a single transaction would need to modify the on-disk list.
>
> For example, a per-cpu list would be quite efficient, or a hash table.
> Then, a jbd2 callback run before the transaction commits could modify
> the requisite inodes and superblock. All of those inodes are already
> (by definition) part of the transaction, so it won't add new buffers
> of the transaction.
>
> I'm not necessarily against the current patch, just thinking aloud about
> how it might be improved further.
>
> Cheers, Andreas
>

Thanks again for the suggestion. I'll rework this patch first and look into this possibility next.

Thanks,
Mak.

2014-04-02 16:29:47

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH v2] fs/ext4: increase parallelism in updating ext4 orphan list

Instead of allowing only a single atomic update (of both the in-memory and
on-disk orphan lists) of an ext4 filesystem's orphan list via the
s_orphan_lock mutex, this patch allows multiple concurrent updates of the
orphan list, while still maintaining the integrity of both the in-memory and
on-disk orphan lists across each update.

This is accomplished by using a mutex from an array of mutexes, indexed by
a hash of the inode, to serialize the orphan list updates of a single inode,
and a mutex and a spinlock to serialize the on-disk and in-memory orphan
lists respectively.

Here are some of the benchmark results with the changes.

On a 120 core machine,

---------------------------
| | % increase |
---------------------------
| alltests | 19.29 |
---------------------------
| custom | 11.10 |
---------------------------
| disk | 5.09 |
---------------------------
| fserver | 12.47 |
---------------------------
| new_dbase | 7.49 |
---------------------------
| shared | 16.55 |
---------------------------

On an 80 core machine,

---------------------------
| | % increase |
---------------------------
| alltests | 30.09 |
---------------------------
| custom | 22.66 |
---------------------------
| dbase | 3.28 |
---------------------------
| disk | 3.15 |
---------------------------
| fserver | 24.28 |
---------------------------
| high_systime| 6.79 |
---------------------------
| new_dbase | 4.63 |
---------------------------
| new_fserver | 28.40 |
---------------------------
| shared | 20.42 |
---------------------------

On an 8 core machine:

---------------------------
| | % increase |
---------------------------
| fserver | 9.11 |
---------------------------
| new_fserver | 3.45 |
---------------------------

For the Swingbench DSS workload,

-----------------------------------------------------------------------------
| Users | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 |
-----------------------------------------------------------------------------
| % improve- | 6.15 | 5.49 | 3.13 | 1.06 | 2.31 | 6.29 | 6.50 |-0.6 | 1.72 |
| ment with | | | | | | | | | |
| out using | | | | | | | | | |
| /dev/shm | | | | | | | | | |
-----------------------------------------------------------------------------

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
Changed in v2:
- New performance data
- Fixed problem in v1
- No changes in ext4_inode_info size
- Used an array of mutexes, indexed by hashing of an inode, to serialize
operations within a single inode
- Combined multiple patches into one.
---
fs/ext4/ext4.h | 5 ++-
fs/ext4/namei.c | 128 +++++++++++++++++++++++++++++++++++---------------------
fs/ext4/super.c | 12 +++++-
3 files changed, 95 insertions(+), 50 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d3a534f..cb962d6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -77,6 +77,7 @@
#define EXT4_ERROR_FILE(file, block, fmt, a...) \
ext4_error_file((file), __func__, __LINE__, (block), (fmt), ## a)

+#define EXT4_ORPHAN_OP_BITS 7
/* data type for block offset of block group */
typedef int ext4_grpblk_t;

@@ -1223,7 +1224,9 @@ struct ext4_sb_info {
/* Journaling */
struct journal_s *s_journal;
struct list_head s_orphan;
- struct mutex s_orphan_lock;
+ spinlock_t s_orphan_lock;
+ struct mutex s_ondisk_orphan_lock;
+ struct mutex *s_orphan_op_mutex;
unsigned long s_resize_flags; /* Flags indicating if there
is a resizer */
unsigned long s_commit_interval;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index d050e04..c7db475 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -48,6 +48,13 @@
#define NAMEI_RA_BLOCKS 4
#define NAMEI_RA_SIZE (NAMEI_RA_CHUNKS * NAMEI_RA_BLOCKS)

+#define ORPHAN_OP_INDEX(ext4_i) \
+ (hash_long((unsigned long)ext4_i, EXT4_ORPHAN_OP_BITS))
+#define LOCK_EXT4I_ORPHAN(ext4_sb, ext4_i) \
+ mutex_lock(&ext4_sb->s_orphan_op_mutex[ORPHAN_OP_INDEX(ext4_i)])
+#define UNLOCK_EXT4I_ORPHAN(ext4_sb, ext4_i) \
+ mutex_unlock(&ext4_sb->s_orphan_op_mutex[ORPHAN_OP_INDEX(ext4_i)])
+
static struct buffer_head *ext4_append(handle_t *handle,
struct inode *inode,
ext4_lblk_t *block)
@@ -2556,7 +2563,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
if (!EXT4_SB(sb)->s_journal)
return 0;

- mutex_lock(&EXT4_SB(sb)->s_orphan_lock);
+ LOCK_EXT4I_ORPHAN(EXT4_SB(sb), EXT4_I(inode));
if (!list_empty(&EXT4_I(inode)->i_orphan))
goto out_unlock;

@@ -2582,9 +2589,21 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
* orphan list. If so skip on-disk list modification.
*/
if (NEXT_ORPHAN(inode) && NEXT_ORPHAN(inode) <=
- (le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count)))
- goto mem_insert;
-
+ (le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))) {
+ brelse(iloc.bh);
+ spin_lock(&EXT4_SB(sb)->s_orphan_lock);
+ list_add(&EXT4_I(inode)->i_orphan, &EXT4_SB(sb)->
+ s_orphan);
+ spin_unlock(&EXT4_SB(sb)->s_orphan_lock);
+ jbd_debug(4, "superblock will point to %lu\n",
+ inode->i_ino);
+ jbd_debug(4, "orphan inode %lu will point to %d\n",
+ inode->i_ino, NEXT_ORPHAN(inode));
+ goto out_unlock;
+ }
+
+ mutex_lock(&EXT4_SB(sb)->s_ondisk_orphan_lock);
+ spin_lock(&EXT4_SB(sb)->s_orphan_lock);
/* Insert this inode at the head of the on-disk orphan list... */
NEXT_ORPHAN(inode) = le32_to_cpu(EXT4_SB(sb)->s_es->s_last_orphan);
EXT4_SB(sb)->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
@@ -2592,24 +2611,25 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
if (!err)
err = rc;
-
- /* Only add to the head of the in-memory list if all the
- * previous operations succeeded. If the orphan_add is going to
- * fail (possibly taking the journal offline), we can't risk
- * leaving the inode on the orphan list: stray orphan-list
- * entries can cause panics at unmount time.
- *
- * This is safe: on error we're going to ignore the orphan list
- * anyway on the next recovery. */
-mem_insert:
- if (!err)
+ if (!err) {
+ /* Only add to the head of the in-memory list if all the
+ * previous operations succeeded. If the orphan_add is going to
+ * fail (possibly taking the journal offline), we can't risk
+ * leaving the inode on the orphan list: stray orphan-list
+ * entries can cause panics at unmount time.
+ *
+ * This is safe: on error we're going to ignore the orphan list
+ * anyway on the next recovery. */
list_add(&EXT4_I(inode)->i_orphan, &EXT4_SB(sb)->s_orphan);
-
- jbd_debug(4, "superblock will point to %lu\n", inode->i_ino);
- jbd_debug(4, "orphan inode %lu will point to %d\n",
+ spin_unlock(&EXT4_SB(sb)->s_orphan_lock);
+ jbd_debug(4, "superblock will point to %lu\n", inode->i_ino);
+ jbd_debug(4, "orphan inode %lu will point to %d\n",
inode->i_ino, NEXT_ORPHAN(inode));
+ }
+ mutex_unlock(&EXT4_SB(sb)->s_ondisk_orphan_lock);
+
out_unlock:
- mutex_unlock(&EXT4_SB(sb)->s_orphan_lock);
+ UNLOCK_EXT4I_ORPHAN(EXT4_SB(sb), EXT4_I(inode));
ext4_std_error(inode->i_sb, err);
return err;
}
@@ -2622,45 +2642,56 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
{
struct list_head *prev;
struct ext4_inode_info *ei = EXT4_I(inode);
- struct ext4_sb_info *sbi;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
__u32 ino_next;
struct ext4_iloc iloc;
int err = 0;

- if ((!EXT4_SB(inode->i_sb)->s_journal) &&
- !(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ORPHAN_FS))
+ if ((!sbi->s_journal) &&
+ !(sbi->s_mount_state & EXT4_ORPHAN_FS))
return 0;

- mutex_lock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
- if (list_empty(&ei->i_orphan))
- goto out;
-
- ino_next = NEXT_ORPHAN(inode);
- prev = ei->i_orphan.prev;
- sbi = EXT4_SB(inode->i_sb);
-
- jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino);
+ LOCK_EXT4I_ORPHAN(sbi, ei);
+ if (list_empty(&ei->i_orphan)) {
+ UNLOCK_EXT4I_ORPHAN(sbi, ei);
+ return 0;
+ }

- list_del_init(&ei->i_orphan);

/* If we're on an error path, we may not have a valid
* transaction handle with which to update the orphan list on
* disk, but we still need to remove the inode from the linked
* list in memory. */
- if (!handle)
- goto out;
+ if (!handle) {
+ jbd_debug(4, "remove inode %lu from orphan list\n",
+ inode->i_ino);
+ spin_lock(&sbi->s_orphan_lock);
+ list_del_init(&ei->i_orphan);
+ spin_unlock(&sbi->s_orphan_lock);
+ } else
+ err = ext4_reserve_inode_write(handle, inode, &iloc);

- err = ext4_reserve_inode_write(handle, inode, &iloc);
- if (err)
- goto out_err;
+ if (!handle || err) {
+ UNLOCK_EXT4I_ORPHAN(sbi, ei);
+ goto out_set_err;
+ }

+ mutex_lock(&sbi->s_ondisk_orphan_lock);
+ ino_next = NEXT_ORPHAN(inode);
+ spin_lock(&sbi->s_orphan_lock);
+ prev = ei->i_orphan.prev;
+ jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino);
+ list_del_init(&ei->i_orphan);
+ spin_unlock(&sbi->s_orphan_lock);
if (prev == &sbi->s_orphan) {
jbd_debug(4, "superblock will point to %u\n", ino_next);
BUFFER_TRACE(sbi->s_sbh, "get_write_access");
err = ext4_journal_get_write_access(handle, sbi->s_sbh);
+ if (!err)
+ sbi->s_es->s_last_orphan = cpu_to_le32(ino_next);
+ mutex_unlock(&sbi->s_ondisk_orphan_lock);
if (err)
- goto out_brelse;
- sbi->s_es->s_last_orphan = cpu_to_le32(ino_next);
+ goto brelse;
err = ext4_handle_dirty_super(handle, inode->i_sb);
} else {
struct ext4_iloc iloc2;
@@ -2670,25 +2701,26 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
jbd_debug(4, "orphan inode %lu will point to %u\n",
i_prev->i_ino, ino_next);
err = ext4_reserve_inode_write(handle, i_prev, &iloc2);
+ if (!err)
+ NEXT_ORPHAN(i_prev) = ino_next;
+ mutex_unlock(&sbi->s_ondisk_orphan_lock);
if (err)
- goto out_brelse;
- NEXT_ORPHAN(i_prev) = ino_next;
+ goto brelse;
err = ext4_mark_iloc_dirty(handle, i_prev, &iloc2);
}
if (err)
- goto out_brelse;
+ goto brelse;
NEXT_ORPHAN(inode) = 0;
+ UNLOCK_EXT4I_ORPHAN(sbi, ei);
err = ext4_mark_iloc_dirty(handle, inode, &iloc);
-
-out_err:
+out_set_err:
ext4_std_error(inode->i_sb, err);
-out:
- mutex_unlock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
return err;

-out_brelse:
+brelse:
+ UNLOCK_EXT4I_ORPHAN(sbi, ei);
brelse(iloc.bh);
- goto out_err;
+ goto out_set_err;
}

static int ext4_rmdir(struct inode *dir, struct dentry *dentry)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 710fed2..b8d4298 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3921,7 +3921,17 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
memcpy(sb->s_uuid, es->s_uuid, sizeof(es->s_uuid));

INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
- mutex_init(&sbi->s_orphan_lock);
+ spin_lock_init(&sbi->s_orphan_lock);
+ mutex_init(&sbi->s_ondisk_orphan_lock);
+ sbi->s_orphan_op_mutex = kmalloc(sizeof(struct mutex) <<
+ EXT4_ORPHAN_OP_BITS, GFP_KERNEL);
+ BUG_ON(!sbi->s_orphan_op_mutex);
+ if (sbi->s_orphan_op_mutex) {
+ int n = 1 << EXT4_ORPHAN_OP_BITS;
+
+ while (n-- > 0)
+ mutex_init(&sbi->s_orphan_op_mutex[n]);
+ }

sb->s_root = NULL;

--
1.7.11.3


Subject: Re: [PATCH v2] fs/ext4: increase parallelism in updating ext4 orphan list

On 04/02/2014 11:41 AM, Jan Kara wrote:
> Thanks for the patches and measurements! So I agree we contend a lot on
> orphan list changes in ext4. But what you do seems to be unnecessarily
> complicated and somewhat hiding the real substance of the patch. If I
> understand your patch correctly, all it does is that it does the
> preliminary work (ext4_reserve_inode_write(),
> ext4_journal_get_write_access()) without the global orphan mutex (under the
> hashed mutex).
>

Thanks Jan for the comments. Yes, doing some of the preliminary work without grabbing the global mutex is part of the patch's strategy.

> However orphan operations on a single inode are already serialized by
> i_mutex so there's no need to introduce any new hashed lock. Just add
> assertion mutex_locked(&inode->i_mutex) to ext4_orphan_add() and
> ext4_orphan_del() - you might need to lock i_mutex around the code in
> fs/ext4/migrate.c and in ext4_tmpfile() but that should be fine.
>

As you pointed out, it sounds like there may still be some code paths that do not acquire i_mutex. It would probably be better to acquire i_mutex where it is not already held.

> Also I'm somewhat failing to see what the spinlock s_orphan_lock brings us.
> I'd guess that the mutex could still protect also the in-memory list and we
> have to grab it in all the relevant cases anyway (in some rare cases we
> could avoid taking the mutex and spinlock would be enough but these
> shouldn't be performance relevant). Please correct me if I'm wrong here, I
> didn't look at the code for that long.
>

Yes, you are correct. In the error (or previous-error) case, we only need to update the in-memory orphan list, for which a spinlock seems to be a better serialization mechanism. Using a separate spinlock would also allow simultaneous operations in the error and non-error cases. As you said, if this is a very rare case, it should not make much difference. I'll rerun and compare the benchmark results using a single mutex.

> Finally (and I somewhat miss this in your patch), I'd think we might need
> to use list_empty_careful() when checking inode's orphan list without
> global orphan list lock.
>
> Honza
>

Since we already serialize orphan operations on a single inode, the only race condition is an orphan operation on another inode moving this inode within the orphan list. In that case head->next should still not equal head. But yes, it is probably safer to use list_empty_careful().

Thanks,
Mak.


2014-04-02 17:41:09

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2] fs/ext4: increase parallelism in updating ext4 orphan list

On Wed 02-04-14 10:29:08, T Makphaibulchoke wrote:
> Instead of allowing only a single atomic update (of both the in-memory and
> on-disk orphan lists) of an ext4 filesystem's orphan list via the
> s_orphan_lock mutex, this patch allows multiple concurrent updates of the
> orphan list, while still maintaining the integrity of both the in-memory and
> on-disk orphan lists across each update.
>
> This is accomplished by using a mutex from an array of mutexes, indexed by
> a hash of the inode, to serialize the orphan list updates of a single inode,
> and a mutex and a spinlock to serialize the on-disk and in-memory orphan
> lists respectively.
Thanks for the patches and measurements! So I agree we contend a lot on
orphan list changes in ext4. But what you do seems to be unnecessarily
complicated and somewhat hiding the real substance of the patch. If I
understand your patch correctly, all it does is that it does the
preliminary work (ext4_reserve_inode_write(),
ext4_journal_get_write_access()) without the global orphan mutex (under the
hashed mutex).

However orphan operations on a single inode are already serialized by
i_mutex so there's no need to introduce any new hashed lock. Just add
assertion mutex_locked(&inode->i_mutex) to ext4_orphan_add() and
ext4_orphan_del() - you might need to lock i_mutex around the code in
fs/ext4/migrate.c and in ext4_tmpfile() but that should be fine.

Also I'm somewhat failing to see what the spinlock s_orphan_lock brings us.
I'd guess that the mutex could still protect also the in-memory list and we
have to grab it in all the relevant cases anyway (in some rare cases we
could avoid taking the mutex and spinlock would be enough but these
shouldn't be performance relevant). Please correct me if I'm wrong here, I
didn't look at the code for that long.

Finally (and I somewhat miss this in your patch), I'd think we might need
to use list_empty_careful() when checking inode's orphan list without
global orphan list lock.

Honza

> memcpy(sb->s_uuid, es->s_uuid, sizeof(es->s_uuid));
>
> INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
> - mutex_init(&sbi->s_orphan_lock);
> + spin_lock_init(&sbi->s_orphan_lock);
> + mutex_init(&sbi->s_ondisk_orphan_lock);
> + sbi->s_orphan_op_mutex = kmalloc(sizeof(struct mutex) <<
> + EXT4_ORPHAN_OP_BITS, GFP_KERNEL);
> + BUG_ON(!sbi->s_orphan_op_mutex);
> + if (sbi->s_orphan_op_mutex) {
> + int n = 1 << EXT4_ORPHAN_OP_BITS;
> +
> + while (n-- > 0)
> + mutex_init(&sbi->s_orphan_op_mutex[n]);
> + }
>
> sb->s_root = NULL;
>
> --
> 1.7.11.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR

Subject: Re: [PATCH v2] fs/ext4: increase parallelism in updating ext4 orphan list

On 04/02/2014 11:41 AM, Jan Kara wrote:
> Thanks for the patches and measurements! So I agree we contend a lot on
> orphan list changes in ext4. But what you do seems to be unnecessarily
> complicated and somewhat hiding the real substance of the patch. If I
> understand your patch correctly, all it does is that it does the
> preliminary work (ext4_reserve_inode_write(),
> ext4_journal_get_write_access()) without the global orphan mutex (under the
> hashed mutex).
>
> However orphan operations on a single inode are already serialized by
> i_mutex so there's no need to introduce any new hashed lock. Just add
> assertion mutex_locked(&inode->i_mutex) to ext4_orphan_add() and
> ext4_orphan_del() - you might need to lock i_mutex around the code in
> fs/ext4/migrate.c and in ext4_tmpfile() but that should be fine.
>
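Jan's suggested assertion can be illustrated with a userspace analogue. This is a sketch only: in the kernel the check would be something like WARN_ON(!mutex_is_locked(&inode->i_mutex)) at the top of ext4_orphan_add()/ext4_orphan_del(); the pthread names below are stand-ins, not the real API.

```c
#include <pthread.h>

/* Userspace stand-in for the inode's i_mutex. */
static pthread_mutex_t i_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Approximation of the kernel's mutex_is_locked(): trylock never
 * blocks, so if it succeeds the mutex was free and we release it
 * again; if it fails (EBUSY) somebody currently holds the lock. */
static int mutex_is_locked(pthread_mutex_t *m)
{
	if (pthread_mutex_trylock(m) == 0) {
		pthread_mutex_unlock(m);	/* was free: undo our probe */
		return 0;
	}
	return 1;				/* already held */
}
```

Such a check placed at the entry of the orphan functions is what would have flagged the callers counted in the table below.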

I've done some data collection. Looks like there are more callers to ext4_orphan_add() and ext4_orphan_del() without holding the i_mutex than expected.

This is what I've found on one of my 8 core machines.

--------------------------------------------
| ext4_orphan_add | ext4_orphan_del |
--------------------------------------------
| Total | without | Total | without |
| | holding | | holding |
| | i_mutex | | i_mutex |
-------------------------------------------------------------
| First comes up | | | | |
| to multi-user | 2446 | 363 | 2081 | 1659 |
-------------------------------------------------------------
| After alltests | 23812582 | 173599 | 23853863 | 2467999 |
-------------------------------------------------------------
| After disk | 30860981 | 818255 | 30261069 | 8881905 |
-------------------------------------------------------------

Though some orphan calls are already made with i_mutex held, using the i_mutex to serialize orphan operations within a single inode seems to negate all of the performance improvement from the original patch. There seems to be no performance difference from the current implementation.

> Also I'm somewhat failing to see what the spinlock s_orphan_lock brings us.
> I'd guess that the mutex could still protect also the in-memory list and we
> have to grab it in all the relevant cases anyway (in some rare cases we
> could avoid taking the mutex and spinlock would be enough but these
> shouldn't be performance relevant). Please correct me if I'm wrong here, I
> didn't look at the code for that long.
>

The same data also shows zero calls from the error path. Re-running the benchmark with the spinlock replaced by the same on-disk orphan mutex shows no performance difference from the original patch.

I'll resubmit the patch, removing the in-memory orphan list lock and keeping the mutex array for serializing orphan operations within a single inode.
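The scheme being kept — a fixed array of mutexes indexed by a hash of the inode — can be sketched as a userspace analogue. Pthread mutexes stand in for kernel mutexes, and the multiplicative hash is illustrative of, not identical to, the kernel's hash_long():

```c
#include <pthread.h>
#include <stdint.h>

#define ORPHAN_OP_BITS  7
#define ORPHAN_OP_COUNT (1 << ORPHAN_OP_BITS)

static pthread_mutex_t orphan_op_mutex[ORPHAN_OP_COUNT];

/* Initialize the whole stripe array once, at "mount" time. */
static void orphan_locks_init(void)
{
	for (int n = 0; n < ORPHAN_OP_COUNT; n++)
		pthread_mutex_init(&orphan_op_mutex[n], NULL);
}

/* Multiplicative hash of the inode pointer down to ORPHAN_OP_BITS
 * bits, in the spirit of the kernel's hash_long(). */
static unsigned int orphan_op_index(const void *inode)
{
	uint64_t v = (uint64_t)(uintptr_t)inode;
	return (unsigned int)((v * 0x9e3779b97f4a7c15ULL) >> (64 - ORPHAN_OP_BITS));
}

static void lock_inode_orphan(const void *inode)
{
	pthread_mutex_lock(&orphan_op_mutex[orphan_op_index(inode)]);
}

static void unlock_inode_orphan(const void *inode)
{
	pthread_mutex_unlock(&orphan_op_mutex[orphan_op_index(inode)]);
}
```

Since the same inode always hashes to the same stripe, two operations on one inode serialize, while operations on different inodes usually proceed in parallel (with occasional false sharing between inodes that collide in the 128-entry table).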

Please let me know if you have any additional comments or concerns.

Thanks,
Mak.


> Finally (and I somewhat miss this in your patch), I'd think we might need
> to use list_empty_careful() when checking inode's orphan list without
> global orphan list lock.
>
> Honza
>

2014-04-14 17:40:24

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2] fs/ext4: increase parallelism in updating ext4 orphan list

On Mon 14-04-14 10:56:58, Thavatchai Makphaibulchoke wrote:
> On 04/02/2014 11:41 AM, Jan Kara wrote:
> > Thanks for the patches and measurements! So I agree we contend a lot on
> > orphan list changes in ext4. But what you do seems to be unnecessarily
> > complicated and somewhat hiding the real substance of the patch. If I
> > understand your patch correctly, all it does is that it does the
> > preliminary work (ext4_reserve_inode_write(),
> > ext4_journal_get_write_access()) without the global orphan mutex (under the
> > hashed mutex).
> >
> > However orphan operations on a single inode are already serialized by
> > i_mutex so there's no need to introduce any new hashed lock. Just add
> > assertion mutex_locked(&inode->i_mutex) to ext4_orphan_add() and
> > ext4_orphan_del() - you might need to lock i_mutex around the code in
> > fs/ext4/migrate.c and in ext4_tmpfile() but that should be fine.
> >
>
> I've done some data collection. Looks like there are more callers to
> ext4_orphan_add() and ext4_orphan_del() without holding the i_mutex than
> expected.
>
> This is what I've found on one of my 8 core machines.
>
> --------------------------------------------
> | ext4_orphan_add | ext4_orphan_del |
> --------------------------------------------
> | Total | without | Total | without |
> | | holding | | holding |
> | | i_mutex | | i_mutex |
> -------------------------------------------------------------
> | First comes up | | | | |
> | to multi-user | 2446 | 363 | 2081 | 1659 |
> -------------------------------------------------------------
> | After alltests | 23812582 | 173599 | 23853863 | 2467999 |
> -------------------------------------------------------------
> | After disk | 30860981 | 818255 | 30261069 | 8881905 |
> -------------------------------------------------------------
>
> Though some orphan calls already held the i_mutex, using the i_mutex to
> serialize orphan operations within a single inode seems to negate all of
> the performance improvement from the original patch. There seems to be
> no performance difference from the current implementation.
Thanks for trying that out! Can you please send me a patch you have been
testing? Because it doesn't quite make sense to me why using i_mutex should
be worse than using hashed locks...

> > Also I'm somewhat failing to see what the spinlock s_orphan_lock brings us.
> > I'd guess that the mutex could still protect also the in-memory list and we
> > have to grab it in all the relevant cases anyway (in some rare cases we
> > could avoid taking the mutex and spinlock would be enough but these
> > shouldn't be performance relevant). Please correct me if I'm wrong here, I
> > didn't look at the code for that long.
> >
>
> The same data also shows zero calls from the error path. Re-running the
> benchmark with the spinlock replaced by the same on-disk orphan mutex
> shows no performance difference from the original patch.
OK, at least that makes sense.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

Subject: Re: [PATCH v2] fs/ext4: increase parallelism in updating ext4 orphan list

On 04/14/2014 11:40 AM, Jan Kara wrote:
> Thanks for trying that out! Can you please send me a patch you have been
> testing? Because it doesn't quite make sense to me why using i_mutex should
> be worse than using hashed locks...
>

Thanks again for the comments.

Since i_mutex is also used to serialize other operations on an inode, using it to serialize orphan operations in the cases where it is not already held could cause contention with those other operations. As the numbers show a substantial number of orphan add and delete calls made without i_mutex held, I presume the performance degradation is due to that contention.

As for the patch, could you please let me know whether you need the patch using i_mutex or the patch I'm planning to submit? If it's the latter, I'm thinking of going ahead and resubmitting it.

Thanks,
Mak.

> OK, at least that makes sense.
>
> Honza
>

2014-04-15 17:25:25

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2] fs/ext4: increase parallelism in updating ext4 orphan list

On Tue 15-04-14 10:27:46, Thavatchai Makphaibulchoke wrote:
> On 04/14/2014 11:40 AM, Jan Kara wrote:
> > Thanks for trying that out! Can you please send me a patch you have been
> > testing? Because it doesn't quite make sense to me why using i_mutex should
> > be worse than using hashed locks...
> >
>
> Thanks again for the comments.
>
> Since i_mutex is also used to serialize other operations on an inode,
> using it to serialize orphan operations in the cases where it is not
> already held could cause contention with those other operations. As the
> numbers show a substantial number of orphan add and delete calls made
> without i_mutex held, I presume the performance degradation is due to
> that contention.
I have checked the source and I didn't find many places where i_mutex was
not held. But maybe I'm wrong. That's why I wanted to see the patch where
you are using i_mutex instead of hashed mutexes and which didn't perform
well enough.

> As for the patch, could you please let me know if you need the patch
> using i_mutex or the patch I'm planning to submit. If it's the latter,
> I'm thinking of go ahead and resubmit it.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

Subject: Re: [PATCH v2] fs/ext4: increase parallelism in updating ext4 orphan list

On 04/15/2014 11:25 AM, Jan Kara wrote:
> I have checked the source and I didn't find many places where i_mutex was
> not held. But maybe I'm wrong. That's why I wanted to see the patch where
> you are using i_mutex instead of hashed mutexes and which didn't perform
> well enough.
>

I've attached two patches. The first, 0001-Orphan-patch-using-i_mutex-and-removing-s_orphan_loc.patch, is the one you requested, using i_mutex and removing s_orphan_lock. The second, 0001-Adding-code-to-collect-i_mutex-usage-during-orphan.patch, contains the code that collects statistics on the number of orphan operations performed with and without i_mutex held; it can be applied on top of the first patch, in case you are also interested.

Please note that these two patches are meant for data collection only, as the code is not quite of submittal quality.

Please also let me know if you have any further comments or suggestions. I'll hold off submitting for a couple more days.


>
> Honza
>

Thanks,
Mak.




Attachments:
0001-Orphan-patch-using-i_mutex-and-removing-s_orphan_loc.patch (8.45 kB)
0001-Adding-code-to-collect-i_mutex-usage-during-orphan.patch (3.36 kB)
Download all attachments

2014-04-24 17:31:26

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH v3] fs/ext4: increase parallelism in updating ext4 orphan list

This patch allows more concurrent updates of the ext4 orphan list,
while still maintaining the integrity of both the in-memory and
on-disk orphan lists across each update.

This is accomplished by using a mutex from an array of mutexes, indexed
by a hash of the inode, to serialize orphan list updates of a single
inode, allowing most preliminary work to be done prior to acquiring the
mutex. The mutex is acquired only during the update of both orphan
lists, reducing its contention.

Here are some of the benchmark results with the changes.

On a 120 core machine,

---------------------------
| | % increase |
---------------------------
| alltests | 19.29 |
---------------------------
| custom | 11.10 |
---------------------------
| disk | 5.09 |
---------------------------
| fserver | 12.47 |
---------------------------
| new_dbase | 7.49 |
---------------------------
| shared | 16.55 |
---------------------------

On a 80 core machine,

---------------------------
| | % increase |
---------------------------
| alltests | 30.09 |
---------------------------
| custom | 22.66 |
---------------------------
| dbase | 3.28 |
---------------------------
| disk | 3.15 |
---------------------------
| fserver | 24.28 |
---------------------------
| high_systime| 6.79 |
---------------------------
| new_dbase | 4.63 |
---------------------------
| new_fserver | 28.40 |
---------------------------
| shared | 20.42 |
---------------------------

On a 8 core machine:

---------------------------
| | % increase |
---------------------------
| fserver | 9.11 |
---------------------------
| new_fserver | 3.45 |
---------------------------

For Swingbench dss workload,

-----------------------------------------------------------------------------
| Users | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 |
-----------------------------------------------------------------------------
| % improve- | 6.15 | 5.49 | 3.13 | 1.06 | 2.31 | 6.29 | 6.50 |-0.6 | 1.72 |
| ment with | | | | | | | | | |
| out using | | | | | | | | | |
| /dev/shm | | | | | | | | | |
-----------------------------------------------------------------------------

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
Changed in v3:
- Changed patch description
- Reverted to using a single mutex, s_ondisk_orphan_lock, to serialize
both the on-disk and in-memory orphan lists.

Changed in v2:
- New performance data
- Fixed problem in v1
- No changes in ext4_inode_info size
- Used an array of mutexes, indexed by hashing of an inode, to serialize
operations within a single inode
- Combined multiple patches into one.
---
fs/ext4/ext4.h | 4 +-
fs/ext4/namei.c | 126 ++++++++++++++++++++++++++++++++++----------------------
fs/ext4/super.c | 11 ++++-
3 files changed, 90 insertions(+), 51 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d3a534f..a348f7c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -77,6 +77,7 @@
#define EXT4_ERROR_FILE(file, block, fmt, a...) \
ext4_error_file((file), __func__, __LINE__, (block), (fmt), ## a)

+#define EXT4_ORPHAN_OP_BITS 7
/* data type for block offset of block group */
typedef int ext4_grpblk_t;

@@ -1223,7 +1224,8 @@ struct ext4_sb_info {
/* Journaling */
struct journal_s *s_journal;
struct list_head s_orphan;
- struct mutex s_orphan_lock;
+ struct mutex s_ondisk_orphan_lock;
+ struct mutex *s_orphan_op_mutex;
unsigned long s_resize_flags; /* Flags indicating if there
is a resizer */
unsigned long s_commit_interval;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index d050e04..b062e1e 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -48,6 +48,13 @@
#define NAMEI_RA_BLOCKS 4
#define NAMEI_RA_SIZE (NAMEI_RA_CHUNKS * NAMEI_RA_BLOCKS)

+#define ORPHAN_OP_INDEX(ext4_i) \
+ (hash_long((unsigned long)ext4_i, EXT4_ORPHAN_OP_BITS))
+#define LOCK_EXT4I_ORPHAN(ext4_sb, ext4_i) \
+ mutex_lock(&ext4_sb->s_orphan_op_mutex[ORPHAN_OP_INDEX(ext4_i)])
+#define UNLOCK_EXT4I_ORPHAN(ext4_sb, ext4_i) \
+ mutex_unlock(&ext4_sb->s_orphan_op_mutex[ORPHAN_OP_INDEX(ext4_i)])
+
static struct buffer_head *ext4_append(handle_t *handle,
struct inode *inode,
ext4_lblk_t *block)
@@ -2556,8 +2563,8 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
if (!EXT4_SB(sb)->s_journal)
return 0;

- mutex_lock(&EXT4_SB(sb)->s_orphan_lock);
- if (!list_empty(&EXT4_I(inode)->i_orphan))
+ LOCK_EXT4I_ORPHAN(EXT4_SB(sb), EXT4_I(inode));
+ if (!list_empty_careful(&EXT4_I(inode)->i_orphan))
goto out_unlock;

/*
@@ -2582,9 +2589,20 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
* orphan list. If so skip on-disk list modification.
*/
if (NEXT_ORPHAN(inode) && NEXT_ORPHAN(inode) <=
- (le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count)))
- goto mem_insert;
-
+ (le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))) {
+ brelse(iloc.bh);
+ mutex_lock(&EXT4_SB(sb)->s_ondisk_orphan_lock);
+ list_add(&EXT4_I(inode)->i_orphan, &EXT4_SB(sb)->
+ s_orphan);
+ mutex_unlock(&EXT4_SB(sb)->s_ondisk_orphan_lock);
+ jbd_debug(4, "superblock will point to %lu\n",
+ inode->i_ino);
+ jbd_debug(4, "orphan inode %lu will point to %d\n",
+ inode->i_ino, NEXT_ORPHAN(inode));
+ goto out_unlock;
+ }
+
+ mutex_lock(&EXT4_SB(sb)->s_ondisk_orphan_lock);
/* Insert this inode at the head of the on-disk orphan list... */
NEXT_ORPHAN(inode) = le32_to_cpu(EXT4_SB(sb)->s_es->s_last_orphan);
EXT4_SB(sb)->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
@@ -2592,24 +2610,24 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
if (!err)
err = rc;
-
- /* Only add to the head of the in-memory list if all the
- * previous operations succeeded. If the orphan_add is going to
- * fail (possibly taking the journal offline), we can't risk
- * leaving the inode on the orphan list: stray orphan-list
- * entries can cause panics at unmount time.
- *
- * This is safe: on error we're going to ignore the orphan list
- * anyway on the next recovery. */
-mem_insert:
- if (!err)
+ if (!err) {
+ /* Only add to the head of the in-memory list if all the
+ * previous operations succeeded. If the orphan_add is going to
+ * fail (possibly taking the journal offline), we can't risk
+ * leaving the inode on the orphan list: stray orphan-list
+ * entries can cause panics at unmount time.
+ *
+ * This is safe: on error we're going to ignore the orphan list
+ * anyway on the next recovery. */
list_add(&EXT4_I(inode)->i_orphan, &EXT4_SB(sb)->s_orphan);
-
- jbd_debug(4, "superblock will point to %lu\n", inode->i_ino);
- jbd_debug(4, "orphan inode %lu will point to %d\n",
+ jbd_debug(4, "superblock will point to %lu\n", inode->i_ino);
+ jbd_debug(4, "orphan inode %lu will point to %d\n",
inode->i_ino, NEXT_ORPHAN(inode));
+ }
+ mutex_unlock(&EXT4_SB(sb)->s_ondisk_orphan_lock);
+
out_unlock:
- mutex_unlock(&EXT4_SB(sb)->s_orphan_lock);
+ UNLOCK_EXT4I_ORPHAN(EXT4_SB(sb), EXT4_I(inode));
ext4_std_error(inode->i_sb, err);
return err;
}
@@ -2622,45 +2640,54 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
{
struct list_head *prev;
struct ext4_inode_info *ei = EXT4_I(inode);
- struct ext4_sb_info *sbi;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
__u32 ino_next;
struct ext4_iloc iloc;
int err = 0;

- if ((!EXT4_SB(inode->i_sb)->s_journal) &&
- !(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ORPHAN_FS))
+ if ((!sbi->s_journal) &&
+ !(sbi->s_mount_state & EXT4_ORPHAN_FS))
return 0;

- mutex_lock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
- if (list_empty(&ei->i_orphan))
- goto out;
-
- ino_next = NEXT_ORPHAN(inode);
- prev = ei->i_orphan.prev;
- sbi = EXT4_SB(inode->i_sb);
-
- jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino);
+ LOCK_EXT4I_ORPHAN(sbi, ei);
+ if (list_empty_careful(&ei->i_orphan)) {
+ UNLOCK_EXT4I_ORPHAN(sbi, ei);
+ return 0;
+ }

- list_del_init(&ei->i_orphan);

/* If we're on an error path, we may not have a valid
* transaction handle with which to update the orphan list on
* disk, but we still need to remove the inode from the linked
* list in memory. */
- if (!handle)
- goto out;
+ if (!handle) {
+ jbd_debug(4, "remove inode %lu from orphan list\n",
+ inode->i_ino);
+ mutex_lock(&sbi->s_ondisk_orphan_lock);
+ list_del_init(&ei->i_orphan);
+ mutex_unlock(&sbi->s_ondisk_orphan_lock);
+ } else
+ err = ext4_reserve_inode_write(handle, inode, &iloc);

- err = ext4_reserve_inode_write(handle, inode, &iloc);
- if (err)
- goto out_err;
+ if (!handle || err) {
+ UNLOCK_EXT4I_ORPHAN(sbi, ei);
+ goto out_set_err;
+ }

+ mutex_lock(&sbi->s_ondisk_orphan_lock);
+ ino_next = NEXT_ORPHAN(inode);
+ prev = ei->i_orphan.prev;
+ jbd_debug(4, "remove inode %lu from orphan list\n", inode->i_ino);
+ list_del_init(&ei->i_orphan);
if (prev == &sbi->s_orphan) {
jbd_debug(4, "superblock will point to %u\n", ino_next);
BUFFER_TRACE(sbi->s_sbh, "get_write_access");
err = ext4_journal_get_write_access(handle, sbi->s_sbh);
+ if (!err)
+ sbi->s_es->s_last_orphan = cpu_to_le32(ino_next);
+ mutex_unlock(&sbi->s_ondisk_orphan_lock);
if (err)
- goto out_brelse;
- sbi->s_es->s_last_orphan = cpu_to_le32(ino_next);
+ goto brelse;
err = ext4_handle_dirty_super(handle, inode->i_sb);
} else {
struct ext4_iloc iloc2;
@@ -2670,25 +2697,26 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
jbd_debug(4, "orphan inode %lu will point to %u\n",
i_prev->i_ino, ino_next);
err = ext4_reserve_inode_write(handle, i_prev, &iloc2);
+ if (!err)
+ NEXT_ORPHAN(i_prev) = ino_next;
+ mutex_unlock(&sbi->s_ondisk_orphan_lock);
if (err)
- goto out_brelse;
- NEXT_ORPHAN(i_prev) = ino_next;
+ goto brelse;
err = ext4_mark_iloc_dirty(handle, i_prev, &iloc2);
}
if (err)
- goto out_brelse;
+ goto brelse;
NEXT_ORPHAN(inode) = 0;
+ UNLOCK_EXT4I_ORPHAN(sbi, ei);
err = ext4_mark_iloc_dirty(handle, inode, &iloc);
-
-out_err:
+out_set_err:
ext4_std_error(inode->i_sb, err);
-out:
- mutex_unlock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
return err;

-out_brelse:
+brelse:
+ UNLOCK_EXT4I_ORPHAN(sbi, ei);
brelse(iloc.bh);
- goto out_err;
+ goto out_set_err;
}

static int ext4_rmdir(struct inode *dir, struct dentry *dentry)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 710fed2..a4275d1 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3921,7 +3921,16 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
memcpy(sb->s_uuid, es->s_uuid, sizeof(es->s_uuid));

INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
- mutex_init(&sbi->s_orphan_lock);
+ mutex_init(&sbi->s_ondisk_orphan_lock);
+ sbi->s_orphan_op_mutex = kmalloc(sizeof(struct mutex) <<
+ EXT4_ORPHAN_OP_BITS, GFP_KERNEL);
+ BUG_ON(!sbi->s_orphan_op_mutex);
+ if (sbi->s_orphan_op_mutex) {
+ int n = 1 << EXT4_ORPHAN_OP_BITS;
+
+ while (n-- > 0)
+ mutex_init(&sbi->s_orphan_op_mutex[n]);
+ }

sb->s_root = NULL;

--
1.7.11.3

2014-04-30 10:10:47

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3] fs/ext4: increase parallelism in updating ext4 orphan list

On Thu 24-04-14 11:31:26, T Makphaibulchoke wrote:
> This patch allows more concurrent updates of the ext4 orphan list,
> while still maintaining the integrity of both the in-memory and
> on-disk orphan lists across each update.
>
> This is accomplished by using a mutex from an array of mutexes, indexed
> by a hash of the inode, to serialize orphan list updates of a single
> inode, allowing most preliminary work to be done prior to acquiring the
> mutex. The mutex is acquired only during the update of both orphan
> lists, reducing its contention.
>
> Here are some of the benchmark results with the changes.
>
> On a 120 core machine,
Just for record I think we can just remove the hashed locks in your patch
and things should work - see patches I've submitted yesterday. But please
check whether you see similar performance numbers with them as I may have
missed some aspect of your patch.

Honza
> if (err)
> - goto out_brelse;
> - NEXT_ORPHAN(i_prev) = ino_next;
> + goto brelse;
> err = ext4_mark_iloc_dirty(handle, i_prev, &iloc2);
> }
> if (err)
> - goto out_brelse;
> + goto brelse;
> NEXT_ORPHAN(inode) = 0;
> + UNLOCK_EXT4I_ORPHAN(sbi, ei);
> err = ext4_mark_iloc_dirty(handle, inode, &iloc);
> -
> -out_err:
> +out_set_err:
> ext4_std_error(inode->i_sb, err);
> -out:
> - mutex_unlock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
> return err;
>
> -out_brelse:
> +brelse:
> + UNLOCK_EXT4I_ORPHAN(sbi, ei);
> brelse(iloc.bh);
> - goto out_err;
> + goto out_set_err;
> }
>
> static int ext4_rmdir(struct inode *dir, struct dentry *dentry)
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 710fed2..a4275d1 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -3921,7 +3921,16 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> memcpy(sb->s_uuid, es->s_uuid, sizeof(es->s_uuid));
>
> INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
> - mutex_init(&sbi->s_orphan_lock);
> + mutex_init(&sbi->s_ondisk_orphan_lock);
> + sbi->s_orphan_op_mutex = kmalloc(sizeof(struct mutex) <<
> + EXT4_ORPHAN_OP_BITS, GFP_KERNEL);
> + BUG_ON(!sbi->s_orphan_op_mutex);
> + if (sbi->s_orphan_op_mutex) {
> + int n = 1 << EXT4_ORPHAN_OP_BITS;
> +
> + while (n-- > 0)
> + mutex_init(&sbi->s_orphan_op_mutex[n]);
> + }
>
> sb->s_root = NULL;
>
> --
> 1.7.11.3
>
--
Jan Kara <[email protected]>
SUSE Labs, CR