From: Niu Yawei <yawei.niu@gmail.com>
Subject: Re: [PATCH] quota: remove dqptr_sem for scalability
Date: Fri, 23 May 2014 11:37:43 +0800
Message-ID: <537EC287.70002@gmail.com>
References: <537DD5BA.1050105@gmail.com> <20140522132556.GE7999@quack.suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
	yawei.niu@intel.com, andreas.dilger@intel.com, lai.siyao@intel.com
To: Jan Kara <jack@suse.cz>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
In-Reply-To: <20140522132556.GE7999@quack.suse.cz>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-ext4.vger.kernel.org

Thanks for your review, Jack.
>   Hello,
>
>   thanks for the work! I have some comments below.
>
> On Thu 22-05-14 18:47:22, Niu Yawei wrote:
>> There are several global locks in the VFS quota code which hurts
>> performance a lot when quota accounting enabled, dqptr_sem is the major one.
>>
>> This patch tries to make the VFS quota code scalable with minimal changes.
>>
>> Following tests (mdtest & dbench) were running over ext4 fs in a
>> centos6.5 vm (8 cpus, 4G mem, kenrel: 3.15.0-rc5+), and the result shows
>> the patch relieved the lock congestion a lot.
>>
> Snipped performance results - thanks for those but first let's concentrate
> on correctness side of things.
>
>> [PATCH] quota: remove dqptr_sem for scalability
>>
>> Remove dqptr_sem (but kept in struct quota_info to keep kernel ABI
>> unchanged), and the functionality of this lock is implemented by
>> other locks:
>> * i_dquot is protected by i_lock, however only this pointer, the
>>   content of this struct is by dq_data_lock.
>> * Q_GETFMT is now protected with dqonoff_mutex instead of dqptr_sem.
>> * Small changes in __dquot_initialize() to avoid unnecessary
>>   dqget()/dqput() calls.
>   Each of these three steps should be a separate patch please.
Ok, sure.
>
>   Now regarding each of these steps: Using i_lock for protection of dquot
> pointers doesn't quite work. You have e.g.:
>> @@ -1636,12 +1646,12 @@ int __dquot_alloc_space(struct inode *inode, qsize_t number, int flags)
>>  	}
>>  	inode_incr_space(inode, number, reserve);
>>  	spin_unlock(&dq_data_lock);
>> +	spin_unlock(&inode->i_lock);
>>  
>>  	if (reserve)
>>  		goto out_flush_warn;
>>  	mark_all_dquot_dirty(dquots);
>>  out_flush_warn:
>> -	up_read(&sb_dqopt(inode->i_sb)->dqptr_sem);
>>  	flush_warnings(warn);
>>  out:
>>  	return ret;
>   So you release protection of dquot pointers from inode before calling
> mark_all_dquot_dirty(). So dquot pointers can be removed by
> remove_inode_dquot_ref() while mark_all_dquot_dirty() works on them. That's
> wrong and can lead to use after free.
Indeed, I didn't realise that the getting refcount code has been removed
from these functions,
is it ok to add it back?
>
> Quota code uses dqptr_sem to provide exclusion for three cases:
> * dquot_init()
> * dquot_transfer()
> * various places just reading dquot pointers to update allocation
>   information
> * remove_dquot_ref() (called from quotaoff code)
>
> Now exclusion against remove_dquot_ref() is relatively easy to deal with.
> We can create srcu for dquots, whoever looks at dquot pointers from inode
> takes srcu_read_lock() (so you basically convert read side of dqptr_sem 
> and write side in dquot_init() and dquot_transfer() to that) and use
> synchronize_srcu() in remove_dquot_ref() to make sure all users are done
> before starting to remove dquot pointers. You'll need to move
> dquot_active() checks inside srcu_read_lock() to make this reliable but that
> should be easy.
Ok, but adding back the refcounting code looks more straightforward to
me, may I add them back?
> What remains to deal with is an exclusion between the remaining places.
> dquot_init(). dq_data_lock spinlock should already provide the necessary
> exclusion between readers of allocation pointers and dquot_transfer() (that
> spinlock would actually be a good candidate to a conversion to per-inode
> one - using i_lock for this should noticeably reduce the contention - but
> that's the next step). dquot_init() doesn't take the spinlock so far but
> probably we can make it to take it for simplicity for now.
Using global lock in dquot_initalize() will drop the performance a lot
(because it could
be called several times for a single create/unlink operation), so I'm
afraid that we have to
use per-inode lock (i_lock) to serialize them.
>
>>  static void __dquot_initialize(struct inode *inode, int type)
>>  {
>> -	int cnt;
>> -	struct dquot *got[MAXQUOTAS];
>> +	int cnt, dq_get = 0;
>> +	struct dquot *got[MAXQUOTAS] = { NULL, NULL };
>>  	struct super_block *sb = inode->i_sb;
>>  	qsize_t rsv;
>>  
>> -	/* First test before acquiring mutex - solves deadlocks when we
>> -         * re-enter the quota code and are already holding the mutex */
>>  	if (!dquot_active(inode))
>>  		return;
>>  
>> -	/* First get references to structures we might need. */
>> +	/* In most case, the i_dquot should have been initialized, except
>> +	 * the newly allocated one. We'd always try to skip the dqget() and
>> +	 * dqput() calls to avoid unnecessary global lock contention. */
>> +	if (!(inode->i_state & I_NEW))
>> +		goto init_idquot;
>   The optimization is a good idea but dquot_init() is often called for
> !I_NEW inodes when the initialization is necessary. So I'd rather first
> check whether relevant i_dquot[] pointers are != NULL before taking any
> lock. If yes, we are done, otherwise we grab pointers to dquots, take the
> lock and update the pointers.
Ok, thank you.
>
> 								Honza