Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751491AbbEZTOa (ORCPT ); Tue, 26 May 2015 15:14:30 -0400 Received: from mx1.redhat.com ([209.132.183.28]:53320 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751015AbbEZTO3 (ORCPT ); Tue, 26 May 2015 15:14:29 -0400 Date: Tue, 26 May 2015 21:13:43 +0200 From: Oleg Nesterov To: Linus Torvalds Cc: Peter Zijlstra , Paul McKenney , Tejun Heo , Ingo Molnar , Linux Kernel Mailing List , der.herr@hofr.at, Davidlohr Bueso Subject: Re: [RFC][PATCH 0/5] Optimize percpu-rwsem Message-ID: <20150526191343.GA5794@redhat.com> References: <20150526114356.609107918@infradead.org> <20150526185727.GA3763@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150526185727.GA3763@redhat.com> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 2500 Lines: 77 On 05/26, Oleg Nesterov wrote: > > On 05/26, Linus Torvalds wrote: > > > > > > We literally have one single percpu-rwsem IN THE WHOLE KERNEL TREE. > > > > One. > > Well. IIRC Tejun is going to turn signal_struct->group_rwsem into > percpu-rwsem. > > And it can have more users. Say, __sb_start_write/etc does something > similar, and last time I checked this code it looked buggy to me. I have found my old email, see below. Perhaps this code was changed since 2013 when I sent this email, I didn't verify... but in any case this logic doesn't look simple, imo it would be nice to rely on the generic helpers from kernel/locking. Oleg. ------------------------------------------------------------------------------ When I look at __sb_start_write/etc I am not sure this locking is correct. OK, __sb_start_write() does: percpu_counter_inc(); mb(); if (sb->s_writers.frozen) abort_and_retry; freeze_super() does STORE + mb + LOAD in reverse order so either __sb_start_write() must see SB_FREEZE_WRITE or freeze_super() must see the change in ->s_writers.counter. This is correct. Still I am not sure sb_wait_write() can trust percpu_counter_sum(), because it can also see _other_ changes. To simplify the discussion, suppose that percpu_counter doesn't have lock/count/batch/whatever and inc/dec/sum only uses "__percpu *counters". Lets denote sb->s_writers.counter[level] as CTR[cpu]. Suppose that some thread did __sb_start_write() on CPU_1 and sleeps "forever". CTR[0] == 0, CTR_[1] == 1, freezer_super() should block. Now: 1. freeze_super() sets SB_FREEZE_WRITE, does mb(), and starts sb_wait_write()->percpu_counter_sum(). 2. __percpu_counter_sum() does for_each_online_cpu(), reads CTR[0] == 0. ret = 0. 3. Another thread comes, calls __sb_start_write() on CPU_0, increments CTR[0]. Then it notices sb->s_writers.frozen >= level and starts __sb_end_write() before retry. Then it migrates to CPU_1. And decrements CTR[1] before __percpu_counter_sum() reads it. So CTR[0] == 1, CTR[1] == 0. Everything is fine except sb_wait_write() has already read CTR[0]. 4. __percpu_counter_sum() continues, reads CTR[1] == 0 and returns ret == 0. sb_wait_write() returns while it should not? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/