Subject: Re: [PATCH v4 14/16] locking/rwsem: Guard against making count negative
To: Peter Zijlstra
Cc: Linus Torvalds, Ingo Molnar, Will Deacon, Thomas Gleixner,
    Linux List Kernel Mailing, the arch/x86 maintainers,
    Davidlohr Bueso, Tim Chen, huang ying
From: Waiman Long
Organization: Red Hat
Message-ID: <159efb9b-87df-151f-28df-42407592ea3f@redhat.com>
Date: Wed, 24 Apr 2019 13:10:17 -0400
In-Reply-To: <20190424170148.GR12232@hirez.programming.kicks-ass.net>

On 4/24/19 1:01 PM, Peter Zijlstra wrote:
> On Wed, Apr 24, 2019 at 12:49:05PM -0400, Waiman Long wrote:
>> On 4/24/19 3:09 AM, Peter Zijlstra wrote:
>>> On Tue, Apr 23, 2019 at 03:12:16PM -0400, Waiman Long wrote:
>>>> That is true in general, but doing preempt_disable/enable across
>>>> function boundary is ugly and prone to further problems down the road.
>>> We do worse things in this code, and the thing Linus proposes is
>>> actually quite simple, something like so:
>>>
>>> ---
>>> --- a/kernel/locking/rwsem.c
>>> +++ b/kernel/locking/rwsem.c
>>> @@ -912,7 +904,7 @@ rwsem_down_read_slowpath(struct rw_semap
>>>  			raw_spin_unlock_irq(&sem->wait_lock);
>>>  			break;
>>>  		}
>>> -		schedule();
>>> +		schedule_preempt_disabled();
>>>  		lockevent_inc(rwsem_sleep_reader);
>>>  	}
>>>
>>> @@ -1121,6 +1113,7 @@ static struct rw_semaphore *rwsem_downgr
>>>   */
>>>  inline void __down_read(struct rw_semaphore *sem)
>>>  {
>>> +	preempt_disable();
>>>  	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
>>>  			&sem->count) & RWSEM_READ_FAILED_MASK)) {
>>>  		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
>>> @@ -1129,10 +1122,12 @@ inline void __down_read(struct rw_semaph
>>>  	} else {
>>>  		rwsem_set_reader_owned(sem);
>>>  	}
>>> +	preempt_enable();
>>>  }
>>>
>>>  static inline int __down_read_killable(struct rw_semaphore *sem)
>>>  {
>>> +	preempt_disable();
>>>  	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
>>>  			&sem->count) & RWSEM_READ_FAILED_MASK)) {
>>>  		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
>>> @@ -1142,6 +1137,7 @@ static inline int __down_read_killable(s
>>>  	} else {
>>>  		rwsem_set_reader_owned(sem);
>>>  	}
>>> +	preempt_enable();
>>>  	return 0;
>>>  }
>>>
>> Making that change will help the slowpath to have fewer preemption
>> points.
> That doesn't matter, right? Either it blocks or it goes through quickly.
>
> If you're worried about a particular spot we can easily put in explicit
> preemption points.
>
>> For an uncontended rwsem, this offers no real benefit. Adding
>> preempt_disable() is more complicated than I originally thought.
> I'm not sure I get your objection?
>
>> Maybe we are too paranoid about the possibility of a large number of
>> preemptions happening just at the right moment.
>> If p is the probability of a preemption in the middle of the
>> inc-check-dec sequence, which I have already moved as close to each
>> other as possible, we are talking about a probability of p^32768.
>> Since p will be really small, the compound probability will be
>> infinitesimally small.
> Sure; but we run on many millions of machines every second, so the
> actual accumulated chance of it happening eventually is still fairly
> significant.
>
>> So I would like to not do preemption now for the current patchset. We
>> can restart the discussion later on if there is a real concern that it
>> may actually happen. Please let me know if you still want to add
>> preempt_disable() for the read lock.
> I like provably correct schemes over prayers.

I am fine with adding preempt_disable(). I just want confirmation that
you want to have that.

> As you noted, distros don't usually ship with PREEMPT=y and therefore
> will not be bothered much by any of this.
>
> The old scheme basically worked by the fact that the total supported
> reader count was higher than the number of addressable pages in the
> system and therefore the overflow could not happen.
>
> We now transition to number of CPUs, and for that we pay a little price
> with PREEMPT=y kernels. Either that or cmpxchg.

I also thought about switching to a cmpxchg loop for PREEMPT=y kernels.
Let's start with just preempt_disable() for now. We can evaluate the
cmpxchg loop alternative later on.

Cheers,
Longman
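For readers following the thread: the cmpxchg-loop alternative mentioned above could be modeled in userspace roughly as below. This is a hypothetical sketch using C11 atomics, not the actual kernel code; the names `down_read_fastpath_cmpxchg`, `up_read_fastpath`, and the cap `READER_COUNT_MAX` are made up for illustration. The point is that, unlike the unconditional inc-check-dec fast path, a compare-exchange loop checks the would-be new value before publishing it, so no sequence of preemptions can ever push the reader count past the cap.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical constants, loosely modeled on the rwsem count layout
 * discussed in the thread: each reader adds READER_BIAS, and the total
 * is capped so the overflow scenario above cannot occur. */
#define READER_BIAS      256L
#define READER_COUNT_MAX (READER_BIAS * 32768L)

/* Reader fast path as a cmpxchg loop: refuse to exceed the cap.
 * Returns true on success; false means the caller must fall back
 * to a slow path (not modeled here). */
static bool down_read_fastpath_cmpxchg(atomic_long *count)
{
	long old = atomic_load_explicit(count, memory_order_relaxed);

	for (;;) {
		if (old + READER_BIAS > READER_COUNT_MAX)
			return false;	/* would overflow: take slow path */
		/* On CAS failure, 'old' is reloaded with the current
		 * value and we retry with the fresh snapshot. */
		if (atomic_compare_exchange_weak_explicit(count, &old,
							  old + READER_BIAS,
							  memory_order_acquire,
							  memory_order_relaxed))
			return true;
	}
}

/* Reader release: plain decrement, no cap check needed. */
static void up_read_fastpath(atomic_long *count)
{
	atomic_fetch_sub_explicit(count, READER_BIAS, memory_order_release);
}
```

The trade-off, as noted above, is that a cmpxchg loop can livelock-retry under heavy contention where an unconditional fetch_add always succeeds in one atomic operation; that is the price being weighed against preempt_disable().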