From: Waiman Long
To: Jaegeuk Kim
Cc: Tim Murray, Christoph Hellwig, LKML, linux-f2fs-devel@lists.sourceforge.net,
    Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Subject: Re: [PATCH] f2fs: move f2fs to use reader-unfair rwsems
Date: Tue, 11 Jan 2022 17:01:06 -0500
Message-ID: <4a2352a9-42b4-56cd-423a-825faffcd801@redhat.com>
References: <20220108164617.3130175-1-jaegeuk@kernel.org>
 <9efbbcb7-29cd-a8ab-0632-01986edc862f@redhat.com>
 <86891228-9c91-09f1-0e2d-0a3392649d52@redhat.com>

On 1/11/22 12:07, Jaegeuk Kim wrote:
> On 01/11, Waiman Long wrote:
>> On 1/11/22 01:53, Tim Murray wrote:
>>> On Mon, Jan 10, 2022 at 8:15 PM Waiman Long wrote:
>>>> That is not how rwsem works. A reader which fails to get the lock
>>>> because it is write-locked will remove its reader count before going
>>>> to sleep, so the reader count will be zero eventually. Of course,
>>>> there is a short period of time where the reader count will be
>>>> non-zero until the reader removes its own reader count. So if a new
>>>> writer comes in at that time, it will fail its initial trylock and
>>>> probably go into optimistic spinning mode. If the writer that owns
>>>> the lock releases it at the right moment, the reader may acquire the
>>>> read lock.
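
(To make the window described above concrete, here is a much-simplified
sketch of the reader slowpath behavior. It is not the actual code in
kernel/locking/rwsem.c; sketch_down_read_slowpath() and
sketch_wait_for_wakeup() are illustrative placeholders, although
RWSEM_READER_BIAS and the atomic sem->count do exist in the real
implementation.)

/* Simplified sketch only -- not the real rwsem read slowpath. */
static void sketch_down_read_slowpath(struct rw_semaphore *sem)
{
	/*
	 * The down_read() fast path has already added RWSEM_READER_BIAS
	 * to sem->count before finding the lock write-locked.
	 */

	/* Back out our reader count before going to sleep ... */
	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);

	/*
	 * ... so the reader count eventually returns to zero. A writer
	 * arriving in the short window before the subtraction sees a
	 * non-zero count, fails its initial trylock, and typically falls
	 * back to optimistic spinning.
	 */
	sketch_wait_for_wakeup(sem);	/* illustrative placeholder */
}
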
>>> Thanks for the correction, that makes sense. I haven't spent too much
>>> time on rwsem internals and I'm not confident about when flags are set
>>> and cleared in sem->count; is there a case where sem->count after
>>> up_write() could be nonzero?
>>>
>>> An example from one trace:
>>>
>>> 1. Low-priority userspace thread 4764 is blocked in f2fs_unlink,
>>> probably at f2fs_lock_op, which is a wrapper around
>>> down_read(cp_rwsem).
>>> 2. f2fs-ckpt runs at t=0ms and wakes thread 4764, making it runnable.
>>> 3. At t=1ms, f2fs-ckpt enters uninterruptible sleep and blocks at
>>> rwsem_down_write_slowpath per sched_blocked_reason.
>>> 4. At t=26ms, thread 4764 runs for the first time since being made
>>> runnable. Within 40us, thread 4764 unblocks f2fs-ckpt and makes it
>>> runnable.
>>>
>>> Since thread 4764 is awakened by f2fs-ckpt but never runs before it
>>> unblocks f2fs-ckpt in down_write_slowpath(), the only idea I had is
>>> that cp_rwsem->count is nonzero after f2fs-ckpt's up_write() in step 2
>>> (maybe because of rwsem_mark_wake()?).
>>>
>>>> I do have a question about the number of readers in such a case
>>>> compared with the number of writers. Are there a large number of
>>>> low-priority readers hanging around? What is the average read lock
>>>> hold time?
>>>>
>>>> Blocking for 9.7s for a write lock is quite excessive and we need to
>>>> figure out how this happened.
>>> Just to be 100% clear, it's not a single 9.7s stall, it's many smaller
>>> stalls of 10-500+ms in f2fs-ckpt that add up to 9.7s over that range.
>>>
>>> f2fs is not my area of expertise, but my understanding is that
>>> cp_rwsem in f2fs has many (potentially unbounded) readers and a single
>>> writer. Arbitrary userspace work (fsync, creating/deleting/truncating
>>> files, atomic writes) may grab the read lock, but assuming the
>>> merge_checkpoint option is enabled, only f2fs-ckpt will ever grab the
>>> write lock during normal operation. However, in this particular
>>> example, it looks like there may have been 5-10 threads blocked on
>>> f2fs-ckpt that were awakened alongside thread 4764 in step 2.
>>>
>>> I'll defer to the f2fs experts on the average duration that the read
>>> lock is held.
>> Thanks for the explanation.
>>
>> Another question that I have is whether the test result is based on the
>> latest upstream kernel or an earlier kernel version. We used to allow
>> reader optimistic spinning, which was then removed in later kernels.
>> Reader optimistic spinning may further increase writer wait time.
> It's on the 5.10 kernel with all the upstream f2fs patches, and yes, we
> wanted to give the writer higher priority over many readers, since the
> writer, checkpoint, is the most latency-critical operation that can
> block all the other filesystem operations.

The v5.10 kernel still has reader optimistic spinning enabled in rwsem,
which may have worsened the writer wait time. Could you try a more
up-to-date kernel, or backport the relevant rwsem patches into your test
kernel, to see how much that helps?

>> Anyway, AFAICS, this patch keeps readers out of the rwsem wait queue,
>> so only writers can go into it. We can make an unfair rwsem that gives
>> preference to writers in the wait queue and wakes up readers only if
>> there are no more writers in the wait queue or even in the optimistic
>> spinning queue. That should achieve the same effect as this patch.
> Can we get a patch for the variant to test a bit? Meanwhile, I think we
> can merge this patch to add a wrapper first and switch to it later?
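
(For reference, the kind of wrapper being discussed could look roughly
like the sketch below. This is a simplified illustration, not the actual
patch: it assumes readers retry a trylock outside the rwsem wait queue
and are woken after each up_write(); the struct layout and helper names
here are assumptions for illustration.)

/* Simplified sketch of a reader-unfair rwsem wrapper (illustrative only). */
struct f2fs_rwsem {
	struct rw_semaphore internal_rwsem;
	wait_queue_head_t read_waiters;
};

static inline void f2fs_init_rwsem(struct f2fs_rwsem *sem)
{
	init_rwsem(&sem->internal_rwsem);
	init_waitqueue_head(&sem->read_waiters);
}

static inline void f2fs_down_read(struct f2fs_rwsem *sem)
{
	/*
	 * Readers never join the rwsem wait queue; they sleep here until
	 * a trylock succeeds, so a queued writer cannot be starved by a
	 * stream of incoming readers.
	 */
	wait_event(sem->read_waiters, down_read_trylock(&sem->internal_rwsem));
}

static inline void f2fs_up_read(struct f2fs_rwsem *sem)
{
	up_read(&sem->internal_rwsem);
}

static inline void f2fs_down_write(struct f2fs_rwsem *sem)
{
	down_write(&sem->internal_rwsem);
}

static inline void f2fs_up_write(struct f2fs_rwsem *sem)
{
	up_write(&sem->internal_rwsem);
	/* Let any readers parked above retry their trylock. */
	wake_up_all(&sem->read_waiters);
}
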
Give me a week or so and I can make an RFC patch to support unfair rwsem
for you to try out. You do need to try it on the latest kernel, though.

Cheers,
Longman