Date: Fri, 29 Apr 2022 11:02:09 +0200
Message-ID: <89e663fd-beea-3d4f-f8e9-5ecee31102eb@uls.co.za>
From: Jaco Kroon
To: LKML
Subject: FUSE: serialized readdir
Organization: Ultimate Linux Solutions (Pty) Ltd

Hi All,

IMPORTANT DISCLAIMER:  Please note that I'm by no means a specialist when it comes to filesystems, so I may be well off target here.

What we are experiencing:  concurrent readdir() operations on a single folder are resulting in really, really bad performance on glusterfs-backed FUSE filesystems.  As long as only a single process is iterating a specific folder (which can trivially contain upwards of 100k files), everything is reasonable (not necessarily great, but good enough), but the moment concurrent readdir sequences happen, things go for a stroll: all but one of the processes end up in uninterruptible wait, and basically things just go downhill until we manage to back things off, after which, given adequate time, it (usually) recovers.

I was hammering the glusterfs project on this, with no joy, and glusterfs for one is known for "poor readdir performance", or excessively bad "small file performance" (and I doubt one can get much worse than maildir++ structures in this regard) ... in our case we can pin our performance issues for this particular filesystem entirely on readdir().  readdir() performance internal to glusterfs can probably do with a lot of work too; however, I am starting to think that this is a death-by-multiple-cuts scenario, and this email is intended to get some discussion going around one of these potential cuts.

From reading the code under fs/fuse/ (and having no real clue about the larger filesystem core code), my understanding is that all readdir() (and related, e.g. getdents64) calls to FUSE filesystems come through readdir.c.

The file_operations structure is not particularly well commented, but filesystems/vfs.rst gives some reasonable explanations.  FUSE sets .iterate_shared, not .iterate - the former's description being that the filesystem supports "concurrent dir iterators".  Whether that concurrency is at folder level or filesystem level is not explicitly stated, but I'm going to go with filesystem level since that's how it's literally worded, and if it were per-folder then FUSE could simply have used .iterate rather than implementing a mutex in private_data.

FUSE seems to serialize on a per-directory basis.  This happens in fuse_readdir(), where a mutex attached to the struct file (via private_data) is taken (rough excerpt below); one process thus enters the code, first attempting a cached read (if permitted) and then an uncached one (which goes out to userspace, blocking all other readdir()s on the same folder until userspace has responded).

The request going out to userspace (FUSE_READDIRPLUS or simply FUSE_READDIR) already contains the position at which the read needs to happen, so does it really make sense to serialize readdir()s to one per folder, other than perhaps for cache management?

Is there any way to get rid of this serialization?  Or how can I go about figuring out the caching sequence?  Keeping this cache makes a lot of sense in order to avoid calls out to userspace, but it seems that if multiple threads are doing an "initial" uncached scan, things get really messy and end up causing lots of breakage, especially on larger folders.
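For reference, the serialization I'm referring to, as I read it (paraphrased and trimmed down from fs/fuse/dir.c and fs/fuse/readdir.c, so the details may be slightly off):

/* fs/fuse/dir.c: directories register .iterate_shared, not .iterate */
static const struct file_operations fuse_dir_operations = {
	.llseek		= generic_file_llseek,
	.read		= generic_read_dir,
	.iterate_shared	= fuse_readdir,
	...
};

/* fs/fuse/readdir.c: all iterators on this open directory file go
 * through ff->readdir.lock, so only one readdir (cached or uncached)
 * can be in flight to userspace at a time */
int fuse_readdir(struct file *file, struct dir_context *ctx)
{
	struct fuse_file *ff = file->private_data;
	int err;
	...
	mutex_lock(&ff->readdir.lock);

	err = UNCACHED;
	if (ff->open_flags & FOPEN_CACHE_DIR)
		err = fuse_readdir_cached(file, ctx);
	if (err == UNCACHED)
		err = fuse_readdir_uncached(file, ctx);

	mutex_unlock(&ff->readdir.lock);
	return err;
}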
Isn't there already a cache at a "higher" layer?  I suspect that because the two readers are unlikely to be around the same location w.r.t. the directory offset (ctx->pos), lines 447 through 450 end up resetting the position to 0 - i.e. the start of the scan - thus effectively clobbering the cache continuously and restarting reads from userspace at position 0 repeatedly (a rough paraphrase of the lines I mean is in the P.S. below).

Anyway, I honestly don't understand the code in detail, especially the cached path; I wouldn't mind to, but at this stage it feels a bit over my head.  I may very well be wrong about the above, in which case please do point me at something I can look at to better understand how this works.

Other operations are within acceptable parameters (w.r.t. response times), and we're otherwise comfortably dealing with around 2.5TB of data spread over 10m files in total through FUSE.  Other operations do suffer too once we get into one of these readdir() loop problems: even a simple LOOKUP operation (on an unrelated folder) can at these times trivially take up to 60 seconds, where it is normally in the millisecond range.

We are currently running a slightly older kernel: after having given 5.17.1 a shot we've backed down to 5.8.14 again, since we know that works (5.17.1 was somehow just annoying us; we raised the one issue and workaround here, and a bugfix was made thanks to the netfilter team, but other things felt off too which we couldn't quite pinpoint).  Looking at the changelogs for 5.17.2 through .5 there are a few possible explanations, but none that jumps out at me and screams "this is it".  We still run 5.17.1 on one of the hosts that's slightly "out of band" and doesn't affect our critical path; I'm happy to upgrade that to 5.17.5, as well as to test other patches relating to this.

Kind Regards,
Jaco
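P.S. The cached-path reset I mention above looks roughly like this in my reading of fuse_readdir_cached() in fs/fuse/readdir.c (paraphrased from memory, so the line numbers and field names may not match your tree exactly):

/* around lines 447-450: if this caller's ctx->pos doesn't match where
 * the cache stream currently is, drop the stream back to the start --
 * which is what I suspect two interleaved readers keep doing to each
 * other */
if (ff->readdir.pos != ctx->pos) {
	ff->readdir.pos = 0;
	ff->readdir.cache_off = 0;
}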