Date: Wed, 11 Mar 2020 09:14:06 +1100
From: Dave Chinner
To: "Rantala, Tommi T. (Nokia - FI/Espoo)"
Cc: "darrick.wong@oracle.com", "linux-xfs@vger.kernel.org", "linux-kernel@vger.kernel.org", "hch@lst.de"
Subject: Re: 5.5 XFS getdents regression?
Message-ID: <20200310221406.GO10776@dread.disaster.area>
In-Reply-To: <72c5fd8e9a23dde619f70f21b8100752ec63e1d2.camel@nokia.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Mar 10, 2020 at 08:45:58AM +0000, Rantala, Tommi T. (Nokia - FI/Espoo) wrote:
> Hello,
>
> One of my GitLab CI jobs stopped working after upgrading server
> 5.4.18-100.fc30.x86_64 -> 5.5.7-100.fc30.x86_64.
> (tested 5.5.8-100.fc30.x86_64 too, no change)
> The server is fedora30 with XFS rootfs.
> The problem reproduces always, and takes only couple minutes to run.
>
> The CI job fails in the beginning when doing "git clean" in docker
> container, and failing to rmdir some directory:
> "warning: failed to remove
> .vendor/pkg/mod/golang.org/x/net@v0.0.0-20200114155413-6afb5195e5aa/internal/socket:
> Directory not empty"
>
> Quick google search finds some other people reporting similar problems
> with 5.5.0:
> https://gitlab.com/gitlab-org/gitlab-runner/issues/3185

Which appears to be caused by multiple gitlab processes modifying
the directory at the same time. i.e. something is adding an entry to
the directory at the same time something is trying to rm -rf it.
That's a race condition, and would lead to the exact symptoms you
see here, depending on where in the directory the new entry is
added.
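
If a concurrent writer really is the cause, one userspace-side mitigation is to re-read the directory and retry when rmdir() reports that it is not empty. The following is only a minimal sketch of that idea. The helper name remove_tree_with_retry and the max_attempts parameter are made up for illustration, and this is not how "git clean" or gitlab-runner work. It also ignores entries that disappear underneath the scan (FileNotFoundError), which a real tool would have to handle.

```python
import errno
import os
import tempfile


def remove_tree_with_retry(path, max_attempts=5):
    """Remove a directory tree, rescanning when rmdir() finds leftovers.

    Hypothetical helper for illustration only. Every attempt opens a new
    os.scandir() iterator, which reads the directory from the start again,
    so an entry added behind the previous scan position is seen this time.
    """
    for _ in range(max_attempts):
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    remove_tree_with_retry(entry.path, max_attempts)
                else:
                    os.unlink(entry.path)
        try:
            os.rmdir(path)
            return
        except OSError as exc:
            # POSIX allows either ENOTEMPTY or EEXIST for a non-empty directory.
            if exc.errno not in (errno.ENOTEMPTY, errno.EEXIST):
                raise
            # Something was added while we were scanning: go around again.
    raise OSError(errno.ENOTEMPTY, os.strerror(errno.ENOTEMPTY), path)


if __name__ == "__main__":
    # Small self-check on a throwaway tree.
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, "internal", "socket"))
    with open(os.path.join(root, "internal", "socket", "file.go"), "w"):
        pass
    remove_tree_with_retry(root)
    print("removed:", not os.path.exists(root))
```

A bounded retry only narrows the window. If another process keeps creating entries, the loop gives up with ENOTEMPTY, which is the honest answer in that situation.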

> Collected some data with strace, and it seems that getdents is not
> returning all entries:
>
> 5.4 getdents64() returns 52+50+1+0 entries
> => all files in directory are deleted and rmdir() is OK
>
> 5.5 getdents64() returns 52+50+0+0 entries
> => rmdir() fails with ENOTEMPTY

Yup, that's a classic userspace TOCTOU race. Remember, getdents() is
effectively a sequential walk through the directory data -
subsequent calls start at the offset (cookie) where the previous one
left off. New entries can be added between getdents() syscalls.

If that new entry is put at the tail of the directory, then the last
getdents() call will return that entry rather than none, because it
was placed at an offset in the directory that the getdents() sweep
has not yet reached, and hence it will be found by a later
getdents() call in the sweep.

However, if there is a hole in the directory structure before the
current getdents cookie offset, a new entry can be added in that
hole. i.e. at an offset in the directory that getdents has already
passed over. That dirent will never be reported by the current
getdents() sequence - a directory rewind and re-read is required to
find it. i.e. there's an inherent userspace TOCTOU race condition
in 'rm -rf' operations.

IOWs, this is exactly what you'd expect to see when there are
concurrent userspace modifications to a directory that is currently
being read. Hence you need to rule out application and userspace
level issues before looking for filesystem level problems.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
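
The two cases described above (new entry at the tail versus new entry in a hole behind the cookie) can be illustrated with a toy model. This is a sketch under loud assumptions: ToyDirectory, its slot list and the "first hole wins" placement rule are invented for the illustration and do not reflect the real XFS directory format or its allocation policy. Only the cookie-based sequential walk mirrors the behaviour discussed in the mail.

```python
class ToyDirectory:
    """Toy model: a directory is a list of slots, None marks a hole."""

    def __init__(self, slots):
        self.slots = list(slots)

    def getdents(self, cookie, count):
        """Return (names, new_cookie) for up to `count` entries from `cookie` on."""
        names = []
        pos = cookie
        while pos < len(self.slots) and len(names) < count:
            if self.slots[pos] is not None:
                names.append(self.slots[pos])
            pos += 1
        return names, pos

    def add(self, name):
        """Assumed placement rule: fill the first hole, else append at the tail."""
        for index, slot in enumerate(self.slots):
            if slot is None:
                self.slots[index] = name
                return index
        self.slots.append(name)
        return len(self.slots) - 1


def sweep_with_concurrent_add(directory, batch, add_after_call, new_name):
    """Walk the directory in batches; add `new_name` after the given call."""
    seen = []
    cookie = 0
    calls = 0
    while True:
        names, cookie = directory.getdents(cookie, batch)
        calls += 1
        if not names:
            return seen
        seen.extend(names)
        if calls == add_after_call:
            directory.add(new_name)  # the "other process" racing with us


if __name__ == "__main__":
    tail_case = ToyDirectory(["a", "b", "c", "d"])
    hole_case = ToyDirectory(["a", None, "b", "c", "d"])
    print("tail:", sweep_with_concurrent_add(tail_case, 2, 1, "new"))
    print("hole:", sweep_with_concurrent_add(hole_case, 2, 1, "new"))
```

In the tail case the sweep reports the new entry, because it lands at an offset the walk has not reached yet. In the hole case the entry lands at slot 1 while the cookie is already at 3, so the sweep finishes without ever returning it and only a fresh walk from cookie 0 would find it.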