From: Florian Weimer
To: linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-ext4@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, v9fs-developer@lists.sourceforge.net, libc-alpha@sourceware.org, qemu-devel@nongnu.org, ericvh@gmail.com, rminnich@sandia.gov, lucho@ionkov.net, hpa@zytor.com, arnd@arndb.de
Subject: d_off field in struct dirent and 32-on-64 emulation
Date: Thu, 27 Dec 2018 18:18:19 +0100
Message-ID: <87bm56vqg4.fsf@mid.deneb.enyo.de>

We have a bit of an interesting problem with respect to the d_off field in struct dirent.
When running a 64-bit kernel on certain file systems, notably ext4, this field uses the full 63 bits even for small directories (strace -v output, wrapped here for readability):

  getdents(3, [
    {d_ino=1494304, d_off=3901177228673045825, d_reclen=40,
     d_name="authorized_keys", d_type=DT_REG},
    {d_ino=1494277, d_off=7491915799041650922, d_reclen=24,
     d_name=".", d_type=DT_DIR},
    {d_ino=1314655, d_off=9223372036854775807, d_reclen=24,
     d_name="..", d_type=DT_DIR}
  ], 32768) = 88

When running in 32-bit compat mode, this value is somehow truncated to 31 bits, for both the getdents and the getdents64 (!) system call (at least on i386).

In an effort to simplify support for future architectures which only have the getdents64 system call, we changed glibc 2.28 to use the getdents64 system call unconditionally, and perform translation if necessary. This translation is noteworthy because it includes overflow checking for the d_ino and d_off members of struct dirent. We did not initially observe a regression because the kernel performs consistent d_off truncation (with the ext4 file system; small directories do not show this issue on XFS), so the overflow check does not fire.

However, both qemu-user and the 9p file system can run in such a way that the kernel is entered from a 64-bit process, but the actual usage is from a 32-bit process.

I think diagrammatically, this looks like this:

  guest process (32-bit)
    | getdents64, 32-bit UAPI
  qemu-user (64-bit)
    | getdents, 64-bit UAPI
  host kernel (64-bit)

Or:

  guest process
    | getdents64, 32-bit UAPI
  guest kernel (64-bit)
    | 9p over virtio (64-bit d_off in struct p9_dirent)
  qemu
    | getdents, 64-bit UAPI
  host kernel (64-bit)

Back when we still called getdents, in the first case, the 32-bit getdents system call emulation in a 64-bit qemu-user process would just silently truncate the d_off field as part of the translation, not reporting an error.

The second case is more complicated, and I have not figured out where the truncation happens.
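
To make the two behaviors described above concrete, here is a minimal sketch contrasting a checked narrowing of d_off (in the spirit of the overflow check in the glibc 2.28 translation) with a silent truncation (in the spirit of the old qemu-user getdents emulation). The helper names narrow_d_off_checked and narrow_d_off_truncating are made up for illustration, and the assumption that the 32-bit d_off is a signed 32-bit value is mine. This is not the actual glibc or qemu code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not glibc's real code): narrow a 64-bit d_off to
   the 32-bit signed offset assumed for a 32-bit ABI, and report
   EOVERFLOW if the value cannot be represented.  */
static int
narrow_d_off_checked (int64_t d_off64, int32_t *d_off32)
{
  assert (d_off32 != NULL);
  if (d_off64 < INT32_MIN || d_off64 > INT32_MAX)
    return EOVERFLOW;
  *d_off32 = (int32_t) d_off64;
  return 0;
}

/* Hypothetical helper (not qemu's real code): keep only the low 32 bits
   and report no error.  The result can no longer be mapped back to the
   original 64-bit d_off.  */
static uint32_t
narrow_d_off_truncating (int64_t d_off64)
{
  /* Conversion to an unsigned type is well defined: modulo 2^32.  */
  return (uint32_t) d_off64;
}
```

With the ext4 values from the strace output above, the checked variant returns EOVERFLOW for every entry, so every readdir call fails. The truncating variant succeeds but yields a value that cannot be handed back to the 64-bit side for seekdir.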
This truncation has always been a bug; it breaks telldir/seekdir at least in some cases. But use of telldir/seekdir is comparatively rare. In contrast, now that we detect d_off overflow in glibc, readdir will always fail in the sketched configurations, which is bad. (glibc exposes the d_off field to applications, and it cannot know whether the application will use it or not, so there is no direct way to restrict the overflow error to the telldir/seekdir use case.)

We could switch glibc to call getdents again if the system call is available. But that merely relies on the existence of the truncation bug somewhere else in the file system stack. This is why I don't think it's the right solution, just the path of least resistance.

I don't want to reimplement the ext4 truncation behavior in glibc (it doesn't look like a straightforward truncation), and it wouldn't work for the second scenario where we see the 9p file system in the 32-bit glibc, not the ext4 file system. So that's not a good solution.

There is another annoying aspect: The standards expose d_off through the telldir function, and that returns long int on all architectures (not off_t, so unchanged by _FILE_OFFSET_BITS). That's mostly a userspace issue and thus needs different steps to resolve (possibly standards action).

Any suggestions how to solve this? Why does the kernel return different d_off values for 32-bit and 64-bit processes even when using getdents64, for the same directory?