From: Daniel Phillips
To: Alex Tomas, "Martin J. Bligh"
Cc: linux-kernel, ext2-devel@lists.sourceforge.net, "Theodore Ts'o", Andrew Morton, Alex Tomas
Subject: Re: [Bug 417] New: htree much slower than regular ext3
Date: Sat, 8 Mar 2003 18:38:50 +0100

On Fri 07 Mar 03 16:46, Alex Tomas wrote:
> Hi!
>
> The problem is that getdents(2) returns inodes out of order, so du
> causes many head seeks. I tried to solve the problem with the patch
> I included. The idea of the patch is pretty simple: just try to sort
> dentries by inode number in readdir(). It works because inodes sit
> at fixed places in ext2/ext3. Please look at the results I just got:
>
>                  real         user        sys
>
> ext3+htree:      7m22.236s    0m0.292s    0m2.204s
> ext3-htree:      3m6.539s     0m0.481s    0m2.278s
> [...]
> ext3+sd-30:      2m39.658s    0m0.492s    0m2.138s

Nice. I posted a similar demonstration some time ago (could it *really*
be two years?) but I sorted the dirents in user space, which meant it
was just a demonstration, not a practical solution.

The problem I see with your approach is that the traversal is no longer
in hash order, so a leaf split in the middle of a directory traversal
could result in a lot of duplicate dirents. I'm not sure there's a way
around that.

Another approach is to have the inode number correspond approximately
to the hash value. That's not as hard as it sounds; it simply means
that the goal for ext3_new_inode should be influenced by the hash
value. (But watch out for interaction with the new Orlov allocator.)
It also depends on some a priori estimate of the size of a directory,
so that increasing hash values can be distributed more or less
uniformly and monotonically across a corresponding range of inode
numbers. This isn't too hard either: just round the current size of
the directory up to a power of two and use that as the size estimate.
The size estimate would change from time to time over the life of the
directory, but there are only log N different sizes, and that's roughly
how heavy the load on the inode cache would be during a directory
traversal. Finally, a nice property of this approach is that it stays
stable over many creates and deletes.

We've apparently got a simpler problem, though: inode numbers aren't
allocated densely enough in the inode table blocks. This is the only
thing I can think of that could cause the amount of thrashing we're
seeing. Spreading inode number allocations out all over a volume is
OK, but sparse allocation within each table block is not OK, because
it multiplies the pressure on the cache. We want a 30,000 entry
directory to touch 1,000 - 2,000 inode table blocks, not the
worst-case 30,000. (With 4 KB blocks and 128-byte inodes there are 32
inodes per table block, so 30,000 densely packed inodes fit in roughly
940 blocks.) As above, fixing this comes down to tweaking the
ext3_new_inode allocation goal.
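For what it's worth, here is roughly what the user-space version of the
trick looks like. This is a reconstruction for illustration only, not
Alex's kernel patch and not my original demonstration code: read all
the dirents, sort them by inode number, then stat them in that order so
the inode table is read roughly sequentially instead of seeking all
over the disk.

	#include <dirent.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/stat.h>
	#include <sys/types.h>

	struct ent {
		ino_t ino;
		char name[256];
	};

	static int by_ino(const void *a, const void *b)
	{
		ino_t x = ((const struct ent *)a)->ino;
		ino_t y = ((const struct ent *)b)->ino;

		return x < y ? -1 : x > y;
	}

	int main(int argc, char **argv)
	{
		const char *path = argc > 1 ? argv[1] : ".";
		DIR *dir = opendir(path);
		struct dirent *de;
		struct ent *ents = NULL;
		size_t n = 0, max = 0;
		long long blocks = 0;

		if (!dir)
			return 1;
		while ((de = readdir(dir)) != NULL) {
			if (n == max) {
				struct ent *tmp;

				max = max ? max * 2 : 1024;
				tmp = realloc(ents, max * sizeof(*ents));
				if (!tmp)
					return 1;
				ents = tmp;
			}
			ents[n].ino = de->d_ino;
			strncpy(ents[n].name, de->d_name, sizeof(ents[n].name) - 1);
			ents[n].name[sizeof(ents[n].name) - 1] = '\0';
			n++;
		}
		closedir(dir);

		/* Visit the inodes in increasing inode number order. */
		qsort(ents, n, sizeof(*ents), by_ino);
		for (size_t i = 0; i < n; i++) {
			struct stat st;
			char full[4352];

			snprintf(full, sizeof(full), "%s/%s", path, ents[i].name);
			if (lstat(full, &st) == 0)
				blocks += st.st_blocks;	/* what du adds up */
		}
		printf("%lld blocks in %zu entries\n", blocks, n);
		free(ents);
		return 0;
	}

Doing it in user space means buffering the whole directory, which is
exactly why it stayed a demonstration rather than a fix.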
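To make the goal calculation concrete, here is a sketch of the
arithmetic I have in mind. The names are made up and none of this is in
ext3 today; it just shows how a 31-bit htree hash could be spread
uniformly and monotonically across an inode number range sized from the
power-of-two estimate, with the result becoming the goal handed to
ext3_new_inode.

	#include <stdint.h>

	/* Round the current entry count up to a power of two. */
	uint32_t dir_size_estimate(uint32_t nr_entries)
	{
		uint32_t est = 1;

		while (est < nr_entries)
			est <<= 1;
		return est;
	}

	/*
	 * base_ino:  first inode number of the range this directory
	 *            allocates from (hypothetical).
	 * ino_range: how many inode numbers to spread the directory
	 *            across, e.g. dir_size_estimate(nr_entries) if the
	 *            directory gets one inode slot per expected entry
	 *            (also hypothetical).
	 * hash:      31-bit htree hash of the entry name.
	 *
	 * Increasing hash gives an increasing goal, so a hash-order
	 * traversal walks the inode table roughly in order.
	 */
	uint32_t hash_to_inode_goal(uint32_t base_ino, uint32_t ino_range,
				    uint32_t hash)
	{
		return base_ino + (uint32_t)(((uint64_t)hash * ino_range) >> 31);
	}

The size estimate only changes log N times over the life of the
directory, which is what keeps the mapping stable across creates and
deletes; the Orlov interaction mentioned above is the part that would
need real care.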
Regards,

Daniel