From: Andreas Dilger
Subject: Re: bigalloc and max file size
Date: Mon, 31 Oct 2011 10:34:00 -0600
Message-ID: <0C9B7A59-644C-4ABF-8021-37632B49B035@dilger.ca>
In-Reply-To: <422BEB28-76D0-4FD8-B7AE-130C9AAE10C0@dilger.ca>
References: <51BECC2B-2EBC-4FCB-B708-8431F7CB6E0D@dilger.ca> <5846CEDC-A1ED-4BB4-8A3E-E726E696D3E9@mit.edu> <97D9C5CC-0F22-4BC7-BDFA-7781D33CA7F3@whamcloud.com> <4EACE2B7.9070402@coly.li> <4EAE6BD4.9080705@coly.li> <583E0040-4EFA-4EBC-A738-A8968BB9135C@mit.edu> <422BEB28-76D0-4FD8-B7AE-130C9AAE10C0@dilger.ca>
To: Andreas Dilger
Cc: Theodore Tso, "i@coly.li", Andreas Dilger, linux-ext4 development, Alex Zhuravlev, Tao Ma, "hao.bigrat@gmail.com"

On 2011-10-31, at 10:08 AM, Andreas Dilger wrote:
> On 2011-10-31, at 4:22 AM, Theodore Tso wrote:
>
>> For cluster file systems, such as when you might build Hadoop on top
>> of ext4, there's no real advantage of using RAID arrays as opposed to
>> having single file systems on each disk. In fact, due to the speed of
>> being able to check multiple disk spindles in parallel, it's
>> advantageous to build cluster file systems on single disk file systems.
>
> For Lustre at least there are a number of reasons why it uses large
> RAID devices to store the data instead of many small devices:
> - fewer devices that need to be managed at the filesystem level. Lustre
>   runs on systems with more than 13000 drives, and having to manage
>   connection state for that many internal devices is a lot of overhead.

Doh, hit send too soon...

- reduced complexity of filesystem allocation decisions with fewer large
  LUNs vs many smaller LUNs
- reduced free space and file fragmentation with fewer large LUNs, since
  the block allocator for each LUN has more blocks to choose from
- sysadmin of so many unique devices is difficult, while clustering them
  into RAID sets with hardware management features (we call this
  blinkenlights) makes this tractable compared to software RAID on
  generic hardware.
- performance management of the RAID hardware can detect and mask
  individual drives that are slow compared to others in that RAID set,
  which is much harder if each drive is treated individually

These reasons don't apply to all cluster filesystems, but I thought I'd
chime in on why we use large LUNs even though we could also handle more
smaller LUNs.

Cheers, Andreas