Subject: Re: [NOMERGE] [RFC PATCH 00/12] erofs: introduce erofs file system
To: Richard Weinberger
CC: LKML, linux-fsdevel
References: <1527764767-22190-1-git-send-email-gaoxiang25@huawei.com>
From: Gao Xiang
Date: Fri, 1 Jun 2018 17:11:21 +0800
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Richard,

On 2018/6/1 15:48, Richard Weinberger wrote:
> On Thu, May 31, 2018 at 1:06 PM, Gao Xiang wrote:
>> Hi all,
>>
>> Read-only file systems are used in many
>> cases, such as read-only storage media.
>> We are now focusing on Android devices, on which several read-only partitions exist.
>> Due to limited read-only solutions, a new read-only file system EROFS
>> (Extendable Read-Only File System) is introduced.
>
> In which sense is it extendable?

Actually, the name is meant to emphasize an enhanced read-only file system
(that is, not just read-only, but with a scalable on-disk layout, compression,
or fs-verify in the future).

We have also considered other candidate full names, such as Enhanced / Extended
Read-Only File System; any name that abbreviates to "erofs" is okay.

>
>> As in other read-only file systems, several metadata regions found in generic file
>> systems, such as the free space bitmap, are omitted. But the difference is that EROFS
>> focuses more on performance than purely on saving as much storage space as possible.
>>
>> Furthermore, we also add compression support called z_erofs.
>>
>> Traditional file systems with compression support use fixed-sized input
>> compression, so the compressed output units can have arbitrary lengths.
>> However, block devices are accessed in units of blocks, which means
>> (A) if the accessed compressed data is not buffered, some data read from
>> the physical block cannot be further utilized, which is illustrated as follows:
>>
>>    ++-----------++-----------++          ++-----------++-----------++
>> ...||           ||           ||   ...    ||           ||           ||  ... original data
>>    ++-----------++-----------++          ++-----------++-----------++
>>     \                        /            \                        /
>>      \                    /                 \                     /
>>        \              /                        \                /
>>    ++---|-------++--|--------++          ++-----|----++--------|--++
>>    ||xxx|       ||  |xxxxxxxx||   ...    ||xxxxx|    ||        |xx||   compressed data
>>    ++---|-------++--|--------++          ++-----|----++--------|--++
>>
>> The shaded regions (x) are read from the block device but cannot be used for decompression.
>>
>> (B) If the compressed data is also buffered, it will increase the memory overhead.
>> Because this data is compressed, it cannot be used directly, and we don't know
>> when the corresponding compressed blocks will be accessed, which is unfriendly to
>> random reads.
>>
>> In order to reduce the proportion of data that cannot be directly decompressed,
>> larger compressed sizes tend to be selected, which is also unfriendly to
>> random reads.
>>
>> EROFS implements compression with a different approach, the details of which will
>> be discussed in the next section.
>>
>> In brief, the following points summarize our design at a high level:
>>
>> 1) Use page-sized blocks so that there are no buffer heads.
>>
>> 2) By introducing a more general inline data / xattr, metadata and small data have
>> the opportunity to be read together with the inode metadata.
>>
>> 3) Introduce another shared xattr region in order to store the common xattrs (e.g.
>> SELinux labels) or xattrs too large to be suitable for meta inline.
>>
>> 4) Metadata and data can be mixed by design, so mkfs has more flexibility
>> to organize files and data.
>>
>> 5) Instead of using fixed-sized input compression, we put forward a new fixed-output
>> compression to make full use of IO (which means all data from IO can be
>> decompressed), reduce read amplification, improve random read, and keep
>> relatively lower compression ratios, illustrated as follows:
>>
>>
>>         |---- variable-length extent ---|------ VLE ------|---- VLE ---|
>>          /> clusterofs                   /> clusterofs     /> clusterofs /> clusterofs
>>    ++---|-------++-----------++---------|-++-----------++-|---------++-|
>> ...||   |       ||           ||         | ||           || |         || | ... original data
>>    ++---|-------++-----------++---------|-++-----------++-|---------++-|
>>    ++->cluster<-++->cluster<-++->cluster<-++->cluster<-++->cluster<-++
>>          size         size         size         size         size
>>           \                           /                /            /
>>            \                       /               /            /
>>             \                   /              /            /
>>               ++-----------++-----------++-----------++
>>           ... ||           ||           ||           || ... compressed clusters
>>               ++-----------++-----------++-----------++
>>               ++->cluster<-++->cluster<-++->cluster<-++
>>                     size         size         size
>>
>> A cluster could have more than one block by design, but currently we only have the
>> page-sized cluster implementation (page-sized fixed-output compression can also have
>> a better compression ratio than fixed-input compression).
>>
>> All compressed clusters have a fixed size but can be decompressed into extents of
>> arbitrary lengths.
>>
>> In addition, if a buffered IO reads the following shaded region (x), we could make a more
>> customized path (to replace generic_file_buffered_read) which only reads one compressed
>> cluster and makes the partial page available.
>>          /> clusterofs
>>    ++---|-------++
>> ...||   | xxxx  ||  ...
>>    ||---|-------||
>>
>> Some numbers using fixed-output compression (VLE, cluster size = block size = 4k) on
>> the server and Android phone (kirin970 platform):
>>
>> Server (magnetic disk):
>>
>> compression   EROFS seq read   EXT4 seq read   EROFS random read   EXT4 random read
>> ratio         bw[MB/s]         bw[MB/s]        bw[MB/s] (20%)      bw[MB/s] (20%)
>>
>> 4             480.3            502.5           69.8                11.1
>> 10            472.3            503.3           56.4                10.0
>> 15            457.6            495.3           47.0                10.9
>> 26            401.5            511.2           34.7                11.1
>> 35            389.1            512.5           28.0                11.0
>> 48            375.4            496.5           23.2                10.6
>> 53            370.2            512.0           21.8                11.0
>> 66            349.2            512.0           19.0                11.4
>> 76            310.5            497.3           17.3                11.6
>> 85            301.2            512.0           16.0                11.0
>> 94            292.7            496.5           14.6                11.1
>> 100           538.9            512.0           11.4                10.8
>>
>> Kirin970 (A73 Big-core 2361MHz, A53 little-core 0MHz, DDR 1866MHz):
>
> What storage was used? An eMMC?

UFS device, fio with psync, bs=4k, iodepth=1.
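
For readers unfamiliar with those fio settings: the psync engine issues plain pread(2) calls, bs=4k sets the request size, and iodepth=1 means one request in flight at a time. A minimal Python sketch of that access pattern follows, assuming a Unix system. The helper name `random_read_4k` and its parameters are hypothetical, and this is not the job that produced the numbers in this thread.

```python
import os
import random
import time

BLOCK_SIZE = 4096  # corresponds to bs=4k


def random_read_4k(path, count, seed=0):
    """Issue `count` synchronous 4 KiB reads at random block-aligned offsets.

    Hypothetical helper: one blocking os.pread() at a time mirrors the
    psync engine with iodepth=1. Returns (bytes_read, elapsed_seconds).
    """
    nblocks = os.path.getsize(path) // BLOCK_SIZE
    if nblocks == 0:
        raise ValueError("file is smaller than one block")
    rng = random.Random(seed)
    fd = os.open(path, os.O_RDONLY)
    try:
        total = 0
        start = time.perf_counter()
        for _ in range(count):
            # Pick a random block and read it with a single pread call.
            offset = rng.randrange(nblocks) * BLOCK_SIZE
            total += len(os.pread(fd, BLOCK_SIZE, offset))
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return total, elapsed
```

Reads served from the page cache would inflate any bandwidth computed from this, so a real measurement needs a cold cache or direct IO, which the sketch does not handle.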
>
>> compression   EROFS seq read   EXT4 seq read   EROFS random read   EXT4 random read
>> ratio         bw[MB/s]         bw[MB/s]        bw[MB/s] (20%)      bw[MB/s] (20%)
>>
>> 4             546.7            544.3           157.7               57.9
>> 10            535.7            521.0           152.7               62.0
>> 15            529.0            520.3           125.0               65.0
>> 26            418.0            526.3           97.6                63.7
>> 35            367.7            511.7           89.0                63.7
>> 48            415.7            500.7           78.2                61.2
>> 53            423.0            566.7           72.8                62.9
>> 66            334.3            537.3           69.8                58.3
>> 76            387.3            546.0           65.2                56.0
>> 85            306.3            546.0           63.8                57.7
>> 94            345.0            589.7           59.2                49.9
>> 100           579.7            556.7           62.1                57.7
>
> How does it compare to existing read only filesystems, such as squashfs?
>

You are quite right. We are now focusing on improving our decompression subsystem,
and these numbers will be added in future non-RFC patches.

We haven't paid much attention to comparing squashfs and erofs yet, because we
tried squashfs on our products with different block sizes several years ago and,
apart from its performance, it behaved unacceptably in low-free-memory scenarios.

This version of the patchset is mainly intended for the open-source archive.

Thanks for your attention :)

Thanks,
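
The fixed-output idea from point 5 of the quoted cover letter can be illustrated outside the kernel. Below is a minimal Python sketch, assuming zlib as a stand-in compressor (this message does not name the algorithm) and using hypothetical helper names `pack_fixed_output` and `read_at`. It is not the EROFS on-disk format, only an illustration that every compressed unit fits in one cluster while the logical extents vary in length.

```python
import bisect
import zlib

CLUSTER_SIZE = 4096  # compressed cluster size = block size = 4k


def pack_fixed_output(data, cluster_size=CLUSTER_SIZE):
    """Split data into variable-length extents that each compress to <= cluster_size.

    Returns a list of (logical_offset, logical_length, payload) tuples.
    """
    extents = []
    pos = 0
    while pos < len(data):
        lo, hi = 1, len(data) - pos
        best = None
        # Binary search (a heuristic, not guaranteed optimal) for a long
        # input prefix whose compressed form still fits in one cluster.
        while lo <= hi:
            mid = (lo + hi) // 2
            blob = zlib.compress(data[pos:pos + mid])
            if len(blob) <= cluster_size:
                best = (mid, blob)
                lo = mid + 1
            else:
                hi = mid - 1
        if best is None:
            raise ValueError("cluster_size too small for the compressor overhead")
        length, blob = best
        extents.append((pos, length, blob))
        pos += length
    return extents


def read_at(extents, offset):
    """Return the byte at a logical offset by decompressing exactly one cluster."""
    starts = [start for start, _, _ in extents]
    idx = bisect.bisect_right(starts, offset) - 1
    if idx < 0:
        raise IndexError("offset out of range")
    start, length, blob = extents[idx]
    if offset >= start + length:
        raise IndexError("offset out of range")
    return zlib.decompress(blob)[offset - start]
```

A random read touches a single cluster here, and everything read from that cluster is usable for decompression, which is the property the cover letter contrasts with fixed-input compression.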