In-Reply-To: <1527764767-22190-1-git-send-email-gaoxiang25@huawei.com>
References: <1527764767-22190-1-git-send-email-gaoxiang25@huawei.com>
From: Richard Weinberger
Date: Fri, 1 Jun 2018 09:48:12 +0200
Subject: Re: [NOMERGE] [RFC PATCH 00/12] erofs: introduce erofs file system
To: Gao Xiang
Cc: LKML, linux-fsdevel, miaoxie@huawei.com, yuchao0@huawei.com, sunqiuyang@huawei.com, fangwei1@huawei.com, liguifu2@huawei.com, weidu.du@huawei.com, chen.chun.yen@huawei.com, brooke.wangzhigang@hisilicon.com, dongjinguang@huawei.com
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, May 31, 2018 at 1:06 PM, Gao Xiang wrote:
> Hi all,
>
> Read-only file systems are used in many cases, such as read-only storage media.
> We are now focusing on the Android device which several read-only partitions exist.
> Due to limited read-only solutions, a new read-only file system EROFS
> (Extendable Read-Only File System) is introduced.

In which sense is it extendable?

> As the other read-only file systems, several meta regions in generic file systems
> such as free space bitmap are omitted. But the difference is that EROFS focuses
> more on performance than purely on saving storage space as much as possible.
>
> Furthermore, we also add the compression support called z_erofs.
>
> Traditional file systems with the compression support use the fixed-sized input
> compression, the output compressed units could be arbitrary lengths.
> However, data is accessed in the block unit for block devices, which means
> (A) if the accessed compressed data is not buffered, some data read from
> the physical block cannot be further utilized, which is illustrated as follows:
>
>    ++-----------++-----------++       ++-----------++-----------++
> ...||           ||           ||  ...  ||           ||           ||  ... original data
>    ++-----------++-----------++       ++-----------++-----------++
>     \                       /           \                       /
>      \                   /                \                    /
>        \              /                     \                /
>    ++---|-------++--|--------++       ++-----|----++--------|--++
>    ||xxx|       ||  |xxxxxxxx||  ...  ||xxxxx|    ||        |xx||  compressed data
>    ++---|-------++--|--------++       ++-----|----++--------|--++
>
> The shadow regions read from the block device but cannot be used for decompression.
>
> (B) If the compressed data is also buffered, it will increase the memory overhead.
> Because these are compressed data, it cannot be directly used, and we don't know
> when the corresponding compressed blocks are accessed, which is not friendly to
> the random read.
>
> In order to reduce the proportion of the data which cannot be directly decompressed,
> larger compressed sizes are preferred to be selected, which is also not friendly to
> the random read.
>
> Erofs implements the compression in a different approach, the details of which will
> be discussed in the next section.
>
> In brief, the following points summarize our design at a high level:
>
> 1) Use page-sized blocks so that there are no buffer heads.
>
> 2) By introducing a more general inline data / xattr, metadata and small data have
> the opportunity to be read with the inode metadata at the same time.
>
> 3) Introduce another shared xattr region in order to store the common xattrs (eg.
> selinux labels) or xattrs too large to be suitable for meta inline.
>
> 4) Metadata and data could be mixed by design, so it could be more flexible for mkfs
> to organize files and data.
>
> 5) instead of using the fixed-sized input compression, we put forward a new fixed
> output compression to make the full use of IO (which means all data from IO can be
> decompressed), reduce the read amplification, improve random read and keep the
> relatively lower compression ratios, illustrated as follows:
>
>
>         |---- varient-length extent ----|------ VLE ------|--- VLE ---|
>          /> clusterofs                   /> clusterofs     /> clusterofs /> clusterofs
>    ++---|-------++-----------++---------|-++-----------++-|---------++-|
> ...||   |       ||           ||         | ||           || |         || | ... original data
>    ++---|-------++-----------++---------|-++-----------++-|---------++-|
>    ++->cluster<-++->cluster<-++->cluster<-++->cluster<-++->cluster<-++
>         size         size         size         size         size
>          \                           /                /            /
>            \                      /               /            /
>              \                 /             /            /
>               ++-----------++-----------++-----------++
>           ... ||           ||           ||           || ... compressed clusters
>               ++-----------++-----------++-----------++
>               ++->cluster<-++->cluster<-++->cluster<-++
>                    size         size         size
>
> A cluster could have more than one blocks by design, but currently we only have the
> page-sized cluster implementation (page-sized fixed output compression can also have
> better compression ratio than fixed input compression).
>
> All compressed clusters have a fixed size but could be decompressed into extents with
> arbitrary lengths.
>
> In addition, if a buffered IO reads the following shadow region (x), we could make a more
> customized path (to replace generic_file_buffered_read) which only reads one compressed
> cluster and makes the partial page available.
>          /> clusterofs
>    ++---|-------++
> ...||   | xxxx  ||  ...
>    ||---|-------||
>
> Some numbers using fixed output compression (VLE, cluster size = block size = 4k) on
> the server and Android phone (kirin970 platform):
>
> Server (magnetic disk):
>
> compression  EROFS seq read  EXT4 seq read  EROFS random read  EXT4 random read
> ratio        bw[MB/s]        bw[MB/s]       bw[MB/s] (20%)     bw[MB/s] (20%)
>
> 4            480.3           502.5          69.8               11.1
> 10           472.3           503.3          56.4               10.0
> 15           457.6           495.3          47.0               10.9
> 26           401.5           511.2          34.7               11.1
> 35           389.1           512.5          28.0               11.0
> 48           375.4           496.5          23.2               10.6
> 53           370.2           512.0          21.8               11.0
> 66           349.2           512.0          19.0               11.4
> 76           310.5           497.3          17.3               11.6
> 85           301.2           512.0          16.0               11.0
> 94           292.7           496.5          14.6               11.1
> 100          538.9           512.0          11.4               10.8
>
> Kirin970 (A73 Big-core 2361Mhz, A53 little-core 0Mhz, DDR 1866Mhz):

What storage was used? An eMMC?

> compression  EROFS seq read  EXT4 seq read  EROFS random read  EXT4 random read
> ratio        bw[MB/s]        bw[MB/s]       bw[MB/s] (20%)     bw[MB/s] (20%)
>
> 4            546.7           544.3          157.7              57.9
> 10           535.7           521.0          152.7              62.0
> 15           529.0           520.3          125.0              65.0
> 26           418.0           526.3          97.6               63.7
> 35           367.7           511.7          89.0               63.7
> 48           415.7           500.7          78.2               61.2
> 53           423.0           566.7          72.8               62.9
> 66           334.3           537.3          69.8               58.3
> 76           387.3           546.0          65.2               56.0
> 85           306.3           546.0          63.8               57.7
> 94           345.0           589.7          59.2               49.9
> 100          579.7           556.7          62.1               57.7

How does it compare to existing read-only filesystems, such as squashfs?

-- 
Thanks,
//richard
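
For reference, the fixed-output idea quoted in point 5) can be illustrated with a
small Python sketch. This is not the z_erofs code: pack_fixed_output, unpack and
CLUSTER_SIZE are hypothetical names, zlib merely stands in for the real compressor,
and the binary search is only a heuristic for finding a long extent whose compressed
form still fits in one cluster.

```python
import zlib

# Assumed cluster size, matching the "cluster size = block size = 4k" figure quoted above.
CLUSTER_SIZE = 4096


def pack_fixed_output(data: bytes, cluster_size: int = CLUSTER_SIZE):
    """Split data into variable-length extents, each compressing to <= cluster_size.

    Returns a list of (extent_length, compressed_blob) tuples. Hypothetical helper
    for illustration only; it is not the on-disk layout of any real file system.
    """
    clusters = []
    pos = 0
    while pos < len(data):
        # A single input byte always fits (assuming cluster_size is at least a
        # few dozen bytes), so it serves as the fallback extent.
        best_len = 1
        best_blob = zlib.compress(data[pos:pos + 1])
        lo, hi = 1, len(data) - pos
        # Binary search for a long extent that still fits. Compressed size is not
        # strictly monotonic in input length, so this is a heuristic; every
        # accepted candidate is checked against the cluster size.
        while lo <= hi:
            mid = (lo + hi) // 2
            blob = zlib.compress(data[pos:pos + mid])
            if len(blob) <= cluster_size:
                best_len, best_blob = mid, blob
                lo = mid + 1
            else:
                hi = mid - 1
        clusters.append((best_len, best_blob))
        pos += best_len
    return clusters


def unpack(clusters) -> bytes:
    """Decompress every cluster in order and concatenate the extents."""
    return b"".join(zlib.decompress(blob) for _, blob in clusters)
```

Each tuple pairs an extent length with a blob no larger than one cluster, so a reader
that fetches one cluster can decompress everything it read. A real format would also
pad each blob to the cluster size and record per-extent offsets such as clusterofs,
which this sketch leaves out.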