2012-10-05 11:56:11

by Jaegeuk Kim

[permalink] [raw]
Subject: [PATCH 01/16] f2fs: add document

This adds a document describing the mount options, proc entries, usage, and
design of Flash-Friendly File System, namely F2FS.

Signed-off-by: Jaegeuk Kim <[email protected]>
---
Documentation/filesystems/00-INDEX | 2 +
Documentation/filesystems/f2fs.txt | 314 ++++++++++++++++++++++++++++++++++++
2 files changed, 316 insertions(+)
create mode 100644 Documentation/filesystems/f2fs.txt

diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index 8c624a1..ce5fd46 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -48,6 +48,8 @@ ext4.txt
- info, mount options and specifications for the Ext4 filesystem.
files.txt
- info on file management in the Linux kernel.
+f2fs.txt
+ - info and mount options for the F2FS filesystem.
fuse.txt
- info on the Filesystem in User SpacE including mount options.
gfs2.txt
diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt
new file mode 100644
index 0000000..cd3f846
--- /dev/null
+++ b/Documentation/filesystems/f2fs.txt
@@ -0,0 +1,314 @@
+================================================================================
+WHAT IS Flash-Friendly File System (F2FS)?
+================================================================================
+
+NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
+been widely being used for ranging from mobile to server systems. Since they are
+known to have different characteristics from the conventional rotational disks,
+a file system, an upper layer to the storage device, should adapt to the changes
+from the sketch.
+
+F2FS is a file system exploiting NAND flash memory-based storage devices, which
+is based on Log-structured File System (LFS). The design has been focused on
+addressing the fundamental issues in LFS, which are snowball effect of wandering
+tree and high cleaning overhead.
+
+Since a NAND flash memory-based storage device shows different characteristic
+according to its internal geometry or flash memory management scheme aka FTL,
+F2FS and its tools support various parameters not only for configuring on-disk
+layout, but also for selecting allocation and cleaning algorithms.
+
+The file system formatting tool, "mkfs.f2fs", is available from the following
+download page: http://sourceforge.net/projects/f2fs-tools/
+
+================================================================================
+MOUNT OPTIONS
+================================================================================
+
+background_gc_off Turn off the cleaning operation, aka garbage collection,
+ in background triggered when I/O subsystem is idle.
+disable_roll_forward Disable the roll-forward recovery routine during SPOR.
+discard Issue discard/TRIM commands when a segment is cleaned.
+no_heap Disable heap-style segment allocation in which finds free
+ segments for data from the beginning of main area, while
+ for node from the end of main area.
+nouser_xattr Disable Extened User Attributes. Note: xattr is enabled
+ by default if CONFIG_F2FS_FS_XATTR is selected.
+noacl Disable POSIX Access Control List. Note: acl is enabled
+ by default if CONFIG_F2FS_FS_POSIX_ACL is selected.
+
+================================================================================
+PROC ENTRIES
+================================================================================
+
+/proc/fs/f2fs/ contains information about partitions mounted as f2fs. For each
+partition, a corresponding directory, named as its device name, is provided with
+the following proc entries.
+
+- f2fs_stat major file system information managed by f2fs currently
+- f2fs_sit_stat average SIT information about whole segments
+- f2fs_mem_stat current memory footprint consumed by f2fs
+
+e.g., in /proc/fs/f2fs/sdb1/
+
+================================================================================
+USAGE
+================================================================================
+
+1. Download userland tools
+
+2. Insmod f2fs.ko module:
+ # insmod f2fs.ko
+
+3. Check the directory trying to mount
+ # mkdir /mnt/f2fs
+
+4. Format the block device, and then mount as f2fs
+ # mkfs.f2fs -l label /dev/block_device
+ # mount -t f2fs /dev/block_device /mnt/f2fs
+
+================================================================================
+DESIGN
+================================================================================
+
+On-disk Layout
+--------------
+
+F2FS divides whole volume into a number of segments each of which size is 2MB by
+default. A section is composed of consecutive segments, and a zone consists of a
+set of sections.
+
+F2FS maintains logically six log areas. Except SB, all the log areas are managed
+in a unit of multiple segments. SB is located at the beggining of the partition,
+and there exist two superblocks to avoid file system crash. Other file system
+metadata such as CP, NAT, SIT, and SSA are located in front part of the volume.
+Main area contains file and directory data including their indices.
+
+Each area manages the following contents.
+- CP File system information, bitmaps for valid NAT/SIT sets, orphan
+ inode lists, and summary entries of current active segments.
+- NAT Block address table for all the node blocks stored in Main area.
+- SIT Segment information such as valid block count and bitmap for the
+ validity of all the blocks.
+- SSA Summary entries which contains the owner information of all the
+ data and node blocks stored in Main area.
+- Main Node and data blocks.
+
+In order to avoid misalignment between file system and flash-based storage, F2FS
+aligns the start block address of CP with the segment size. Also, it aligns the
+start block address of Main area with the zone size by reserving some segments
+in SSA area.
+
+ align with the zone size <-|
+ |-> align with the segment size
+ _________________________________________________________________________
+ | | | Node | Segment | Segment | |
+ | Superblock | Checkpoint | Address | Info. | Summary | Main |
+ | (SB) | (CP) | Table (NAT) | Table (SIT) | Area (SSA) | |
+ |____________|_____2______|______N______|______N______|______N_____|__N___|
+ . .
+ . .
+ . .
+ ._________________________________________.
+ |_Segment_|_..._|_Segment_|_..._|_Segment_|
+ . .
+ ._________._________
+ |_section_|__...__|_
+ . .
+ .________.
+ |__zone__|
+
+
+File System Metadata Structure
+------------------------------
+
+F2FS adopts the checkpointing scheme to maintain file system consistency. At the
+mount time, F2FS first tries to find the last valid checkpoint data by scanning
+CP area. In order to reduce the scanning time, F2FS uses only two copies of CP.
+One of them always indicates the last valid data, which is called as shadow copy
+mechanism. In addition to CP, NAT and SIT also adopts the shadow copy mechanism.
+
+For file system consistency, each CP points which NAT and SIT copies are valid,
+as shown as below.
+
+ +--------+----------+---------+
+ | CP | NAT | SIT |
+ +--------+----------+---------+
+ . . . .
+ . . . .
+ . . . .
+ +-------+-------+--------+--------+--------+--------+
+ | CP #0 | CP #1 | NAT #0 | NAT #1 | SIT #0 | SIT #1 |
+ +-------+-------+--------+--------+--------+--------+
+ | ^ ^
+ | | |
+ `----------------------------------------'
+
+Index Structure
+---------------
+
+The key data structure to manage the data locations is a "node". As similar as
+traditional file structures, F2FS has three types of node: inode, direct node,
+indirect node. F2FS assigns 4KB to an inode block where contains 929 data block
+indices, two direct node pointers, two indirect node pointers, and one double
+indirect node pointer as described below. One direct node block contains 1018
+data blocks, and one indirect node block contains also 1018 node blocks. Thus,
+One inode block (i.e., a file) covers:
+ 4KB * (929 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB.
+
+ Inode block (4KB)
+ |- data (929)
+ |- direct node (2)
+ | `- data (1018)
+ |- indirect node (2)
+ | `- direct node (1018)
+ | `- data (1018)
+ `- triple indirect node (1)
+ `- indirect node (1018)
+ `- direct node (1018)
+ `- data (1018)
+
+Note that, all the node blocks are mapped by NAT, which means the location of
+each node is translated by the NAT table. In the consideration of the wandering
+tree problem, F2FS is able to cut off the propagation of node updates caused by
+leaf data writes.
+
+Directory Structure
+-------------------
+
+A directory entry occupies 11 bytes, which consists of the following attributes.
+
+- hash hash value of the file name
+- ino inode number
+- len the length of file name
+- type file type such as directory, symlink, etc
+
+A dentry block consists of 214 dentry slots and file names. There-in bitmap is
+used to represent whether each dentry is valid or not. A dentry block occupies
+4KB with the following composition.
+
+ Dentry Block(4 K) = bitmap (27 bytes) + reserved (3 bytes) +
+ dentries(11 * 214 bytes) + file name (8 * 214 bytes)
+
+ [Bucket]
+ +--------------------------------+
+ |dentry block 1 | dentry block 2 |
+ +--------------------------------+
+ . .
+ . .
+ . [Dentry Block Structure: 4KB] .
+ +--------+----------+----------+------------+
+ | bitmap | reserved | dentries | file names |
+ +--------+----------+----------+------------+
+ [Dentry Block: 4KB] . .
+ . .
+ . .
+ +------+------+-----+------+
+ | hash | ino | len | type |
+ +------+------+-----+------+
+ [Dentry Structure: 11 bytes]
+
+F2FS implements multi-level hash tables for directory structure. Each level has
+a hash table with dedicated number of hash buckets as shown below. Note that,
+"A(2B)" means a bucket includes 2 data blocks.
+
+----------------------
+A : bucket
+B : block
+N : MAX_DIR_HASH_DEPTH
+----------------------
+
+level #0 | A(2B)
+ |
+level #1 | A(2B) - A(2B)
+ |
+level #2 | A(2B) - A(2B) - A(2B) - A(2B)
+ . | . . . .
+level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
+ . | . . . .
+level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
+
+The number of blocks and buckets are determined by,
+
+ ,- 2, if n < MAX_DIR_HASH_DEPTH / 2,
+ # of blocks in level #n = |
+ `- 4, Otherwise
+
+ ,- 2^n, if n < MAX_DIR_HASH_DEPTH / 2,
+ # of buckets in level #n = |
+ `- 2^((MAX_DIR_HASH_DEPTH / 2) - 1), Otherwise
+
+When F2FS finds a file name in a directory, at first a hash value of the file
+name is calculated. Then, F2FS scans the hash table in level #0 to find the
+dentry consisting of the file name and its inode number. If not found, F2FS
+scans the next hash table in level #1. In this way, F2FS scans hash tables in
+each levels incrementally from 1 to N. In each levels, F2FS needs to scan only
+one bucket determined by the follow equation, which shows O(log(# of files))
+complexity.
+
+ bucket number to scan in level #n = (hash value) % (# of buckets in level #n)
+
+In the case of file creation, F2FS finds an empty consecutive slots that covers
+the file name. F2FS searches the empty slots in the hash tables of whole levels
+from 1 to N in the same way as the lookup operation.
+
+The following figure shows an example of two cases holding children.
+ --------------> Dir <--------------
+ | |
+ child child
+
+ child - child [hole] - child
+
+ child - child - child [hole] - [hole] - child
+
+ Case 1: Case 2:
+ Number of children = 6, Number of children = 3,
+ File size = 7 File size = 7
+
+Default Block Allocation
+------------------------
+
+In runtime, F2FS manages six active logs inside "Main" area: Hot/Warm/Cold node
+and Hot/Warm/Cold data.
+
+- Hot node contains direct node blocks of directories.
+- Warm node contains direct node blocks except hot node blocks.
+- Cold node contains indirect node blocks
+- Hot data contains dentry blocks
+- Warm data contains data blocks except hot and cold data blocks
+- Cold data contains multimedia data or migrated data blocks
+
+LFS has two schemes for free space management: threaded log and copy-and-compac-
+tion. The copy-and-compaction scheme, aka cleaning, is well-suited for devices
+showing very good sequential write performance, since free segments are served
+all the time for writing new data. However, it suffers from cleaning overhead
+under high utilization. Contrarily, the threaded log scheme suffers from random
+writes, but no cleaning process is needed. F2FS adopts a hybrid scheme where the
+copy-and-compaction scheme is adopted by default, but the policy is dynamically
+changed to the threaded log scheme according to the file system status.
+
+In order to align F2FS with underlying flash-based storages, F2FS allocates a
+segment in a unit of section. F2FS expects that the section size would be the
+same as the unit size of garbage collection in FTL. Furthermore, with respect
+to the mapping granularity in FTL, F2FS allocates each sections of the active
+logs from different zones as much as possible, since FTL can write the data in
+the active logs into one allocation unit according to its mapping granularity.
+
+Cleaning process
+----------------
+
+F2FS does cleaning both on demand and in the background. On-demand cleaning is
+triggered when there are not enough free segments to serve VFS calls. Background
+cleaner is operated by a kernel thread, and triggers the cleaning job when the
+system is idle.
+
+F2FS supports two victim selection policies: greedy and cost-benefit algorithms.
+In greedy algorithm, F2FS selects a victim segment having the smallest number of
+valid blocks. In cost-benefit algorithm, F2FS selects a victim segment according
+to the segment age and the number of valid blocks in order to address log block
+thrashing problem in greedy algorithm. F2FS adopts greedy algorithm for on-demand
+cleaner, while background cleaner adopts cost-benefit algorithm.
+
+In order to identify what the data in the victim segment are valid or not, F2FS
+manages a bitmap. Each bit represents the validity of a block, and the bitmap is
+composed of a bit stream covering whole blocks in main area.
--
1.7.9.5




---
Jaegeuk Kim
Samsung



2012-10-05 19:56:46

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH 01/16] f2fs: add document

On 10/05/2012 04:56 AM, 김재극 wrote:

> This adds a document describing the mount options, proc entries, usage, and
> design of Flash-Friendly File System, namely F2FS.
>
> Signed-off-by: Jaegeuk Kim <[email protected]>
> ---
> Documentation/filesystems/00-INDEX | 2 +
> Documentation/filesystems/f2fs.txt | 314 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 316 insertions(+)
> create mode 100644 Documentation/filesystems/f2fs.txt
>
> diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
> index 8c624a1..ce5fd46 100644
> --- a/Documentation/filesystems/00-INDEX
> +++ b/Documentation/filesystems/00-INDEX
> @@ -48,6 +48,8 @@ ext4.txt
> - info, mount options and specifications for the Ext4 filesystem.
> files.txt
> - info on file management in the Linux kernel.
> +f2fs.txt
> + - info and mount options for the F2FS filesystem.
> fuse.txt
> - info on the Filesystem in User SpacE including mount options.
> gfs2.txt
> diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt
> new file mode 100644
> index 0000000..cd3f846
> --- /dev/null
> +++ b/Documentation/filesystems/f2fs.txt
> @@ -0,0 +1,314 @@
> +================================================================================
> +WHAT IS Flash-Friendly File System (F2FS)?
> +================================================================================
> +
> +NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
> +been widely being used for ranging from mobile to server systems. Since they are




been widely used for devices ranging from
or
been widely used for storage ranging from

> +known to have different characteristics from the conventional rotational disks,
> +a file system, an upper layer to the storage device, should adapt to the changes
> +from the sketch.




from the start.
?

> +
> +F2FS is a file system exploiting NAND flash memory-based storage devices, which
> +is based on Log-structured File System (LFS). The design has been focused on
> +addressing the fundamental issues in LFS, which are snowball effect of wandering
> +tree and high cleaning overhead.
> +
> +Since a NAND flash memory-based storage device shows different characteristic
> +according to its internal geometry or flash memory management scheme aka FTL,
> +F2FS and its tools support various parameters not only for configuring on-disk
> +layout, but also for selecting allocation and cleaning algorithms.
> +
> +The file system formatting tool, "mkfs.f2fs", is available from the following
> +download page: http://sourceforge.net/projects/f2fs-tools/
> +
> +================================================================================
> +MOUNT OPTIONS
> +================================================================================
> +
> +background_gc_off Turn off the cleaning operation, aka garbage collection,




Some people won't know what "aka" means, so how about:

Turn off the cleaning operation [garbage collection]

> + in background triggered when I/O subsystem is idle.
> +disable_roll_forward Disable the roll-forward recovery routine during SPOR.




what is SPOR?

> +discard Issue discard/TRIM commands when a segment is cleaned.
> +no_heap Disable heap-style segment allocation in which finds free




drop: in

> + segments for data from the beginning of main area, while
> + for node from the end of main area.
> +nouser_xattr Disable Extened User Attributes. Note: xattr is enabled




Extended

> + by default if CONFIG_F2FS_FS_XATTR is selected.
> +noacl Disable POSIX Access Control List. Note: acl is enabled
> + by default if CONFIG_F2FS_FS_POSIX_ACL is selected.
> +
> +================================================================================
> +PROC ENTRIES
> +================================================================================
> +
> +/proc/fs/f2fs/ contains information about partitions mounted as f2fs. For each
> +partition, a corresponding directory, named as its device name, is provided with
> +the following proc entries.
> +
> +- f2fs_stat major file system information managed by f2fs currently
> +- f2fs_sit_stat average SIT information about whole segments




what is SIT?

> +- f2fs_mem_stat current memory footprint consumed by f2fs
> +
> +e.g., in /proc/fs/f2fs/sdb1/
> +
> +================================================================================
> +USAGE
> +================================================================================
> +
> +1. Download userland tools
> +
> +2. Insmod f2fs.ko module:
> + # insmod f2fs.ko
> +
> +3. Check the directory trying to mount
> + # mkdir /mnt/f2fs
> +
> +4. Format the block device, and then mount as f2fs
> + # mkfs.f2fs -l label /dev/block_device
> + # mount -t f2fs /dev/block_device /mnt/f2fs
> +
> +================================================================================
> +DESIGN
> +================================================================================
> +
> +On-disk Layout
> +--------------
> +
> +F2FS divides whole volume into a number of segments each of which size is 2MB by




divides the whole volume into a number of segments, each of which is
2MB in size by default.

> +default. A section is composed of consecutive segments, and a zone consists of a
> +set of sections.
> +
> +F2FS maintains logically six log areas. Except SB, all the log areas are managed
> +in a unit of multiple segments. SB is located at the beggining of the partition,




beginning

> +and there exist two superblocks to avoid file system crash. Other file system
> +metadata such as CP, NAT, SIT, and SSA are located in front part of the volume.




in the front part

> +Main area contains file and directory data including their indices.
> +
> +Each area manages the following contents.
> +- CP File system information, bitmaps for valid NAT/SIT sets, orphan
> + inode lists, and summary entries of current active segments.
> +- NAT Block address table for all the node blocks stored in Main area.
> +- SIT Segment information such as valid block count and bitmap for the
> + validity of all the blocks.
> +- SSA Summary entries which contains the owner information of all the
> + data and node blocks stored in Main area.
> +- Main Node and data blocks.
> +
> +In order to avoid misalignment between file system and flash-based storage, F2FS
> +aligns the start block address of CP with the segment size. Also, it aligns the
> +start block address of Main area with the zone size by reserving some segments
> +in SSA area.
> +
> + align with the zone size <-|
> + |-> align with the segment size
> + _________________________________________________________________________
> + | | | Node | Segment | Segment | |
> + | Superblock | Checkpoint | Address | Info. | Summary | Main |
> + | (SB) | (CP) | Table (NAT) | Table (SIT) | Area (SSA) | |
> + |____________|_____2______|______N______|______N______|______N_____|__N___|
> + . .
> + . .
> + . .
> + ._________________________________________.
> + |_Segment_|_..._|_Segment_|_..._|_Segment_|
> + . .
> + ._________._________
> + |_section_|__...__|_
> + . .
> + .________.
> + |__zone__|
> +
> +
> +File System Metadata Structure
> +------------------------------
> +
> +F2FS adopts the checkpointing scheme to maintain file system consistency. At the




drop: the

> +mount time, F2FS first tries to find the last valid checkpoint data by scanning
> +CP area. In order to reduce the scanning time, F2FS uses only two copies of CP.
> +One of them always indicates the last valid data, which is called as shadow copy

> +mechanism. In addition to CP, NAT and SIT also adopts the shadow copy mechanism.




adopt

> +
> +For file system consistency, each CP points which NAT and SIT copies are valid,




points to which

> +as shown as below.
> +
> + +--------+----------+---------+
> + | CP | NAT | SIT |
> + +--------+----------+---------+
> + . . . .
> + . . . .
> + . . . .
> + +-------+-------+--------+--------+--------+--------+
> + | CP #0 | CP #1 | NAT #0 | NAT #1 | SIT #0 | SIT #1 |
> + +-------+-------+--------+--------+--------+--------+
> + | ^ ^
> + | | |
> + `----------------------------------------'
> +
> +Index Structure
> +---------------
> +
> +The key data structure to manage the data locations is a "node". As similar as




Similar to

> +traditional file structures, F2FS has three types of node: inode, direct node,
> +indirect node. F2FS assigns 4KB to an inode block where contains 929 data block




which

> +indices, two direct node pointers, two indirect node pointers, and one double
> +indirect node pointer as described below. One direct node block contains 1018
> +data blocks, and one indirect node block contains also 1018 node blocks. Thus,
> +One inode block (i.e., a file) covers:




one

> + 4KB * (929 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB.
> +
> + Inode block (4KB)
> + |- data (929)
> + |- direct node (2)
> + | `- data (1018)
> + |- indirect node (2)
> + | `- direct node (1018)
> + | `- data (1018)
> + `- triple indirect node (1)
> + `- indirect node (1018)
> + `- direct node (1018)
> + `- data (1018)
> +
> +Note that, all the node blocks are mapped by NAT, which means the location of




drop: ,

> +each node is translated by the NAT table. In the consideration of the wandering
> +tree problem, F2FS is able to cut off the propagation of node updates caused by
> +leaf data writes.
> +
> +Directory Structure
> +-------------------
> +
> +A directory entry occupies 11 bytes, which consists of the following attributes.
> +
> +- hash hash value of the file name
> +- ino inode number
> +- len the length of file name
> +- type file type such as directory, symlink, etc
> +
> +A dentry block consists of 214 dentry slots and file names. There-in bitmap is




maybe: Therein a bitmap is

> +used to represent whether each dentry is valid or not. A dentry block occupies
> +4KB with the following composition.
> +
> + Dentry Block(4 K) = bitmap (27 bytes) + reserved (3 bytes) +
> + dentries(11 * 214 bytes) + file name (8 * 214 bytes)
> +
> + [Bucket]
> + +--------------------------------+
> + |dentry block 1 | dentry block 2 |
> + +--------------------------------+
> + . .
> + . .
> + . [Dentry Block Structure: 4KB] .
> + +--------+----------+----------+------------+
> + | bitmap | reserved | dentries | file names |
> + +--------+----------+----------+------------+
> + [Dentry Block: 4KB] . .
> + . .
> + . .
> + +------+------+-----+------+
> + | hash | ino | len | type |
> + +------+------+-----+------+
> + [Dentry Structure: 11 bytes]
> +
> +F2FS implements multi-level hash tables for directory structure. Each level has
> +a hash table with dedicated number of hash buckets as shown below. Note that,




drop: ,

> +"A(2B)" means a bucket includes 2 data blocks.
> +
> +----------------------
> +A : bucket
> +B : block
> +N : MAX_DIR_HASH_DEPTH
> +----------------------
> +
> +level #0 | A(2B)
> + |
> +level #1 | A(2B) - A(2B)
> + |
> +level #2 | A(2B) - A(2B) - A(2B) - A(2B)
> + . | . . . .
> +level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
> + . | . . . .
> +level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
> +
> +The number of blocks and buckets are determined by,
> +
> + ,- 2, if n < MAX_DIR_HASH_DEPTH / 2,
> + # of blocks in level #n = |
> + `- 4, Otherwise
> +
> + ,- 2^n, if n < MAX_DIR_HASH_DEPTH / 2,
> + # of buckets in level #n = |
> + `- 2^((MAX_DIR_HASH_DEPTH / 2) - 1), Otherwise
> +

> +When F2FS finds a file name in a directory, at first a hash value of the file

> +name is calculated. Then, F2FS scans the hash table in level #0 to find the
> +dentry consisting of the file name and its inode number. If not found, F2FS
> +scans the next hash table in level #1. In this way, F2FS scans hash tables in
> +each levels incrementally from 1 to N. In each levels, F2FS needs to scan only


level level,

> +one bucket determined by the follow equation, which shows O(log(# of files))


following

> +complexity.
> +
> + bucket number to scan in level #n = (hash value) % (# of buckets in level #n)
> +
> +In the case of file creation, F2FS finds an empty consecutive slots that covers


drop: an that cover

> +the file name. F2FS searches the empty slots in the hash tables of whole levels
> +from 1 to N in the same way as the lookup operation.
> +
> +The following figure shows an example of two cases holding children.
> + --------------> Dir <--------------
> + | |
> + child child
> +
> + child - child [hole] - child
> +
> + child - child - child [hole] - [hole] - child
> +
> + Case 1: Case 2:
> + Number of children = 6, Number of children = 3,
> + File size = 7 File size = 7
> +
> +Default Block Allocation
> +------------------------
> +
> +In runtime, F2FS manages six active logs inside "Main" area: Hot/Warm/Cold node


At runtime,

> +and Hot/Warm/Cold data.
> +
> +- Hot node contains direct node blocks of directories.
> +- Warm node contains direct node blocks except hot node blocks.
> +- Cold node contains indirect node blocks
> +- Hot data contains dentry blocks
> +- Warm data contains data blocks except hot and cold data blocks
> +- Cold data contains multimedia data or migrated data blocks
> +
> +LFS has two schemes for free space management: threaded log and copy-and-compac-
> +tion. The copy-and-compaction scheme, aka cleaning, is well-suited for devices


"aka" usage is not nice IMO.

> +showing very good sequential write performance, since free segments are served
> +all the time for writing new data. However, it suffers from cleaning overhead
> +under high utilization. Contrarily, the threaded log scheme suffers from random
> +writes, but no cleaning process is needed. F2FS adopts a hybrid scheme where the
> +copy-and-compaction scheme is adopted by default, but the policy is dynamically
> +changed to the threaded log scheme according to the file system status.
> +
> +In order to align F2FS with underlying flash-based storages, F2FS allocates a


storage,

> +segment in a unit of section. F2FS expects that the section size would be the
> +same as the unit size of garbage collection in FTL. Furthermore, with respect
> +to the mapping granularity in FTL, F2FS allocates each sections of the active


section

> +logs from different zones as much as possible, since FTL can write the data in
> +the active logs into one allocation unit according to its mapping granularity.
> +
> +Cleaning process
> +----------------
> +
> +F2FS does cleaning both on demand and in the background. On-demand cleaning is
> +triggered when there are not enough free segments to serve VFS calls. Background
> +cleaner is operated by a kernel thread, and triggers the cleaning job when the
> +system is idle.
> +
> +F2FS supports two victim selection policies: greedy and cost-benefit algorithms.
> +In greedy algorithm, F2FS selects a victim segment having the smallest number of


In the greedy

> +valid blocks. In cost-benefit algorithm, F2FS selects a victim segment according


In the cost-benefit

> +to the segment age and the number of valid blocks in order to address log block
> +thrashing problem in greedy algorithm. F2FS adopts greedy algorithm for on-demand


in the greedy adopts the greedy


> +cleaner, while background cleaner adopts cost-benefit algorithm.


adopts the

> +
> +In order to identify what the data in the victim segment are valid or not, F2FS


if the data
or
whether the data

> +manages a bitmap. Each bit represents the validity of a block, and the bitmap is
> +composed of a bit stream covering whole blocks in main area.





--
~Randy

2012-10-06 07:19:16

by Stefan Hajnoczi

[permalink] [raw]
Subject: Re: [PATCH 01/16] f2fs: add document

> This adds a document describing the mount options, proc entries, usage, and
> design of Flash-Friendly File System, namely F2FS.
>
> Signed-off-by: Jaegeuk Kim <jaegeuk.kim <at> samsung.com>
> ---
> Documentation/filesystems/00-INDEX | 2 +
> Documentation/filesystems/f2fs.txt | 314 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 316 insertions(+)
> create mode 100644 Documentation/filesystems/f2fs.txt
>
> diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
> index 8c624a1..ce5fd46 100644
> --- a/Documentation/filesystems/00-INDEX
> +++ b/Documentation/filesystems/00-INDEX
> @@ -48,6 +48,8 @@ ext4.txt
> - info, mount options and specifications for the Ext4 filesystem.
> files.txt
> - info on file management in the Linux kernel.
> +f2fs.txt
> + - info and mount options for the F2FS filesystem.
> fuse.txt
> - info on the Filesystem in User SpacE including mount options.
> gfs2.txt
> diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt
> new file mode 100644
> index 0000000..cd3f846
> --- /dev/null
> +++ b/Documentation/filesystems/f2fs.txt
> @@ -0,0 +1,314 @@
> +================================================================================
> +WHAT IS Flash-Friendly File System (F2FS)?
> +================================================================================
> +
> +NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
> +been widely being used for ranging from mobile to server systems. Since they are
> +known to have different characteristics from the conventional rotational disks,
> +a file system, an upper layer to the storage device, should adapt to the changes
> +from the sketch.
> +
> +F2FS is a file system exploiting NAND flash memory-based storage devices, which
> +is based on Log-structured File System (LFS). The design has been focused on
> +addressing the fundamental issues in LFS, which are snowball effect of wandering
> +tree and high cleaning overhead.
> +
> +Since a NAND flash memory-based storage device shows different characteristic
> +according to its internal geometry or flash memory management scheme aka FTL,
> +F2FS and its tools support various parameters not only for configuring on-disk
> +layout, but also for selecting allocation and cleaning algorithms.

This is pretty high-level, can you list the main F2FS design points that are
optimized for NAND flash characteristics?

First I thought it's log-structured so it automatically performs write
wear-leveling. But F2FS is intended to be used on top of FTL? So the FTL
already handles that, and also it appears F2FS is a hybrid between append-only
and write in-place.

Who will choose "various parameters" and select "allocation and cleaning
algorithms" appropriate for the device? I wouldn't know what to parameter
values to use.

> +Index Structure
> +---------------
> +
> +The key data structure to manage the data locations is a "node". As similar as
> +traditional file structures, F2FS has three types of node: inode, direct node,
> +indirect node. F2FS assigns 4KB to an inode block where contains 929 data block
> +indices, two direct node pointers, two indirect node pointers, and one double
> +indirect node pointer as described below. One direct node block contains 1018
> +data blocks, and one indirect node block contains also 1018 node blocks. Thus,
> +One inode block (i.e., a file) covers:
> + 4KB * (929 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB.
> +
> + Inode block (4KB)
> + |- data (929)
> + |- direct node (2)
> + | `- data (1018)
> + |- indirect node (2)
> + | `- direct node (1018)
> + | `- data (1018)
> + `- triple indirect node (1)
> + `- indirect node (1018)
> + `- direct node (1018)
> + `- data (1018)

Earlier it says "one double indirect node pointer" but this diagram shows a
"triple indirect node". The diagram itself suggests this should really be
"double indirect node" because it points to 1 indirect node.

Stefan

2012-10-06 18:40:50

by Jaegeuk Kim

[permalink] [raw]
Subject: Re: [PATCH 01/16] f2fs: add document

I'll apply this.
Thanks,

2012-10-05 (금), 12:56 -0700, Randy Dunlap:
> On 10/05/2012 04:56 AM, 김재극 wrote:
>
> > This adds a document describing the mount options, proc entries, usage, and
> > design of Flash-Friendly File System, namely F2FS.
> >
> > Signed-off-by: Jaegeuk Kim <[email protected]>
> > ---
> > Documentation/filesystems/00-INDEX | 2 +
> > Documentation/filesystems/f2fs.txt | 314 ++++++++++++++++++++++++++++++++++++
> > 2 files changed, 316 insertions(+)
> > create mode 100644 Documentation/filesystems/f2fs.txt
> >
> > diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
> > index 8c624a1..ce5fd46 100644
> > --- a/Documentation/filesystems/00-INDEX
> > +++ b/Documentation/filesystems/00-INDEX
> > @@ -48,6 +48,8 @@ ext4.txt
> > - info, mount options and specifications for the Ext4 filesystem.
> > files.txt
> > - info on file management in the Linux kernel.
> > +f2fs.txt
> > + - info and mount options for the F2FS filesystem.
> > fuse.txt
> > - info on the Filesystem in User SpacE including mount options.
> > gfs2.txt
> > diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt
> > new file mode 100644
> > index 0000000..cd3f846
> > --- /dev/null
> > +++ b/Documentation/filesystems/f2fs.txt
> > @@ -0,0 +1,314 @@
> > +================================================================================
> > +WHAT IS Flash-Friendly File System (F2FS)?
> > +================================================================================
> > +
> > +NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
> > +been widely being used for ranging from mobile to server systems. Since they are
>
>
>
>
> been widely used for devices ranging from
> or
> been widely used for storage ranging from
>
> > +known to have different characteristics from the conventional rotational disks,
> > +a file system, an upper layer to the storage device, should adapt to the changes
> > +from the sketch.
>
>
>
>
> from the start.
> ?
>
> > +
> > +F2FS is a file system exploiting NAND flash memory-based storage devices, which
> > +is based on Log-structured File System (LFS). The design has been focused on
> > +addressing the fundamental issues in LFS, which are snowball effect of wandering
> > +tree and high cleaning overhead.
> > +
> > +Since a NAND flash memory-based storage device shows different characteristic
> > +according to its internal geometry or flash memory management scheme aka FTL,
> > +F2FS and its tools support various parameters not only for configuring on-disk
> > +layout, but also for selecting allocation and cleaning algorithms.
> > +
> > +The file system formatting tool, "mkfs.f2fs", is available from the following
> > +download page: http://sourceforge.net/projects/f2fs-tools/
> > +
> > +================================================================================
> > +MOUNT OPTIONS
> > +================================================================================
> > +
> > +background_gc_off Turn off the cleaning operation, aka garbage collection,
>
>
>
>
> Some people won't know what "aka" means, so how about:
>
> Turn off the cleaning operation [garbage collection]
>
> > + in background triggered when I/O subsystem is idle.
> > +disable_roll_forward Disable the roll-forward recovery routine during SPOR.
>
>
>
>
> what is SPOR?
>
> > +discard Issue discard/TRIM commands when a segment is cleaned.
> > +no_heap Disable heap-style segment allocation in which finds free
>
>
>
>
> drop: in
>
> > + segments for data from the beginning of main area, while
> > + for node from the end of main area.
> > +nouser_xattr Disable Extened User Attributes. Note: xattr is enabled
>
>
>
>
> Extended
>
> > + by default if CONFIG_F2FS_FS_XATTR is selected.
> > +noacl Disable POSIX Access Control List. Note: acl is enabled
> > + by default if CONFIG_F2FS_FS_POSIX_ACL is selected.
> > +
> > +================================================================================
> > +PROC ENTRIES
> > +================================================================================
> > +
> > +/proc/fs/f2fs/ contains information about partitions mounted as f2fs. For each
> > +partition, a corresponding directory, named as its device name, is provided with
> > +the following proc entries.
> > +
> > +- f2fs_stat major file system information managed by f2fs currently
> > +- f2fs_sit_stat average SIT information about whole segments
>
>
>
>
> what is SIT?
>
> > +- f2fs_mem_stat current memory footprint consumed by f2fs
> > +
> > +e.g., in /proc/fs/f2fs/sdb1/
> > +
> > +================================================================================
> > +USAGE
> > +================================================================================
> > +
> > +1. Download userland tools
> > +
> > +2. Insmod f2fs.ko module:
> > + # insmod f2fs.ko
> > +
> > +3. Check the directory trying to mount
> > + # mkdir /mnt/f2fs
> > +
> > +4. Format the block device, and then mount as f2fs
> > + # mkfs.f2fs -l label /dev/block_device
> > + # mount -t f2fs /dev/block_device /mnt/f2fs
> > +
> > +================================================================================
> > +DESIGN
> > +================================================================================
> > +
> > +On-disk Layout
> > +--------------
> > +
> > +F2FS divides whole volume into a number of segments each of which size is 2MB by
>
>
>
>
> divides the whole volume into a number of segments, each of which is
> 2MB in size by default.
>
> > +default. A section is composed of consecutive segments, and a zone consists of a
> > +set of sections.
> > +
> > +F2FS maintains logically six log areas. Except SB, all the log areas are managed
> > +in a unit of multiple segments. SB is located at the beggining of the partition,
>
>
>
>
> beginning
>
> > +and there exist two superblocks to avoid file system crash. Other file system
> > +metadata such as CP, NAT, SIT, and SSA are located in front part of the volume.
>
>
>
>
> in the front part
>
> > +Main area contains file and directory data including their indices.
> > +
> > +Each area manages the following contents.
> > +- CP File system information, bitmaps for valid NAT/SIT sets, orphan
> > + inode lists, and summary entries of current active segments.
> > +- NAT Block address table for all the node blocks stored in Main area.
> > +- SIT Segment information such as valid block count and bitmap for the
> > + validity of all the blocks.
> > +- SSA Summary entries which contains the owner information of all the
> > + data and node blocks stored in Main area.
> > +- Main Node and data blocks.
> > +
> > +In order to avoid misalignment between file system and flash-based storage, F2FS
> > +aligns the start block address of CP with the segment size. Also, it aligns the
> > +start block address of Main area with the zone size by reserving some segments
> > +in SSA area.
> > +
> > + align with the zone size <-|
> > + |-> align with the segment size
> > + _________________________________________________________________________
> > + | | | Node | Segment | Segment | |
> > + | Superblock | Checkpoint | Address | Info. | Summary | Main |
> > + | (SB) | (CP) | Table (NAT) | Table (SIT) | Area (SSA) | |
> > + |____________|_____2______|______N______|______N______|______N_____|__N___|
> > + . .
> > + . .
> > + . .
> > + ._________________________________________.
> > + |_Segment_|_..._|_Segment_|_..._|_Segment_|
> > + . .
> > + ._________._________
> > + |_section_|__...__|_
> > + . .
> > + .________.
> > + |__zone__|
> > +
> > +
> > +File System Metadata Structure
> > +------------------------------
> > +
> > +F2FS adopts the checkpointing scheme to maintain file system consistency. At the
>
>
>
>
> drop: the
>
> > +mount time, F2FS first tries to find the last valid checkpoint data by scanning
> > +CP area. In order to reduce the scanning time, F2FS uses only two copies of CP.
> > +One of them always indicates the last valid data, which is called as shadow copy
>
> > +mechanism. In addition to CP, NAT and SIT also adopts the shadow copy mechanism.
>
>
>
>
> adopt
>
> > +
> > +For file system consistency, each CP points which NAT and SIT copies are valid,
>
>
>
>
> points to which
>
> > +as shown as below.
> > +
> > + +--------+----------+---------+
> > + | CP | NAT | SIT |
> > + +--------+----------+---------+
> > + . . . .
> > + . . . .
> > + . . . .
> > + +-------+-------+--------+--------+--------+--------+
> > + | CP #0 | CP #1 | NAT #0 | NAT #1 | SIT #0 | SIT #1 |
> > + +-------+-------+--------+--------+--------+--------+
> > + | ^ ^
> > + | | |
> > + `----------------------------------------'
> > +
> > +Index Structure
> > +---------------
> > +
> > +The key data structure to manage the data locations is a "node". As similar as
>
>
>
>
> Similar to
>
> > +traditional file structures, F2FS has three types of node: inode, direct node,
> > +indirect node. F2FS assigns 4KB to an inode block where contains 929 data block
>
>
>
>
> which
>
> > +indices, two direct node pointers, two indirect node pointers, and one double
> > +indirect node pointer as described below. One direct node block contains 1018
> > +data blocks, and one indirect node block contains also 1018 node blocks. Thus,
> > +One inode block (i.e., a file) covers:
>
>
>
>
> one
>
> > + 4KB * (929 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB.
> > +
> > + Inode block (4KB)
> > + |- data (929)
> > + |- direct node (2)
> > + | `- data (1018)
> > + |- indirect node (2)
> > + | `- direct node (1018)
> > + | `- data (1018)
> > + `- triple indirect node (1)
> > + `- indirect node (1018)
> > + `- direct node (1018)
> > + `- data (1018)
> > +
> > +Note that, all the node blocks are mapped by NAT, which means the location of
>
>
>
>
> drop: ,
>
> > +each node is translated by the NAT table. In the consideration of the wandering
> > +tree problem, F2FS is able to cut off the propagation of node updates caused by
> > +leaf data writes.
> > +
> > +Directory Structure
> > +-------------------
> > +
> > +A directory entry occupies 11 bytes, which consists of the following attributes.
> > +
> > +- hash hash value of the file name
> > +- ino inode number
> > +- len the length of file name
> > +- type file type such as directory, symlink, etc
> > +
> > +A dentry block consists of 214 dentry slots and file names. There-in bitmap is
>
>
>
>
> maybe: Therein a bitmap is
>
> > +used to represent whether each dentry is valid or not. A dentry block occupies
> > +4KB with the following composition.
> > +
> > + Dentry Block(4 K) = bitmap (27 bytes) + reserved (3 bytes) +
> > + dentries(11 * 214 bytes) + file name (8 * 214 bytes)
> > +
> > + [Bucket]
> > + +--------------------------------+
> > + |dentry block 1 | dentry block 2 |
> > + +--------------------------------+
> > + . .
> > + . .
> > + . [Dentry Block Structure: 4KB] .
> > + +--------+----------+----------+------------+
> > + | bitmap | reserved | dentries | file names |
> > + +--------+----------+----------+------------+
> > + [Dentry Block: 4KB] . .
> > + . .
> > + . .
> > + +------+------+-----+------+
> > + | hash | ino | len | type |
> > + +------+------+-----+------+
> > + [Dentry Structure: 11 bytes]
> > +
> > +F2FS implements multi-level hash tables for directory structure. Each level has
> > +a hash table with dedicated number of hash buckets as shown below. Note that,
>
>
>
>
> drop: ,
>
> > +"A(2B)" means a bucket includes 2 data blocks.
> > +
> > +----------------------
> > +A : bucket
> > +B : block
> > +N : MAX_DIR_HASH_DEPTH
> > +----------------------
> > +
> > +level #0 | A(2B)
> > + |
> > +level #1 | A(2B) - A(2B)
> > + |
> > +level #2 | A(2B) - A(2B) - A(2B) - A(2B)
> > + . | . . . .
> > +level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
> > + . | . . . .
> > +level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
> > +
> > +The number of blocks and buckets are determined by,
> > +
> > + ,- 2, if n < MAX_DIR_HASH_DEPTH / 2,
> > + # of blocks in level #n = |
> > + `- 4, Otherwise
> > +
> > + ,- 2^n, if n < MAX_DIR_HASH_DEPTH / 2,
> > + # of buckets in level #n = |
> > + `- 2^((MAX_DIR_HASH_DEPTH / 2) - 1), Otherwise
> > +
>
> > +When F2FS finds a file name in a directory, at first a hash value of the file
>
> > +name is calculated. Then, F2FS scans the hash table in level #0 to find the
> > +dentry consisting of the file name and its inode number. If not found, F2FS
> > +scans the next hash table in level #1. In this way, F2FS scans hash tables in
> > +each levels incrementally from 1 to N. In each levels, F2FS needs to scan only
>
>
> level level,
>
> > +one bucket determined by the follow equation, which shows O(log(# of files))
>
>
> following
>
> > +complexity.
> > +
> > + bucket number to scan in level #n = (hash value) % (# of buckets in level #n)
> > +
> > +In the case of file creation, F2FS finds an empty consecutive slots that covers
>
>
> drop: an that cover
>
> > +the file name. F2FS searches the empty slots in the hash tables of whole levels
> > +from 1 to N in the same way as the lookup operation.
> > +
> > +The following figure shows an example of two cases holding children.
> > + --------------> Dir <--------------
> > + | |
> > + child child
> > +
> > + child - child [hole] - child
> > +
> > + child - child - child [hole] - [hole] - child
> > +
> > + Case 1: Case 2:
> > + Number of children = 6, Number of children = 3,
> > + File size = 7 File size = 7
> > +
> > +Default Block Allocation
> > +------------------------
> > +
> > +In runtime, F2FS manages six active logs inside "Main" area: Hot/Warm/Cold node
>
>
> At runtime,
>
> > +and Hot/Warm/Cold data.
> > +
> > +- Hot node contains direct node blocks of directories.
> > +- Warm node contains direct node blocks except hot node blocks.
> > +- Cold node contains indirect node blocks
> > +- Hot data contains dentry blocks
> > +- Warm data contains data blocks except hot and cold data blocks
> > +- Cold data contains multimedia data or migrated data blocks
> > +
> > +LFS has two schemes for free space management: threaded log and copy-and-compac-
> > +tion. The copy-and-compaction scheme, aka cleaning, is well-suited for devices
>
>
> "aka" usage is not nice IMO.
>
> > +showing very good sequential write performance, since free segments are served
> > +all the time for writing new data. However, it suffers from cleaning overhead
> > +under high utilization. Contrarily, the threaded log scheme suffers from random
> > +writes, but no cleaning process is needed. F2FS adopts a hybrid scheme where the
> > +copy-and-compaction scheme is adopted by default, but the policy is dynamically
> > +changed to the threaded log scheme according to the file system status.
> > +
> > +In order to align F2FS with underlying flash-based storages, F2FS allocates a
>
>
> storage,
>
> > +segment in a unit of section. F2FS expects that the section size would be the
> > +same as the unit size of garbage collection in FTL. Furthermore, with respect
> > +to the mapping granularity in FTL, F2FS allocates each sections of the active
>
>
> section
>
> > +logs from different zones as much as possible, since FTL can write the data in
> > +the active logs into one allocation unit according to its mapping granularity.
> > +
> > +Cleaning process
> > +----------------
> > +
> > +F2FS does cleaning both on demand and in the background. On-demand cleaning is
> > +triggered when there are not enough free segments to serve VFS calls. Background
> > +cleaner is operated by a kernel thread, and triggers the cleaning job when the
> > +system is idle.
> > +
> > +F2FS supports two victim selection policies: greedy and cost-benefit algorithms.
> > +In greedy algorithm, F2FS selects a victim segment having the smallest number of
>
>
> In the greedy
>
> > +valid blocks. In cost-benefit algorithm, F2FS selects a victim segment according
>
>
> In the cost-benefit
>
> > +to the segment age and the number of valid blocks in order to address log block
> > +thrashing problem in greedy algorithm. F2FS adopts greedy algorithm for on-demand
>
>
> in the greedy adopts the greedy
>
>
> > +cleaner, while background cleaner adopts cost-benefit algorithm.
>
>
> adopts the
>
> > +
> > +In order to identify what the data in the victim segment are valid or not, F2FS
>
>
> if the data
> or
> whether the data
>
> > +manages a bitmap. Each bit represents the validity of a block, and the bitmap is
> > +composed of a bit stream covering whole blocks in main area.
>
>
>
>
>

--
Jaegeuk Kim
Samsung

2012-10-06 19:40:50

by Jaegeuk Kim

[permalink] [raw]
Subject: Re: [PATCH 01/16] f2fs: add document

2012-10-06 (토), 09:19 +0200, Stefan Hajnoczi:
> > This adds a document describing the mount options, proc entries, usage, and
> > design of Flash-Friendly File System, namely F2FS.
> >
> > Signed-off-by: Jaegeuk Kim <jaegeuk.kim <at> samsung.com>
> > ---
> > Documentation/filesystems/00-INDEX | 2 +
> > Documentation/filesystems/f2fs.txt | 314 ++++++++++++++++++++++++++++++++++++
> > 2 files changed, 316 insertions(+)
> > create mode 100644 Documentation/filesystems/f2fs.txt
> >
> > diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
> > index 8c624a1..ce5fd46 100644
> > --- a/Documentation/filesystems/00-INDEX
> > +++ b/Documentation/filesystems/00-INDEX
> > @@ -48,6 +48,8 @@ ext4.txt
> > - info, mount options and specifications for the Ext4 filesystem.
> > files.txt
> > - info on file management in the Linux kernel.
> > +f2fs.txt
> > + - info and mount options for the F2FS filesystem.
> > fuse.txt
> > - info on the Filesystem in User SpacE including mount options.
> > gfs2.txt
> > diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt
> > new file mode 100644
> > index 0000000..cd3f846
> > --- /dev/null
> > +++ b/Documentation/filesystems/f2fs.txt
> > @@ -0,0 +1,314 @@
> > +================================================================================
> > +WHAT IS Flash-Friendly File System (F2FS)?
> > +================================================================================
> > +
> > +NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
> > +been widely being used for ranging from mobile to server systems. Since they are
> > +known to have different characteristics from the conventional rotational disks,
> > +a file system, an upper layer to the storage device, should adapt to the changes
> > +from the sketch.
> > +
> > +F2FS is a file system exploiting NAND flash memory-based storage devices, which
> > +is based on Log-structured File System (LFS). The design has been focused on
> > +addressing the fundamental issues in LFS, which are snowball effect of wandering
> > +tree and high cleaning overhead.
> > +
> > +Since a NAND flash memory-based storage device shows different characteristic
> > +according to its internal geometry or flash memory management scheme aka FTL,
> > +F2FS and its tools support various parameters not only for configuring on-disk
> > +layout, but also for selecting allocation and cleaning algorithms.
>
> This is pretty high-level, can you list the main F2FS design points that are
> optimized for NAND flash characteristics?

Ok, I'll summarize and supplement some major features in v2.

>
> First I thought it's log-structured so it automatically performs write
> wear-leveling. But F2FS is intended to be used on top of FTL?

Yes, F2FS works on top of FTL. The main goal of f2fs is to produce IOs
optimized to FTL as a best effort. So, I expect that F2FS would mitigate
the wear-out problem in FTL helpfully.

> So the FTL
> already handles that, and also it appears F2FS is a hybrid between append-only
> and write in-place.

Yes, F2FS is a hybrid, but in-place-writes will be occurred only when
there is not enough free space. So normally, it happens append-only.

Anyway, do you concern the trade-off between wear-leveling and
in-place-writes?
IMHO, for better wear-leveling in FTL, file system cannot help writing
data in append-only as much as possible and aligning management units
between fs and FTL. I think that would be sufficiently effective to
reduce the wear-leveling overhead.

>
> Who will choose "various parameters" and select "allocation and cleaning
> algorithms" appropriate for the device?

I think that the parameters will be determined mainly by system vendors.
They select flash chips for products so that they able to know also what
parameters are preferable from the chip vendor.
But, I think the default parameters are common enoughly for usual
environment.

On the other hand, configuring allocation and cleaning algorithms will
be useful to researchers. It is obivous that the default algorithms may
not be best according to various workloads, so I'd like to give a
facility to easily add any new algorithms.

> I wouldn't know what to parameter
> values to use.
>

You can find the parameters from mount/format options in detail.

> > +Index Structure
> > +---------------
> > +
> > +The key data structure to manage the data locations is a "node". As similar as
> > +traditional file structures, F2FS has three types of node: inode, direct node,
> > +indirect node. F2FS assigns 4KB to an inode block where contains 929 data block
> > +indices, two direct node pointers, two indirect node pointers, and one double
> > +indirect node pointer as described below. One direct node block contains 1018
> > +data blocks, and one indirect node block contains also 1018 node blocks. Thus,
> > +One inode block (i.e., a file) covers:
> > + 4KB * (929 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB.
> > +
> > + Inode block (4KB)
> > + |- data (929)
> > + |- direct node (2)
> > + | `- data (1018)
> > + |- indirect node (2)
> > + | `- direct node (1018)
> > + | `- data (1018)
> > + `- triple indirect node (1)
> > + `- indirect node (1018)
> > + `- direct node (1018)
> > + `- data (1018)
>
> Earlier it says "one double indirect node pointer" but this diagram shows a
> "triple indirect node". The diagram itself suggests this should really be
> "double indirect node" because it points to 1 indirect node.

Nice catch.
I'll correct this.
Thanks,

>
> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Jaegeuk Kim
Samsung

2012-10-07 06:20:58

by Stefan Hajnoczi

[permalink] [raw]
Subject: Re: [PATCH 01/16] f2fs: add document

On Sat, Oct 6, 2012 at 9:40 PM, Jaegeuk Kim <[email protected]> wrote:
> 2012-10-06 (토), 09:19 +0200, Stefan Hajnoczi:
>> > This adds a document describing the mount options, proc entries, usage, and
>> > design of Flash-Friendly File System, namely F2FS.
>> >
>> > Signed-off-by: Jaegeuk Kim <jaegeuk.kim <at> samsung.com>
>> > ---
>> > Documentation/filesystems/00-INDEX | 2 +
>> > Documentation/filesystems/f2fs.txt | 314 ++++++++++++++++++++++++++++++++++++
>> > 2 files changed, 316 insertions(+)
>> > create mode 100644 Documentation/filesystems/f2fs.txt
>> >
>> > diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
>> > index 8c624a1..ce5fd46 100644
>> > --- a/Documentation/filesystems/00-INDEX
>> > +++ b/Documentation/filesystems/00-INDEX
>> > @@ -48,6 +48,8 @@ ext4.txt
>> > - info, mount options and specifications for the Ext4 filesystem.
>> > files.txt
>> > - info on file management in the Linux kernel.
>> > +f2fs.txt
>> > + - info and mount options for the F2FS filesystem.
>> > fuse.txt
>> > - info on the Filesystem in User SpacE including mount options.
>> > gfs2.txt
>> > diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt
>> > new file mode 100644
>> > index 0000000..cd3f846
>> > --- /dev/null
>> > +++ b/Documentation/filesystems/f2fs.txt
>> > @@ -0,0 +1,314 @@
>> > +================================================================================
>> > +WHAT IS Flash-Friendly File System (F2FS)?
>> > +================================================================================
>> > +
>> > +NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
>> > +been widely being used for ranging from mobile to server systems. Since they are
>> > +known to have different characteristics from the conventional rotational disks,
>> > +a file system, an upper layer to the storage device, should adapt to the changes
>> > +from the sketch.
>> > +
>> > +F2FS is a file system exploiting NAND flash memory-based storage devices, which
>> > +is based on Log-structured File System (LFS). The design has been focused on
>> > +addressing the fundamental issues in LFS, which are snowball effect of wandering
>> > +tree and high cleaning overhead.
>> > +
>> > +Since a NAND flash memory-based storage device shows different characteristic
>> > +according to its internal geometry or flash memory management scheme aka FTL,
>> > +F2FS and its tools support various parameters not only for configuring on-disk
>> > +layout, but also for selecting allocation and cleaning algorithms.
>>
>> This is pretty high-level, can you list the main F2FS design points that are
>> optimized for NAND flash characteristics?
>
> Ok, I'll summarize and supplement some major features in v2.

Thanks!

>>
>> First I thought it's log-structured so it automatically performs write
>> wear-leveling. But F2FS is intended to be used on top of FTL?
>
> Yes, F2FS works on top of FTL. The main goal of f2fs is to produce IOs
> optimized to FTL as a best effort. So, I expect that F2FS would mitigate
> the wear-out problem in FTL helpfully.
>
>> So the FTL
>> already handles that, and also it appears F2FS is a hybrid between append-only
>> and write in-place.
>
> Yes, F2FS is a hybrid, but in-place-writes will be occurred only when
> there is not enough free space. So normally, it happens append-only.
>
> Anyway, do you concern the trade-off between wear-leveling and
> in-place-writes?
> IMHO, for better wear-leveling in FTL, file system cannot help writing
> data in append-only as much as possible and aligning management units
> between fs and FTL. I think that would be sufficiently effective to
> reduce the wear-leveling overhead.

Okay, thanks for explaining. Sounds like F2FS makes it easy for the
FTL without taking over the responsibility of wear-leveling
completely.

>>
>> Who will choose "various parameters" and select "allocation and cleaning
>> algorithms" appropriate for the device?
>
> I think that the parameters will be determined mainly by system vendors.
> They select flash chips for products so that they able to know also what
> parameters are preferable from the chip vendor.
> But, I think the default parameters are common enoughly for usual
> environment.

If the default parameters are usually fine then great. I was thinking
about formatting existing SD cards with F2FS.

Stefan

2012-10-13 18:36:03

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 01/16] f2fs: add document

On Friday 05 October 2012 20:56:06 김재극 wrote:
> This adds a document describing the mount options, proc entries, usage, and
> design of Flash-Friendly File System, namely F2FS.

Nice document!

This is a lot like the design I came up with for an experimental file system
last year, so I'll try to go through and talk about the differences to
what I had. I believe there are a few points where you are missing
optimizations that I had and that could be added.

> +================================================================================
> +PROC ENTRIES
> +================================================================================
> +
> +/proc/fs/f2fs/ contains information about partitions mounted as f2fs. For each
> +partition, a corresponding directory, named as its device name, is provided with
> +the following proc entries.
> +
> +- f2fs_stat major file system information managed by f2fs currently
> +- f2fs_sit_stat average SIT information about whole segments
> +- f2fs_mem_stat current memory footprint consumed by f2fs
> +
> +e.g., in /proc/fs/f2fs/sdb1/

Unrelated to the file system design, these should not be part of procfs at
all. You can probably convert them to debugfs files easily though and perhaps
even simplify the implementation in the process (I haven't looked at how it's
done at the moment).

> +================================================================================
> +DESIGN
> +================================================================================
> +
> +On-disk Layout
> +--------------
> +
> +F2FS divides whole volume into a number of segments each of which size is 2MB by
> +default. A section is composed of consecutive segments, and a zone consists of a
> +set of sections.

A segment size of 2MB should work fine for most eMMC, but there are a lot of
devices using TLC (three bit per cell MLC) flash with 1.5/3/6/12 MB erase blocks,
in particular cheap SD cards. Your design fundamentally won't work on those if you
only allow power-of-two multiples of 2 MB.

Would it be possible to change the segment size to 1.5 MB as a mkfs option?
I realize that the space utilization would be slightly worse in that case,
but if we allow power-of-two multiples of either 1.5 or 2 MB, that should cover
almost all media on the market today, and most likely for the forseeable
future.

> +F2FS maintains logically six log areas. Except SB, all the log areas are managed
> +in a unit of multiple segments. SB is located at the beggining of the partition,
> +and there exist two superblocks to avoid file system crash. Other file system
> +metadata such as CP, NAT, SIT, and SSA are located in front part of the volume.
> +Main area contains file and directory data including their indices.

This is almost identical to what I had come up with.

The main difference is the number of log areas. Given that you need seven erase
blocks (six log areas, plus a random access area), quite a number of devices
are excluded from working ideally. Making use of six log areas is definitely
a win if you manage to split the data well between hot/warm/cold blocks, but
on devices that can only handle 5, this is also clearly a loss.

I understand that all Samsung eMMC handles this well, and so would the Samsung
Class 10 'Plus' SDHC cards. The Samsung Class 4 'Essential' cards can only do
2 blocks of 6 MB plus the FAT area that you need for random access, so that
won't work well with your current design.

Excluding some of your main competitor's eMMC seems more problematic. My view
is that if we include the file system, it should fundamentally work across
all devices in the industry. It's definitely ok to use 6 log areas by default
to optimize for sane devices like yours, but please consider offering at
least using just 4 log areas as an mkfs option in order to include all other
eMMC and the main SD card brands (many Sandisk, Lexar, and a bunch of the
noname ones). I don't think it's worth including the Toshiba/Phison/Kingston
SD card controllers though, as they are basically broken for use with a
proper file system anyway and attempting to support a single log area (plus
random area in the front) would require tradeoffs that sacrifice performance
on proper devices.

> +Each area manages the following contents.
> +- CP File system information, bitmaps for valid NAT/SIT sets, orphan
> + inode lists, and summary entries of current active segments.
> +- NAT Block address table for all the node blocks stored in Main area.
> +- SIT Segment information such as valid block count and bitmap for the
> + validity of all the blocks.
> +- SSA Summary entries which contains the owner information of all the
> + data and node blocks stored in Main area.
> +- Main Node and data blocks.
> +
> +In order to avoid misalignment between file system and flash-based storage, F2FS
> +aligns the start block address of CP with the segment size. Also, it aligns the
> +start block address of Main area with the zone size by reserving some segments
> +in SSA area.

What do you do if the partition itself is misaligned? Do you align the start block
address to the start of the partition or the start of the device?

> +File System Metadata Structure
> +------------------------------
> +
> +F2FS adopts the checkpointing scheme to maintain file system consistency. At the
> +mount time, F2FS first tries to find the last valid checkpoint data by scanning
> +CP area. In order to reduce the scanning time, F2FS uses only two copies of CP.
> +One of them always indicates the last valid data, which is called as shadow copy
> +mechanism. In addition to CP, NAT and SIT also adopts the shadow copy mechanism.
> +
> +For file system consistency, each CP points which NAT and SIT copies are valid,
> +as shown as below.
> +
> + +--------+----------+---------+
> + | CP | NAT | SIT |
> + +--------+----------+---------+
> + . . . .
> + . . . .
> + . . . .
> + +-------+-------+--------+--------+--------+--------+
> + | CP #0 | CP #1 | NAT #0 | NAT #1 | SIT #0 | SIT #1 |
> + +-------+-------+--------+--------+--------+--------+
> + | ^ ^
> + | | |
> + `----------------------------------------'

There is one small difference to my design: instead of having the checkpoint area
in the front, the idea was to only record a pointer to the currently used segments
in the global metadata and treat the segment that carries the inodes as a journal.
The main advantage of that is that you only have to write a single block to
atomically update an inode, possibly including embedded data, while you always
have to write the inode first, and then make an update to the CP area to make
it visible.

> +Index Structure
> +---------------
> +
> +The key data structure to manage the data locations is a "node". As similar as
> +traditional file structures, F2FS has three types of node: inode, direct node,
> +indirect node. F2FS assigns 4KB to an inode block where contains 929 data block
> +indices, two direct node pointers, two indirect node pointers, and one double
> +indirect node pointer as described below. One direct node block contains 1018
> +data blocks, and one indirect node block contains also 1018 node blocks. Thus,
> +One inode block (i.e., a file) covers:
> + 4KB * (929 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB.
> +
> + Inode block (4KB)
> + |- data (929)
> + |- direct node (2)
> + | `- data (1018)
> + |- indirect node (2)
> + | `- direct node (1018)
> + | `- data (1018)
> + `- triple indirect node (1)
> + `- indirect node (1018)
> + `- direct node (1018)
> + `- data (1018)

You use two different naming schemes here: you call the last-level indirect
block the "double indirect" node pointer in the text, but the "triple
indirect" node pointer in the illustration.

> +Note that, all the node blocks are mapped by NAT, which means the location of
> +each node is translated by the NAT table. In the consideration of the wandering
> +tree problem, F2FS is able to cut off the propagation of node updates caused by
> +leaf data writes.

Ah, that is very clever!

My solution to the same problem was to have a gigantic fan-out in the inode
in order to only ever need one level of indirection:

* I make the cluster that is addressed by a pointer larger than 4k by
aligning it to e.g. the flash page size, typically 16KB to 64KB.
* in a 64KB inode, you can fit some 16000 pointers
* each pointer then points to a cluster of that size, so you get up to
64KB * 16000 = 1GB file size with this method on 64 KB clusters, or
16KB * 4000 = 64 MB file size on 16 KB clusters, or just under 4 MB
file size on 4 KB clusters.
* To allow larger files, a flag in the inode signifies that the pointers
address entire segments instead of clusters, which allows files up
to the partition size even with 4 KB clusters on most media (erase block
size is usually coupled to drive size)

To compare the two approaches:

+ f2fs does not need a clever algorithm to determine when to pick a
segment indirect layout over a cluster indirect, and does not need
to copy data when a file grows over the cut-off
+ f2fs does not waste an average of half a segment for large files.
- large files are less efficient to read because they are not in
contiguous segment chunks
- synchronous updates of indirect files require four writes (data, indirect
block, inode, NAT) rather than just two as in my approach (data,
inode journal).
- f2fs has no embedded data in very small files, though this should be easy
to add.
+/- random writes in database files get turned into log-structured writes,
which is often more efficient to write, but also means more fragmentation
and the user application cannot optimize the accesses for the underlying
storage as a lot of databases do.

> +Default Block Allocation
> +------------------------
> +
> +In runtime, F2FS manages six active logs inside "Main" area: Hot/Warm/Cold node
> +and Hot/Warm/Cold data.
> +
> +- Hot node contains direct node blocks of directories.
> +- Warm node contains direct node blocks except hot node blocks.
> +- Cold node contains indirect node blocks
> +- Hot data contains dentry blocks
> +- Warm data contains data blocks except hot and cold data blocks
> +- Cold data contains multimedia data or migrated data blocks

This categorization looks like you could easily reduce the number of active
logs if the hardware cannot handle all six, e.g.:

1 hardware segment: not easily possible, I suppose
2 segments: node + data
3 segments: node + hot data + warm/cold data
4 segments: hot node + warm/cold node + hot data + warm/cold data
5 segments: hot node + warm/cold node + hot data + warm data + cold data

> +LFS has two schemes for free space management: threaded log and copy-and-compac-
> +tion. The copy-and-compaction scheme, aka cleaning, is well-suited for devices
> +showing very good sequential write performance, since free segments are served
> +all the time for writing new data. However, it suffers from cleaning overhead
> +under high utilization. Contrarily, the threaded log scheme suffers from random
> +writes, but no cleaning process is needed. F2FS adopts a hybrid scheme where the
> +copy-and-compaction scheme is adopted by default, but the policy is dynamically
> +changed to the threaded log scheme according to the file system status.

Compared to my design, this uses the exact same two methods, but applies them
in different scenarios: I would always use the copy-and-compaction method
for small and medium sized files (up to a few megabytes) but use in-place
overwriting in large files, similar to (but not the same as) your threaded
log.

Using the threaded log sounds like a good idea to use when there is little
free space. I had briefly considered that but thought it would be too
hard to fit in with my design.

What do you think about adding support for in-place updates on large files
as I described? I believe this would be a win in most cases, although
it is clearly workload dependent.

> +In order to align F2FS with underlying flash-based storages, F2FS allocates a
> +segment in a unit of section. F2FS expects that the section size would be the
> +same as the unit size of garbage collection in FTL. Furthermore, with respect
> +to the mapping granularity in FTL, F2FS allocates each sections of the active
> +logs from different zones as much as possible, since FTL can write the data in
> +the active logs into one allocation unit according to its mapping granularity.

I did not know about the benefit of using different zones, but it makes a lot
of sense now that you mention it.

> +Cleaning process
> +----------------
> +
> +F2FS does cleaning both on demand and in the background. On-demand cleaning is
> +triggered when there are not enough free segments to serve VFS calls. Background
> +cleaner is operated by a kernel thread, and triggers the cleaning job when the
> +system is idle.
> +
> +F2FS supports two victim selection policies: greedy and cost-benefit algorithms.
> +In greedy algorithm, F2FS selects a victim segment having the smallest number of
> +valid blocks. In cost-benefit algorithm, F2FS selects a victim segment according
> +to the segment age and the number of valid blocks in order to address log block
> +thrashing problem in greedy algorithm. F2FS adopts greedy algorithm for on-demand
> +cleaner, while background cleaner adopts cost-benefit algorithm.
> +
> +In order to identify what the data in the victim segment are valid or not, F2FS
> +manages a bitmap. Each bit represents the validity of a block, and the bitmap is
> +composed of a bit stream covering whole blocks in main area.

Do you take the type of data into account in the cleaner? In particular, do you
try to group blocks together in a segment based on any of:

* file modification age,
* access patterns seen on the file (random vs linear writes),
* multiple degrees of cold/hot files
* file name (e.g. extension),
* file size,
* directory/data/inode/block-pointer blocks
* ...

I spent a lot of time thinking about possible optimizations based on these but have
not actually done any simulations on what would be a good strategy. I'm curious
to learn if you considered the above.

Arnd

2012-10-14 06:40:41

by Jaegeuk Kim

[permalink] [raw]
Subject: Re: [PATCH 01/16] f2fs: add document

2012-10-13 (토), 18:35 +0000, Arnd Bergmann:
> On Friday 05 October 2012 20:56:06 김재극 wrote:
> > This adds a document describing the mount options, proc entries, usage, and
> > design of Flash-Friendly File System, namely F2FS.
>
> Nice document!
>
> This is a lot like the design I came up with for an experimental file system
> last year, so I'll try to go through and talk about the differences to
> what I had. I believe there are a few points where you are missing
> optimizations that I had and that could be added.
>
> > +================================================================================
> > +PROC ENTRIES
> > +================================================================================
> > +
> > +/proc/fs/f2fs/ contains information about partitions mounted as f2fs. For each
> > +partition, a corresponding directory, named as its device name, is provided with
> > +the following proc entries.
> > +
> > +- f2fs_stat major file system information managed by f2fs currently
> > +- f2fs_sit_stat average SIT information about whole segments
> > +- f2fs_mem_stat current memory footprint consumed by f2fs
> > +
> > +e.g., in /proc/fs/f2fs/sdb1/
>
> Unrelated to the file system design, these should not be part of procfs at
> all. You can probably convert them to debugfs files easily though and perhaps
> even simplify the implementation in the process (I haven't looked at how it's
> done at the moment).

As greg pointed out this before, IMHO, this information is used
not for debugging f2fs, but just showing the current status to users.
So, I put this in procfs.

But, as both of you and greg recommend this, I'd better reexamine the
proc entries in consideration of debugfs at this moment.

>
> > +================================================================================
> > +DESIGN
> > +================================================================================
> > +
> > +On-disk Layout
> > +--------------
> > +
> > +F2FS divides whole volume into a number of segments each of which size is 2MB by
> > +default. A section is composed of consecutive segments, and a zone consists of a
> > +set of sections.
>
> A segment size of 2MB should work fine for most eMMC, but there are a lot of
> devices using TLC (three bit per cell MLC) flash with 1.5/3/6/12 MB erase blocks,
> in particular cheap SD cards. Your design fundamentally won't work on those if you
> only allow power-of-two multiples of 2 MB.
>
> Would it be possible to change the segment size to 1.5 MB as a mkfs option?
> I realize that the space utilization would be slightly worse in that case,
> but if we allow power-of-two multiples of either 1.5 or 2 MB, that should cover
> almost all media on the market today, and most likely for the forseeable
> future.

Yes, definitely right.
In the f2fs design, a segment with 2MB is a basic unit to manage the
whole storage space, but not to align with the erase block size.
Instead, we have a section for alignment.
If the erase block size is 1.5MB, I think it is worth to set the
section size to 6MB.
The only thing that I have to do is to support configuring the section
size as just multiples not power-of-two ones.
How do you think about this?

>
> > +F2FS maintains logically six log areas. Except SB, all the log areas are managed
> > +in a unit of multiple segments. SB is located at the beggining of the partition,
> > +and there exist two superblocks to avoid file system crash. Other file system
> > +metadata such as CP, NAT, SIT, and SSA are located in front part of the volume.
> > +Main area contains file and directory data including their indices.
>
> This is almost identical to what I had come up with.
>
> The main difference is the number of log areas. Given that you need seven erase
> blocks (six log areas, plus a random access area), quite a number of devices
> are excluded from working ideally. Making use of six log areas is definitely
> a win if you manage to split the data well between hot/warm/cold blocks, but
> on devices that can only handle 5, this is also clearly a loss.
>
> I understand that all Samsung eMMC handles this well, and so would the Samsung
> Class 10 'Plus' SDHC cards. The Samsung Class 4 'Essential' cards can only do
> 2 blocks of 6 MB plus the FAT area that you need for random access, so that
> won't work well with your current design.
>
> Excluding some of your main competitor's eMMC seems more problematic. My view
> is that if we include the file system, it should fundamentally work across
> all devices in the industry. It's definitely ok to use 6 log areas by default
> to optimize for sane devices like yours, but please consider offering at
> least using just 4 log areas as an mkfs option in order to include all other
> eMMC and the main SD card brands (many Sandisk, Lexar, and a bunch of the
> noname ones). I don't think it's worth including the Toshiba/Phison/Kingston
> SD card controllers though, as they are basically broken for use with a
> proper file system anyway and attempting to support a single log area (plus
> random area in the front) would require tradeoffs that sacrifice performance
> on proper devices.

Yes, good point.
Actually, apart from the device side, I've concerned for a long time how
many logs should be needed and how to use them especially in order to
reduce the cleaning overhead efficiently in f2fs.
Finally, I got six logs.

As you mentioned, it's ok to support 4 log areas as an option.
But, what I'm concerning is the cleaning overhead.
In terms of write amplification, is it really beneficial to reduce the
number of logs for flash storage even though the cleaning overhead
increases?

BTW, if the cleaning overhead in 4 logs is not different from that in 6
logs, absolutely we should use 4 logs.
This is basically dependant on the workload, and I think it needs more
research.

>
> > +Each area manages the following contents.
> > +- CP File system information, bitmaps for valid NAT/SIT sets, orphan
> > + inode lists, and summary entries of current active segments.
> > +- NAT Block address table for all the node blocks stored in Main area.
> > +- SIT Segment information such as valid block count and bitmap for the
> > + validity of all the blocks.
> > +- SSA Summary entries which contains the owner information of all the
> > + data and node blocks stored in Main area.
> > +- Main Node and data blocks.
> > +
> > +In order to avoid misalignment between file system and flash-based storage, F2FS
> > +aligns the start block address of CP with the segment size. Also, it aligns the
> > +start block address of Main area with the zone size by reserving some segments
> > +in SSA area.
>
> What do you do if the partition itself is misaligned? Do you align the start block
> address to the start of the partition or the start of the device?

Yes, we got the start of the partition by ioctl.

>
> > +File System Metadata Structure
> > +------------------------------
> > +
> > +F2FS adopts the checkpointing scheme to maintain file system consistency. At the
> > +mount time, F2FS first tries to find the last valid checkpoint data by scanning
> > +CP area. In order to reduce the scanning time, F2FS uses only two copies of CP.
> > +One of them always indicates the last valid data, which is called as shadow copy
> > +mechanism. In addition to CP, NAT and SIT also adopts the shadow copy mechanism.
> > +
> > +For file system consistency, each CP points which NAT and SIT copies are valid,
> > +as shown as below.
> > +
> > + +--------+----------+---------+
> > + | CP | NAT | SIT |
> > + +--------+----------+---------+
> > + . . . .
> > + . . . .
> > + . . . .
> > + +-------+-------+--------+--------+--------+--------+
> > + | CP #0 | CP #1 | NAT #0 | NAT #1 | SIT #0 | SIT #1 |
> > + +-------+-------+--------+--------+--------+--------+
> > + | ^ ^
> > + | | |
> > + `----------------------------------------'
>
> There is one small difference to my design: instead of having the checkpoint area
> in the front, the idea was to only record a pointer to the currently used segments
> in the global metadata and treat the segment that carries the inodes as a journal.
> The main advantage of that is that you only have to write a single block to
> atomically update an inode, possibly including embedded data, while you always
> have to write the inode first, and then make an update to the CP area to make
> it visible.

As I understand, I think your design is not different from f2fs.
The checkpoint in f2fs is used for the start of power-off-recovery,
which contains current active segment numbers for six logs.
And each direct node block including inode blocks contains the next log
pointer (i.e., block address) in order to support roll-forward.
I think this feature is also a kind of journaling inodes described by
you, which means that f2fs does not need to update the CP area whenever
inode is written or even if fsync is called.

>
> > +Index Structure
> > +---------------
> > +
> > +The key data structure to manage the data locations is a "node". As similar as
> > +traditional file structures, F2FS has three types of node: inode, direct node,
> > +indirect node. F2FS assigns 4KB to an inode block where contains 929 data block
> > +indices, two direct node pointers, two indirect node pointers, and one double
> > +indirect node pointer as described below. One direct node block contains 1018
> > +data blocks, and one indirect node block contains also 1018 node blocks. Thus,
> > +One inode block (i.e., a file) covers:
> > + 4KB * (929 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB.
> > +
> > + Inode block (4KB)
> > + |- data (929)
> > + |- direct node (2)
> > + | `- data (1018)
> > + |- indirect node (2)
> > + | `- direct node (1018)
> > + | `- data (1018)
> > + `- triple indirect node (1)
> > + `- indirect node (1018)
> > + `- direct node (1018)
> > + `- data (1018)
>
> You use two different naming schemes here: you call the last-level indirect
> block the "double indirect" node pointer in the text, but the "triple
> indirect" node pointer in the illustration.

Stefan pointed out before. :)
I'll change this.

> > +Note that, all the node blocks are mapped by NAT, which means the location of
> > +each node is translated by the NAT table. In the consideration of the wandering
> > +tree problem, F2FS is able to cut off the propagation of node updates caused by
> > +leaf data writes.
>
> Ah, that is very clever!
>
> My solution to the same problem was to have a gigantic fan-out in the inode
> in order to only ever need one level of indirection:
>
> * I make the cluster that is addressed by a pointer larger than 4k by
> aligning it to e.g. the flash page size, typically 16KB to 64KB.
> * in a 64KB inode, you can fit some 16000 pointers
> * each pointer then points to a cluster of that size, so you get up to
> 64KB * 16000 = 1GB file size with this method on 64 KB clusters, or
> 16KB * 4000 = 64 MB file size on 16 KB clusters, or just under 4 MB
> file size on 4 KB clusters.
> * To allow larger files, a flag in the inode signifies that the pointers
> address entire segments instead of clusters, which allows files up
> to the partition size even with 4 KB clusters on most media (erase block
> size is usually coupled to drive size)
>

IMHO, it's an interesting approach to eliminate the indirection.
It seems to adopt large block size and an extent approach in a hybrid
manner.

> To compare the two approaches:
>
> + f2fs does not need a clever algorithm to determine when to pick a
> segment indirect layout over a cluster indirect, and does not need
> to copy data when a file grows over the cut-off
> + f2fs does not waste an average of half a segment for large files.
> - large files are less efficient to read because they are not in
> contiguous segment chunks

Right.
But, in most cases, I've seen that the read performance can be improved
by readahead in VFS.

> - synchronous updates of indirect files require four writes (data, indirect
> block, inode, NAT) rather than just two as in my approach (data,
> inode journal).

I don't know exactly your approach, but, does it need to write inode_map
finally?
Normally in the case of fsync calls, f2fs writes data and direct node
block only, since the data can be recovered by roll-forward.
Instead, flusher will write the indirect node blocks afterwards.

> - f2fs has no embedded data in very small files, though this should be easy
> to add.

Yes, vyacheslav pointed out before.
At this moment, I'd like to remain this feature for future.
Instead, I totally would like to focus on validation and stabilization
on many unexpected platforms.

> +/- random writes in database files get turned into log-structured writes,
> which is often more efficient to write, but also means more fragmentation
> and the user application cannot optimize the accesses for the underlying
> storage as a lot of databases do.
>
> > +Default Block Allocation
> > +------------------------
> > +
> > +In runtime, F2FS manages six active logs inside "Main" area: Hot/Warm/Cold node
> > +and Hot/Warm/Cold data.
> > +
> > +- Hot node contains direct node blocks of directories.
> > +- Warm node contains direct node blocks except hot node blocks.
> > +- Cold node contains indirect node blocks
> > +- Hot data contains dentry blocks
> > +- Warm data contains data blocks except hot and cold data blocks
> > +- Cold data contains multimedia data or migrated data blocks
>
> This categorization looks like you could easily reduce the number of active
> logs if the hardware cannot handle all six, e.g.:
>
> 1 hardware segment: not easily possible, I suppose
> 2 segments: node + data
> 3 segments: node + hot data + warm/cold data
> 4 segments: hot node + warm/cold node + hot data + warm/cold data
> 5 segments: hot node + warm/cold node + hot data + warm data + cold data
>

Yes, several options can be implemented easily.
But, as I mentioned above, I'd like to discuss in more detail.

> > +LFS has two schemes for free space management: threaded log and copy-and-compac-
> > +tion. The copy-and-compaction scheme, aka cleaning, is well-suited for devices
> > +showing very good sequential write performance, since free segments are served
> > +all the time for writing new data. However, it suffers from cleaning overhead
> > +under high utilization. Contrarily, the threaded log scheme suffers from random
> > +writes, but no cleaning process is needed. F2FS adopts a hybrid scheme where the
> > +copy-and-compaction scheme is adopted by default, but the policy is dynamically
> > +changed to the threaded log scheme according to the file system status.
>
> Compared to my design, this uses the exact same two methods, but applies them
> in different scenarios: I would always use the copy-and-compaction method
> for small and medium sized files (up to a few megabytes) but use in-place
> overwriting in large files, similar to (but not the same as) your threaded
> log.
>
> Using the threaded log sounds like a good idea to use when there is little
> free space. I had briefly considered that but thought it would be too
> hard to fit in with my design.
>
> What do you think about adding support for in-place updates on large files
> as I described? I believe this would be a win in most cases, although
> it is clearly workload dependent.

I also considered writing data in place, and sometimes it can be a good
choice instead of reusing obsolete space.
So, I implemented this in f2fs, but I disabled in the current design,
since I suspect that there is a recovery problem.
Due to the roll-back model, I'm not sure we can overwrite data before
checkpoint.
How do you think?

>
> > +In order to align F2FS with underlying flash-based storages, F2FS allocates a
> > +segment in a unit of section. F2FS expects that the section size would be the
> > +same as the unit size of garbage collection in FTL. Furthermore, with respect
> > +to the mapping granularity in FTL, F2FS allocates each sections of the active
> > +logs from different zones as much as possible, since FTL can write the data in
> > +the active logs into one allocation unit according to its mapping granularity.
>
> I did not know about the benefit of using different zones, but it makes a lot
> of sense now that you mention it.
>
> > +Cleaning process
> > +----------------
> > +
> > +F2FS does cleaning both on demand and in the background. On-demand cleaning is
> > +triggered when there are not enough free segments to serve VFS calls. Background
> > +cleaner is operated by a kernel thread, and triggers the cleaning job when the
> > +system is idle.
> > +
> > +F2FS supports two victim selection policies: greedy and cost-benefit algorithms.
> > +In greedy algorithm, F2FS selects a victim segment having the smallest number of
> > +valid blocks. In cost-benefit algorithm, F2FS selects a victim segment according
> > +to the segment age and the number of valid blocks in order to address log block
> > +thrashing problem in greedy algorithm. F2FS adopts greedy algorithm for on-demand
> > +cleaner, while background cleaner adopts cost-benefit algorithm.
> > +
> > +In order to identify what the data in the victim segment are valid or not, F2FS
> > +manages a bitmap. Each bit represents the validity of a block, and the bitmap is
> > +composed of a bit stream covering whole blocks in main area.
>
> Do you take the type of data into account in the cleaner? In particular, do you
> try to group blocks together in a segment based on any of:
>
> * file modification age,
> * access patterns seen on the file (random vs linear writes),
> * multiple degrees of cold/hot files
> * file name (e.g. extension),
> * file size,
> * directory/data/inode/block-pointer blocks
> * ...
>
> I spent a lot of time thinking about possible optimizations based on these but have
> not actually done any simulations on what would be a good strategy. I'm curious
> to learn if you considered the above.

In f2fs, a background cleaner loads victim valid blocks and marks them
as dirty in the page cache. Then, flusher will write them with user
data to the proper log areas determined by their types.
An on-demand cleaner loads and moves the victim blocks right away to the
logs also determined by their types.
In the background cleaner, a victim section is selected by a
cost-benefit function which takes into account section ages.

In summary, f2fs uses the following information to try group blocks
in order to reduce the cleaning overhead.

* segment age,
* file name (extension),
* directory/data/inode/block-pointer blocks

Maybe there are another optimization items.
But as you know, if we try to optimize somthing, we have to add or
maintain additional information, resulting in an overhead.
We'd better take more discussions on this issue.

Thank you very much.

>
> Arnd
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Jaegeuk Kim
Samsung

2012-10-14 12:13:42

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 01/16] f2fs: add document

On Sunday 14 October 2012, Jaegeuk Kim wrote:
> 2012-10-13 (토), 18:35 +0000, Arnd Bergmann:

> > > +================================================================================
> > > +DESIGN
> > > +================================================================================
> > > +
> > > +On-disk Layout
> > > +--------------
> > > +
> > > +F2FS divides whole volume into a number of segments each of which size is 2MB by
> > > +default. A section is composed of consecutive segments, and a zone consists of a
> > > +set of sections.
> >
> > A segment size of 2MB should work fine for most eMMC, but there are a lot of
> > devices using TLC (three bit per cell MLC) flash with 1.5/3/6/12 MB erase blocks,
> > in particular cheap SD cards. Your design fundamentally won't work on those if you
> > only allow power-of-two multiples of 2 MB.
> >
> > Would it be possible to change the segment size to 1.5 MB as a mkfs option?
> > I realize that the space utilization would be slightly worse in that case,
> > but if we allow power-of-two multiples of either 1.5 or 2 MB, that should cover
> > almost all media on the market today, and most likely for the forseeable
> > future.
>
> Yes, definitely right.
> In the f2fs design, a segment with 2MB is a basic unit to manage the
> whole storage space, but not to align with the erase block size.
> Instead, we have a section for alignment.
> If the erase block size is 1.5MB, I think it is worth to set the
> section size to 6MB.
> The only thing that I have to do is to support configuring the section
> size as just multiples not power-of-two ones.
> How do you think about this?

I think I was confusing the terms 'section' and 'segment' here, I thought the
segment was the larger of the two, sorry about that.

For the 2011 generation of SD cards, changing the segment size to 1.5 MB would
probably give a better performance, as it reduces the overhead from garbage
collection compared to using 6 MB sections with 2 MB segments.

In future devices, I assume we will not see erase blocks under 4 or 6 MB,
so using arbitrary section sizes of multiples of 2 MB should indeed be
good enough. It will also help us if we ever see 5-bit-per-cell flash
and need 10 MB erase blocks, although I have not seen those on any roadmaps
so far.

> Yes, good point.
> Actually, apart from the device side, I've concerned for a long time how
> many logs should be needed and how to use them especially in order to
> reduce the cleaning overhead efficiently in f2fs.
> Finally, I got six logs.
>
> As you mentioned, it's ok to support 4 log areas as an option.
> But, what I'm concerning is the cleaning overhead.
> In terms of write amplification, is it really beneficial to reduce the
> number of logs for flash storage even though the cleaning overhead
> increases?
>
> BTW, if the cleaning overhead in 4 logs is not different from that in 6
> logs, absolutely we should use 4 logs.
> This is basically dependant on the workload, and I think it needs more
> research.

I would expect that using 6 rather than 4 is almost always a measurable win
when the hardware supports it, but a much bigger loss if the the hardware
cannot support it.

I believe we can measure the second case pretty well using Tixy's flashsim
tool from http://yxit.co.uk/public/flash-performance/ by taking a blocktrace
of the current code using 6 logs, and then simulating it on devices with
7 and 5 open erase blocks, respectively. On the former case it should
report a write amplification fact of close to '1' for the FTL, in the latter
case it will be significantly higher than that but also workload dependent.

> > > +Each area manages the following contents.
> > > +- CP File system information, bitmaps for valid NAT/SIT sets, orphan
> > > + inode lists, and summary entries of current active segments.
> > > +- NAT Block address table for all the node blocks stored in Main area.
> > > +- SIT Segment information such as valid block count and bitmap for the
> > > + validity of all the blocks.
> > > +- SSA Summary entries which contains the owner information of all the
> > > + data and node blocks stored in Main area.
> > > +- Main Node and data blocks.
> > > +
> > > +In order to avoid misalignment between file system and flash-based storage, F2FS
> > > +aligns the start block address of CP with the segment size. Also, it aligns the
> > > +start block address of Main area with the zone size by reserving some segments
> > > +in SSA area.
> >
> > What do you do if the partition itself is misaligned? Do you align the start block
> > address to the start of the partition or the start of the device?
>
> Yes, we got the start of the partition by ioctl.

Ok, good. Is it possible to override this manually? I have seen (very few) devices in the
past on which the first erase block is smaller than the others and the others are all
unaligned as a consequence. It's probably not worth supporting if this is a lot of work,
but if you think it's easy enough to add a command line option, it may come in handy.

FWIW, I have also seen devices that have a slightly different erase block size, namely
4128 (rather than 4096) KB, but I don't see any way to reasonably support that (or even
detect it programatically) at all.

> > There is one small difference to my design: instead of having the checkpoint area
> > in the front, the idea was to only record a pointer to the currently used segments
> > in the global metadata and treat the segment that carries the inodes as a journal.
> > The main advantage of that is that you only have to write a single block to
> > atomically update an inode, possibly including embedded data, while you always
> > have to write the inode first, and then make an update to the CP area to make
> > it visible.
>
> As I understand, I think your design is not different from f2fs.
> The checkpoint in f2fs is used for the start of power-off-recovery,
> which contains current active segment numbers for six logs.
> And each direct node block including inode blocks contains the next log
> pointer (i.e., block address) in order to support roll-forward.
> I think this feature is also a kind of journaling inodes described by
> you, which means that f2fs does not need to update the CP area whenever
> inode is written or even if fsync is called.

Ok, I see.

> > - synchronous updates of indirect files require four writes (data, indirect
> > block, inode, NAT) rather than just two as in my approach (data,
> > inode journal).
>
> I don't know exactly your approach, but, does it need to write inode_map
> finally?

Yes, the inode map is written every time one section is filled with
inodes.

> Normally in the case of fsync calls, f2fs writes data and direct node
> block only, since the data can be recovered by roll-forward.
> Instead, flusher will write the indirect node blocks afterwards.

Ok, good. So it's actually much more similar than I thought.

> > - f2fs has no embedded data in very small files, though this should be easy
> > to add.
>
> Yes, vyacheslav pointed out before.
> At this moment, I'd like to remain this feature for future.
> Instead, I totally would like to focus on validation and stabilization
> on many unexpected platforms.

Fair enough. This can be added as a backwards compatible (not forward
compatible) extension later. Maybe it's good to formalize the layout
for those files already though and reserve a flag in the inode for it.

> > > +LFS has two schemes for free space management: threaded log and copy-and-compac-
> > > +tion. The copy-and-compaction scheme, aka cleaning, is well-suited for devices
> > > +showing very good sequential write performance, since free segments are served
> > > +all the time for writing new data. However, it suffers from cleaning overhead
> > > +under high utilization. Contrarily, the threaded log scheme suffers from random
> > > +writes, but no cleaning process is needed. F2FS adopts a hybrid scheme where the
> > > +copy-and-compaction scheme is adopted by default, but the policy is dynamically
> > > +changed to the threaded log scheme according to the file system status.
> >
> > Compared to my design, this uses the exact same two methods, but applies them
> > in different scenarios: I would always use the copy-and-compaction method
> > for small and medium sized files (up to a few megabytes) but use in-place
> > overwriting in large files, similar to (but not the same as) your threaded
> > log.
> >
> > Using the threaded log sounds like a good idea to use when there is little
> > free space. I had briefly considered that but thought it would be too
> > hard to fit in with my design.
> >
> > What do you think about adding support for in-place updates on large files
> > as I described? I believe this would be a win in most cases, although
> > it is clearly workload dependent.
>
> I also considered writing data in place, and sometimes it can be a good
> choice instead of reusing obsolete space.
> So, I implemented this in f2fs, but I disabled in the current design,
> since I suspect that there is a recovery problem.
> Due to the roll-back model, I'm not sure we can overwrite data before
> checkpoint.
> How do you think?

In-place updates in general can be hard for two reaons I see:

* writes that are smaller than the page size might not be done
atomically on low-quality media, and more generally writing to one
file might during power-fail might cause contents of another file to
get lost (write-disturb).

* ordering between data and metadata updates is harder.

I think in the specific case that my design has where in-place
updates are only done in segment-indirect files, neither of these
is a serious problem. The write-disturb case here will only ever
have the chance to destroy contents of the same file, and only
for hardware that doesn't handle this already. This is still
better than any other file system we have can do. Further, you never
really have to update metadata for an in-place update in these
cases, since you do not need to maintain per-segment summaries.

> > Do you take the type of data into account in the cleaner? In particular, do you
> > try to group blocks together in a segment based on any of:
> >
> > * file modification age,
> > * access patterns seen on the file (random vs linear writes),
> > * multiple degrees of cold/hot files
> > * file name (e.g. extension),
> > * file size,
> > * directory/data/inode/block-pointer blocks
> > * ...
> >
> > I spent a lot of time thinking about possible optimizations based on these but have
> > not actually done any simulations on what would be a good strategy. I'm curious
> > to learn if you considered the above.
>
> In f2fs, a background cleaner loads victim valid blocks and marks them
> as dirty in the page cache. Then, flusher will write them with user
> data to the proper log areas determined by their types.
> An on-demand cleaner loads and moves the victim blocks right away to the
> logs also determined by their types.
> In the background cleaner, a victim section is selected by a
> cost-benefit function which takes into account section ages.
>
> In summary, f2fs uses the following information to try group blocks
> in order to reduce the cleaning overhead.
>
> * segment age,
> * file name (extension),
> * directory/data/inode/block-pointer blocks
>
> Maybe there are another optimization items.
> But as you know, if we try to optimize somthing, we have to add or
> maintain additional information, resulting in an overhead.
> We'd better take more discussions on this issue.

Loading blocks to immediately mark them dirty is a nice simplification
of the algorithm that avoids adding long latencies for the gargabe
collection phase as it gets intermixed with regular writes.
I intentionally did the opposite though: the algorithm for the prototype
as implemented by Volker Schneider is:

* when one segment is filled with data (nodes or user data), determine
whether garbage collection should be done or not.
* if we should do garbage collection, decide on one type of data (by
some form of categorization), and copy data from one old segment
into a new segment
* until the new segment is full, keep adding more data of the same kind
from other segments that should be garbage collected.

This incurs a noticeable write latency every time we go into GC, basically
the time to read and write an entire segment, but it also reduces
fragmentation by grouping similar data together more than your approach,
and it means that we never mix new data with old data from GC into the
same segments. It also lets us record the segment age more accurately
for the new segment in GC, because the age of the data is still what
it was, unlike data that gets freshly written.

Arnd