LinuxLists.cc - [PATCH 00/16 v3] f2fs: introduce flash-friendly file system

2012-10-31 09:35:34

[permalink] [raw]

Subject: [PATCH 00/16 v3] f2fs: introduce flash-friendly file system

Change log from v2:

o Fix compilation error for arm [Max]
o Move proc entries to debugfs [Greg]
o Add i_atime, i_generation, etc [Neil]
o Support NFS export [Changman]
o Move the f2fs magic number [Marco]
o Add s_time_gran [Marco]
o Fix f2fs_truncate [Marco]
o Enhance f2fs document [Vyacheslav]
o Support uuid and add additional comments [Vyacheslav]
o Change superblock offset [Vyacheslav]
o Fix some bugs and return values [Vyacheslav]
o Improve initial mount time [Jaegeuk]
o Fix f2fs-tools environment [Mike]

I've set up a git tree in sf.net for f2fs-tools.
Please download f2fs-tools by tag: for_patch_set_v3.

--------------------------------------------------------------------------------
Change log from v1:

o Apply the recent user namespace changes [Eric]
o Remove unnecessary condition check [Al]
o Fix wrong description [Stefan]
o Fix f2fs document [Randy]
o Enlarge the volume label length to 256 unicodes [Martin]
o Support time resolution to nano scale [Boaz]
o Fix the wrong use of endian conversion [David]
o Fix the use of mutex and spinlocks [David]
o Remove the use of __GFP_NOFAIL, etc [Neil]
o Change the flow for readability [Neil]
o Reduce the lock contention in CP [Neil]
o Support multiples of section size [Arnd]
o Support configurable extension list [Arnd]
o Support configurable active log numbers [Arnd]

[Furture works]
o Aware of file access pattern
o Erase block indirect
o Sub-page write avoidance
o in-line data
o xattr optimization/in-line xattrs

I really appreciate the valuable comments from all of you in community.

--------------------------------------------------------------------------------

This is a new patch set for the f2fs file system.

What is F2FS?
=============

NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
been widely being used for ranging from mobile to server systems. Since they are
known to have different characteristics from the conventional rotational disks,
a file system, an upper layer to the storage device, should adapt to the changes
from the sketch.

F2FS is a new file system carefully designed for the NAND flash memory-based storage
devices. We chose a log structure file system approach, but we tried to adapt it
to the new form of storage. Also we remedy some known issues of the very old log
structured file system, such as snowball effect of wandering tree and high cleaning
overhead.

Because a NAND-based storage device shows different characteristics according to
its internal geometry or flash memory management scheme aka FTL, we add various
parameters not only for configuring on-disk layout, but also for selecting allocation
and cleaning algorithms.

Patch set
=========

The patch #1 adds a document to Documentation/filesystems/.
The patch #2 adds a header file of on-disk layout to include/linux/.
The patches #3-#16 adds f2fs source files to fs/f2fs/.
The Last patch, patch #16, updates Makefile and Kconfig.

mkfs.f2fs
=========

The file system formatting tool, "mkfs.f2fs", is available from the following
download page: http://sourceforge.net/projects/f2fs-tools/

Usage
=====

If you'd like to experience f2fs, simply:
# mkfs.f2fs /dev/sdb1
# mount -t f2fs /dev/sdb1 /mnt/f2fs

Short log
=========

Jaegeuk Kim (17):
f2fs: add document
f2fs: add on-disk layout
f2fs: add superblock and major in-memory structure
f2fs: add super block operations
f2fs: add checkpoint operations
f2fs: add node operations
f2fs: add segment operations
f2fs: add file operations
f2fs: add address space operations for data
f2fs: add core inode operations
f2fs: add inode operations for special inodes
f2fs: add core directory operations
f2fs: add xattr and acl functionalities
f2fs: add garbage collection functions
f2fs: add recovery routines for roll-forward
f2fs: move proc files to debugfs
f2fs: update Kconfig and Makefile

Documentation/filesystems/00-INDEX | 2 +
Documentation/filesystems/f2fs.txt | 417 +++++++++
fs/Kconfig | 1 +
fs/Makefile | 1 +
fs/f2fs/Kconfig | 52 ++
fs/f2fs/Makefile | 7 +
fs/f2fs/acl.c | 465 ++++++++++
fs/f2fs/acl.h | 57 ++
fs/f2fs/checkpoint.c | 792 ++++++++++++++++
fs/f2fs/data.c | 701 ++++++++++++++
fs/f2fs/debug.c | 361 ++++++++
fs/f2fs/dir.c | 674 ++++++++++++++
fs/f2fs/f2fs.h | 1062 +++++++++++++++++++++
fs/f2fs/file.c | 637 +++++++++++++
fs/f2fs/gc.c | 742 +++++++++++++++
fs/f2fs/gc.h | 117 +++
fs/f2fs/hash.c | 98 ++
fs/f2fs/inode.c | 266 ++++++
fs/f2fs/namei.c | 504 ++++++++++
fs/f2fs/node.c | 1763 +++++++++++++++++++++++++++++++++++
fs/f2fs/node.h | 353 +++++++
fs/f2fs/recovery.c | 375 ++++++++
fs/f2fs/segment.c | 1798 ++++++++++++++++++++++++++++++++++++
fs/f2fs/segment.h | 615 ++++++++++++
fs/f2fs/super.c | 656 +++++++++++++
fs/f2fs/xattr.c | 389 ++++++++
fs/f2fs/xattr.h | 145 +++
include/linux/f2fs_fs.h | 410 ++++++++
include/uapi/linux/magic.h | 1 +
29 files changed, 13461 insertions(+)
create mode 100644 Documentation/filesystems/f2fs.txt
create mode 100644 fs/f2fs/Kconfig
create mode 100644 fs/f2fs/Makefile
create mode 100644 fs/f2fs/acl.c
create mode 100644 fs/f2fs/acl.h
create mode 100644 fs/f2fs/checkpoint.c
create mode 100644 fs/f2fs/data.c
create mode 100644 fs/f2fs/debug.c
create mode 100644 fs/f2fs/dir.c
create mode 100644 fs/f2fs/f2fs.h
create mode 100644 fs/f2fs/file.c
create mode 100644 fs/f2fs/gc.c
create mode 100644 fs/f2fs/gc.h
create mode 100644 fs/f2fs/hash.c
create mode 100644 fs/f2fs/inode.c
create mode 100644 fs/f2fs/namei.c
create mode 100644 fs/f2fs/node.c
create mode 100644 fs/f2fs/node.h
create mode 100644 fs/f2fs/recovery.c
create mode 100644 fs/f2fs/segment.c
create mode 100644 fs/f2fs/segment.h
create mode 100644 fs/f2fs/super.c
create mode 100644 fs/f2fs/xattr.c
create mode 100644 fs/f2fs/xattr.h
create mode 100644 include/linux/f2fs_fs.h

--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-10-31 09:39:08

[permalink] [raw]

Subject: [PATCH 01/17] f2fs: add document

This adds a document describing the mount options, proc entries, usage, and
design of Flash-Friendly File System, namely F2FS.

Signed-off-by: Jaegeuk Kim <[email protected]>
---
Documentation/filesystems/00-INDEX | 2 +
Documentation/filesystems/f2fs.txt | 417 ++++++++++++++++++++++++++++++++++++
2 files changed, 419 insertions(+)
create mode 100644 Documentation/filesystems/f2fs.txt

diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index 8c624a1..ce5fd46 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -48,6 +48,8 @@ ext4.txt
- info, mount options and specifications for the Ext4 filesystem.
files.txt
- info on file management in the Linux kernel.
+f2fs.txt
+ - info and mount options for the F2FS filesystem.
fuse.txt
- info on the Filesystem in User SpacE including mount options.
gfs2.txt
diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt
new file mode 100644
index 0000000..6ce5407
--- /dev/null
+++ b/Documentation/filesystems/f2fs.txt
@@ -0,0 +1,417 @@
+================================================================================
+WHAT IS Flash-Friendly File System (F2FS)?
+================================================================================
+
+NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
+been equipped on a variety systems ranging from mobile to server systems. Since
+they are known to have different characteristics from the conventional rotating
+disks, a file system, an upper layer to the storage device, should adapt to the
+changes from the sketch in the design level.
+
+F2FS is a file system exploiting NAND flash memory-based storage devices, which
+is based on Log-structured File System (LFS). The design has been focused on
+addressing the fundamental issues in LFS, which are snowball effect of wandering
+tree and high cleaning overhead.
+
+Since a NAND flash memory-based storage device shows different characteristic
+according to its internal geometry or flash memory management scheme, namely FTL,
+F2FS and its tools support various parameters not only for configuring on-disk
+layout, but also for selecting allocation and cleaning algorithms.
+
+The file system formatting tool, "mkfs.f2fs", is available from the following
+download page: http://sourceforge.net/projects/f2fs-tools/
+
+================================================================================
+BACKGROUND AND DESIGN ISSUES
+================================================================================
+
+Log-structured File System (LFS)
+--------------------------------
+"A log-structured file system writes all modifications to disk sequentially in
+a log-like structure, thereby speeding up both file writing and crash recovery.
+The log is the only structure on disk; it contains indexing information so that
+files can be read back from the log efficiently. In order to maintain large free
+areas on disk for fast writing, we divide the log into segments and use a
+segment cleaner to compress the live information from heavily fragmented
+segments." from Rosenblum, M. and Ousterhout, J. K., 1992, "The design and
+implementation of a log-structured file system", ACM Trans. Computer Systems
+10, 1, 26–52.
+
+Wandering Tree Problem
+----------------------
+In LFS, when a file data is updated and written to the end of log, its direct
+pointer block is updated due to the changed location. Then the indirect pointer
+block is also updated due to the direct pointer block update. In this manner,
+the upper index structures such as inode, inode map, and checkpoint block are
+also updated recursively. This problem is called as wandering tree problem [1],
+and in order to enhance the performance, it should eliminate or relax the update
+propagation as much as possible.
+
+[1] Bityutskiy, A. 2005. JFFS3 design issues. http://www.linux-mtd.infradead.org/
+
+Cleaning Overhead
+-----------------
+Since LFS is based on out-of-place writes, it produces so many obsolete blocks
+scattered across the whole storage. In order to serve new empty log space, it
+needs to reclaim these obsolete blocks seamlessly to users. This job is called
+as a cleaning process.
+
+The process consists of three operations as follows.
+1. A victim segment is selected through referencing segment usage table.
+2. It loads parent index structures of all the data in the victim identified by
+ segment summary blocks.
+3. It checks the cross-reference between the data and its parent index structure.
+4. It moves valid data selectively.
+
+This cleaning job may cause unexpected long delays, so the most important goal
+is to hide the latencies to users. And also definitely, it should reduce the
+amount of valid data to be moved, and move them quickly as well.
+
+================================================================================
+KEY FEATURES
+================================================================================
+
+Flash Awareness
+---------------
+- Enlarge the random write area for better performance, but provide the high
+ spatial locality
+- Align FS data structures to the operational units in FTL as best efforts
+
+Wandering Tree Problem
+----------------------
+- Use a term, “node”, that represents inodes as well as various pointer blocks
+- Introduce Node Address Table (NAT) containing the locations of all the “node”
+ blocks; this will cut off the update propagation.
+
+Cleaning Overhead
+-----------------
+- Support a background cleaning process
+- Support greedy and cost-benefit algorithms for victim selection policies
+- Support multi-head logs for static/dynamic hot and cold data separation
+- Introduce adaptive logging for efficient block allocation
+
+================================================================================
+MOUNT OPTIONS
+================================================================================
+
+background_gc_off Turn off cleaning operations, namely garbage collection,
+ triggered in background when I/O subsystem is idle.
+disable_roll_forward Disable the roll-forward recovery routine
+discard Issue discard/TRIM commands when a segment is cleaned.
+no_heap Disable heap-style segment allocation which finds free
+ segments for data from the beginning of main area, while
+ for node from the end of main area.
+nouser_xattr Disable Extended User Attributes. Note: xattr is enabled
+ by default if CONFIG_F2FS_FS_XATTR is selected.
+noacl Disable POSIX Access Control List. Note: acl is enabled
+ by default if CONFIG_F2FS_FS_POSIX_ACL is selected.
+active_logs=%u Support configuring the number of active logs. In the
+ current design, f2fs supports only 2, 4, and 6 logs.
+ Default number is 6.
+disable_ext_identify Disable the extension list configured by mkfs, so f2fs
+ does not aware of cold files such as media files.
+
+================================================================================
+DEBUGFS ENTRIES
+================================================================================
+
+/sys/kernel/debug/f2fs/ contains information about all the partitions mounted as
+f2fs. Each file shows the whole f2fs information.
+
+/sys/kernel/debug/f2fs/status includes:
+ - major file system information managed by f2fs currently
+ - average SIT information about whole segments
+ - current memory footprint consumed by f2fs.
+
+================================================================================
+USAGE
+================================================================================
+
+1. Download userland tools and compile them.
+
+2. Skip, if f2fs was compiled statically inside kernel.
+ Otherwise, insert the f2fs.ko module.
+ # insmod f2fs.ko
+
+3. Create a directory trying to mount
+ # mkdir /mnt/f2fs
+
+4. Format the block device, and then mount as f2fs
+ # mkfs.f2fs -l label /dev/block_device
+ # mount -t f2fs /dev/block_device /mnt/f2fs
+
+Format options
+--------------
+-l [label] : Give a volume label, up to 256 unicode name.
+-a [0 or 1] : Split start location of each area for heap-based allocation.
+ 1 is set by default, which performs this.
+-o [int] : Set overprovision ratio in percent over volume size.
+ 5 is set by default.
+-s [int] : Set the number of segments per section.
+ 1 is set by default.
+-z [int] : Set the number of sections per zone.
+ 1 is set by default.
+-e [str] : Set basic extension list. e.g. "mp3,gif,mov"
+
+================================================================================
+DESIGN
+================================================================================
+
+On-disk Layout
+--------------
+
+F2FS divides the whole volume into a number of segments, each of which is fixed
+to 2MB in size. A section is composed of consecutive segments, and a zone
+consists of a set of sections. By default, section and zone sizes are set to one
+segment size identically, but users can easily modify the sizes by mkfs.
+
+F2FS splits the entire volume into six areas, and all the areas except superblock
+consists of multiple segments as described below.
+
+ align with the zone size <-|
+ |-> align with the segment size
+ _________________________________________________________________________
+ | | | Node | Segment | Segment | |
+ | Superblock | Checkpoint | Address | Info. | Summary | Main |
+ | (SB) | (CP) | Table (NAT) | Table (SIT) | Area (SSA) | |
+ |____________|_____2______|______N______|______N______|______N_____|__N___|
+ . .
+ . .
+ . .
+ ._________________________________________.
+ |_Segment_|_..._|_Segment_|_..._|_Segment_|
+ . .
+ ._________._________
+ |_section_|__...__|_
+ . .
+ .________.
+ |__zone__|
+
+- Superblock (SB)
+ : It is located at the beginning of the partition, and there exist two copies
+ to avoid file system crash. It contains basic partition information and some
+ default parameters of f2fs.
+
+- Checkpoint (CP)
+ : It contains file system information, bitmaps for valid NAT/SIT sets, orphan
+ inode lists, and summary entries of current active segments.
+
+- Node Address Table (NAT)
+ : It is composed of a block address table for all the node blocks stored in
+ Main area.
+
+- Segment Information Table (SIT)
+ : It contains segment information such as valid block count and bitmap for the
+ validity of all the blocks.
+
+- Segment Summary Area (SSA)
+ : It contains summary entries which contains the owner information of all the
+ data and node blocks stored in Main area.
+
+- Main Area
+ : It contains file and directory data including their indices.
+
+In order to avoid misalignment between file system and flash-based storage, F2FS
+aligns the start block address of CP with the segment size. Also, it aligns the
+start block address of Main area with the zone size by reserving some segments
+in SSA area.
+
+Reference the following survey for additional technical details.
+https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey
+
+File System Metadata Structure
+------------------------------
+
+F2FS adopts the checkpointing scheme to maintain file system consistency. At
+mount time, F2FS first tries to find the last valid checkpoint data by scanning
+CP area. In order to reduce the scanning time, F2FS uses only two copies of CP.
+One of them always indicates the last valid data, which is called as shadow copy
+mechanism. In addition to CP, NAT and SIT also adopt the shadow copy mechanism.
+
+For file system consistency, each CP points to which NAT and SIT copies are
+valid, as shown as below.
+
+ +--------+----------+---------+
+ | CP | NAT | SIT |
+ +--------+----------+---------+
+ . . . .
+ . . . .
+ . . . .
+ +-------+-------+--------+--------+--------+--------+
+ | CP #0 | CP #1 | NAT #0 | NAT #1 | SIT #0 | SIT #1 |
+ +-------+-------+--------+--------+--------+--------+
+ | ^ ^
+ | | |
+ `----------------------------------------'
+
+Index Structure
+---------------
+
+The key data structure to manage the data locations is a "node". Similar to
+traditional file structures, F2FS has three types of node: inode, direct node,
+indirect node. F2FS assigns 4KB to an inode block which contains 929 data block
+indices, two direct node pointers, two indirect node pointers, and one double
+indirect node pointer as described below. One direct node block contains 1018
+data blocks, and one indirect node block contains also 1018 node blocks. Thus,
+one inode block (i.e., a file) covers:
+
+ 4KB * (923 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB.
+
+ Inode block (4KB)
+ |- data (923)
+ |- direct node (2)
+ | `- data (1018)
+ |- indirect node (2)
+ | `- direct node (1018)
+ | `- data (1018)
+ `- double indirect node (1)
+ `- indirect node (1018)
+ `- direct node (1018)
+ `- data (1018)
+
+Note that, all the node blocks are mapped by NAT which means the location of
+each node is translated by the NAT table. In the consideration of the wandering
+tree problem, F2FS is able to cut off the propagation of node updates caused by
+leaf data writes.
+
+Directory Structure
+-------------------
+
+A directory entry occupies 11 bytes, which consists of the following attributes.
+
+- hash hash value of the file name
+- ino inode number
+- len the length of file name
+- type file type such as directory, symlink, etc
+
+A dentry block consists of 214 dentry slots and file names. Therein a bitmap is
+used to represent whether each dentry is valid or not. A dentry block occupies
+4KB with the following composition.
+
+ Dentry Block(4 K) = bitmap (27 bytes) + reserved (3 bytes) +
+ dentries(11 * 214 bytes) + file name (8 * 214 bytes)
+
+ [Bucket]
+ +--------------------------------+
+ |dentry block 1 | dentry block 2 |
+ +--------------------------------+
+ . .
+ . .
+ . [Dentry Block Structure: 4KB] .
+ +--------+----------+----------+------------+
+ | bitmap | reserved | dentries | file names |
+ +--------+----------+----------+------------+
+ [Dentry Block: 4KB] . .
+ . .
+ . .
+ +------+------+-----+------+
+ | hash | ino | len | type |
+ +------+------+-----+------+
+ [Dentry Structure: 11 bytes]
+
+F2FS implements multi-level hash tables for directory structure. Each level has
+a hash table with dedicated number of hash buckets as shown below. Note that
+"A(2B)" means a bucket includes 2 data blocks.
+
+----------------------
+A : bucket
+B : block
+N : MAX_DIR_HASH_DEPTH
+----------------------
+
+level #0 | A(2B)
+ |
+level #1 | A(2B) - A(2B)
+ |
+level #2 | A(2B) - A(2B) - A(2B) - A(2B)
+ . | . . . .
+level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
+ . | . . . .
+level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
+
+The number of blocks and buckets are determined by,
+
+ ,- 2, if n < MAX_DIR_HASH_DEPTH / 2,
+ # of blocks in level #n = |
+ `- 4, Otherwise
+
+ ,- 2^n, if n < MAX_DIR_HASH_DEPTH / 2,
+ # of buckets in level #n = |
+ `- 2^((MAX_DIR_HASH_DEPTH / 2) - 1), Otherwise
+
+When F2FS finds a file name in a directory, at first a hash value of the file
+name is calculated. Then, F2FS scans the hash table in level #0 to find the
+dentry consisting of the file name and its inode number. If not found, F2FS
+scans the next hash table in level #1. In this way, F2FS scans hash tables in
+each levels incrementally from 1 to N. In each levels F2FS needs to scan only
+one bucket determined by the following equation, which shows O(log(# of files))
+complexity.
+
+ bucket number to scan in level #n = (hash value) % (# of buckets in level #n)
+
+In the case of file creation, F2FS finds empty consecutive slots that cover the
+file name. F2FS searches the empty slots in the hash tables of whole levels from
+1 to N in the same way as the lookup operation.
+
+The following figure shows an example of two cases holding children.
+ --------------> Dir <--------------
+ | |
+ child child
+
+ child - child [hole] - child
+
+ child - child - child [hole] - [hole] - child
+
+ Case 1: Case 2:
+ Number of children = 6, Number of children = 3,
+ File size = 7 File size = 7
+
+Default Block Allocation
+------------------------
+
+At runtime, F2FS manages six active logs inside "Main" area: Hot/Warm/Cold node
+and Hot/Warm/Cold data.
+
+- Hot node contains direct node blocks of directories.
+- Warm node contains direct node blocks except hot node blocks.
+- Cold node contains indirect node blocks
+- Hot data contains dentry blocks
+- Warm data contains data blocks except hot and cold data blocks
+- Cold data contains multimedia data or migrated data blocks
+
+LFS has two schemes for free space management: threaded log and copy-and-compac-
+tion. The copy-and-compaction scheme which is known as cleaning, is well-suited
+for devices showing very good sequential write performance, since free segments
+are served all the time for writing new data. However, it suffers from cleaning
+overhead under high utilization. Contrarily, the threaded log scheme suffers
+from random writes, but no cleaning process is needed. F2FS adopts a hybrid
+scheme where the copy-and-compaction scheme is adopted by default, but the
+policy is dynamically changed to the threaded log scheme according to the file
+system status.
+
+In order to align F2FS with underlying flash-based storage, F2FS allocates a
+segment in a unit of section. F2FS expects that the section size would be the
+same as the unit size of garbage collection in FTL. Furthermore, with respect
+to the mapping granularity in FTL, F2FS allocates each section of the active
+logs from different zones as much as possible, since FTL can write the data in
+the active logs into one allocation unit according to its mapping granularity.
+
+Cleaning process
+----------------
+
+F2FS does cleaning both on demand and in the background. On-demand cleaning is
+triggered when there are not enough free segments to serve VFS calls. Background
+cleaner is operated by a kernel thread, and triggers the cleaning job when the
+system is idle.
+
+F2FS supports two victim selection policies: greedy and cost-benefit algorithms.
+In the greedy algorithm, F2FS selects a victim segment having the smallest number
+of valid blocks. In the cost-benefit algorithm, F2FS selects a victim segment
+according to the segment age and the number of valid blocks in order to address
+log block thrashing problem in the greedy algorithm. F2FS adopts the greedy
+algorithm for on-demand cleaner, while background cleaner adopts cost-benefit
+algorithm.
+
+In order to identify whether the data in the victim segment are valid or not,
+F2FS manages a bitmap. Each bit represents the validity of a block, and the
+bitmap is composed of a bit stream covering whole blocks in main area.
--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-10-31 09:40:12

[permalink] [raw]

Subject: [PATCH 02/17] f2fs: add on-disk layout

This adds a header file describing the on-disk layout of f2fs.

Signed-off-by: Changman Lee <[email protected]>
Signed-off-by: Chul Lee <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
---
include/linux/f2fs_fs.h | 410 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 410 insertions(+)
create mode 100644 include/linux/f2fs_fs.h

diff --git a/include/linux/f2fs_fs.h b/include/linux/f2fs_fs.h
new file mode 100644
index 0000000..1429ece
--- /dev/null
+++ b/include/linux/f2fs_fs.h
@@ -0,0 +1,410 @@
+/**
+ * include/linux/f2fs_fs.h
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef _LINUX_F2FS_FS_H
+#define _LINUX_F2FS_FS_H
+
+#include <linux/pagemap.h>
+#include <linux/types.h>
+
+#define F2FS_SUPER_OFFSET 1024 /* byte-size offset */
+#define F2FS_LOG_SECTOR_SIZE 9 /* 9 bits for 512 byte */
+#define F2FS_LOG_SECTORS_PER_BLOCK 3 /* 4KB: F2FS_BLKSIZE */
+#define F2FS_BLKSIZE 4096 /* support only 4KB block */
+#define F2FS_MAX_EXTENSION 64 /* # of extension entries */
+
+#define NULL_ADDR 0x0U
+#define NEW_ADDR -1U
+
+#define F2FS_ROOT_INO(sbi) (sbi->root_ino_num)
+#define F2FS_NODE_INO(sbi) (sbi->node_ino_num)
+#define F2FS_META_INO(sbi) (sbi->meta_ino_num)
+
+/* This flag is used by node and meta inodes, and by recovery */
+#define GFP_F2FS_ZERO (GFP_NOFS | __GFP_ZERO)
+
+/*
+ * For further optimization on multi-head logs, on-disk layout supports maximum
+ * 16 logs by default. The number, 16, is expected to cover all the cases
+ * enoughly. The implementaion currently uses no more than 6 logs.
+ * Half the logs are used for nodes, and the other half are used for data.
+ */
+#define MAX_ACTIVE_LOGS 16
+#define MAX_ACTIVE_NODE_LOGS 8
+#define MAX_ACTIVE_DATA_LOGS 8
+
+/*
+ * For superblock
+ */
+struct f2fs_super_block {
+ __le32 magic; /* Magic Number */
+ __le16 major_ver; /* Major Version */
+ __le16 minor_ver; /* Minor Version */
+ __le32 log_sectorsize; /* log2 sector size in bytes */
+ __le32 log_sectors_per_block; /* log2 # of sectors per block */
+ __le32 log_blocksize; /* log2 block size in bytes */
+ __le32 log_blocks_per_seg; /* log2 # of blocks per segment */
+ __le32 segs_per_sec; /* # of segments per section */
+ __le32 secs_per_zone; /* # of sections per zone */
+ __le32 checksum_offset; /* checksum offset inside super block */
+ __le64 block_count; /* total # of user blocks */
+ __le32 section_count; /* total # of sections */
+ __le32 segment_count; /* total # of segments */
+ __le32 segment_count_ckpt; /* # of segments for checkpoint */
+ __le32 segment_count_sit; /* # of segments for SIT */
+ __le32 segment_count_nat; /* # of segments for NAT */
+ __le32 segment_count_ssa; /* # of segments for SSA */
+ __le32 segment_count_main; /* # of segments for main area */
+ __le32 segment0_blkaddr; /* start block address of segment 0 */
+ __le32 cp_blkaddr; /* start block address of checkpoint */
+ __le32 sit_blkaddr; /* start block address of SIT */
+ __le32 nat_blkaddr; /* start block address of NAT */
+ __le32 ssa_blkaddr; /* start block address of SSA */
+ __le32 main_blkaddr; /* start block address of main area */
+ __le32 root_ino; /* root inode number */
+ __le32 node_ino; /* node inode number */
+ __le32 meta_ino; /* meta inode number */
+ __u8 uuid[16]; /* 128-bit uuid for volume */
+ __le16 volume_name[512]; /* volume name */
+ __le32 extension_count; /* # of extensions below */
+ __u8 extension_list[F2FS_MAX_EXTENSION][8]; /* extension array */
+} __packed;
+
+/*
+ * For checkpoint
+ */
+#define CP_ERROR_FLAG 0x00000008
+#define CP_COMPACT_SUM_FLAG 0x00000004
+#define CP_ORPHAN_PRESENT_FLAG 0x00000002
+#define CP_UMOUNT_FLAG 0x00000001
+
+struct f2fs_checkpoint {
+ __le64 checkpoint_ver; /* checkpoint block version number */
+ __le64 user_block_count; /* # of user blocks */
+ __le64 valid_block_count; /* # of valid blocks in main area */
+ __le32 rsvd_segment_count; /* # of reserved segments for gc */
+ __le32 overprov_segment_count; /* # of overprovision segments */
+ __le32 free_segment_count; /* # of free segments in main area */
+
+ /* information of current node segments */
+ __le32 cur_node_segno[MAX_ACTIVE_NODE_LOGS];
+ __le16 cur_node_blkoff[MAX_ACTIVE_NODE_LOGS];
+ /* information of current data segments */
+ __le32 cur_data_segno[MAX_ACTIVE_DATA_LOGS];
+ __le16 cur_data_blkoff[MAX_ACTIVE_DATA_LOGS];
+ __le32 ckpt_flags; /* Flags : umount and journal_present */
+ __le32 cp_pack_total_block_count; /* total # of one cp pack */
+ __le32 cp_pack_start_sum; /* start block number of data summary */
+ __le32 valid_node_count; /* Total number of valid nodes */
+ __le32 valid_inode_count; /* Total number of valid inodes */
+ __le32 next_free_nid; /* Next free node number */
+ __le32 sit_ver_bitmap_bytesize; /* Default value 64 */
+ __le32 nat_ver_bitmap_bytesize; /* Default value 256 */
+ __le32 checksum_offset; /* checksum offset inside cp block */
+ __le64 elapsed_time; /* mounted time */
+ /* allocation type of current segment */
+ unsigned char alloc_type[MAX_ACTIVE_LOGS];
+
+ /* SIT and NAT version bitmap */
+ unsigned char sit_nat_version_bitmap[1];
+} __packed;
+
+/*
+ * For orphan inode management
+ */
+#define F2FS_ORPHANS_PER_BLOCK 1020
+
+struct f2fs_orphan_block {
+ __le32 ino[F2FS_ORPHANS_PER_BLOCK]; /* inode numbers */
+ __le32 reserved; /* reserved */
+ __le16 blk_addr; /* block index in current CP */
+ __le16 blk_count; /* Number of orphan inode blocks in CP */
+ __le32 entry_count; /* Total number of orphan nodes in current CP */
+ __le32 check_sum; /* CRC32 for orphan inode block */
+} __packed;
+
+/*
+ * For NODE structure
+ */
+struct f2fs_extent {
+ __le32 fofs; /* start file offset of the extent */
+ __le32 blk_addr; /* start block address of the extent */
+ __le32 len; /* lengh of the extent */
+} __packed;
+
+#define F2FS_MAX_NAME_LEN 256
+#define ADDRS_PER_INODE 923 /* Address Pointers in an Inode */
+#define ADDRS_PER_BLOCK 1018 /* Address Pointers in a Direct Block */
+#define NIDS_PER_BLOCK 1018 /* Node IDs in an Indirect Block */
+
+struct f2fs_inode {
+ __le16 i_mode; /* file mode */
+ __u8 i_advise; /* file hints */
+ __u8 i_reserved; /* reserved */
+ __le32 i_uid; /* user ID */
+ __le32 i_gid; /* group ID */
+ __le32 i_links; /* links count */
+ __le64 i_size; /* file size in bytes */
+ __le64 i_blocks; /* file size in blocks */
+ __le64 i_atime; /* access time */
+ __le64 i_ctime; /* change time */
+ __le64 i_mtime; /* modification time */
+ __le32 i_atime_nsec; /* access time in nano scale */
+ __le32 i_ctime_nsec; /* change time in nano scale */
+ __le32 i_mtime_nsec; /* modification time in nano scale */
+ __le32 i_generation; /* file version (for NFS) */
+ __le32 i_current_depth; /* only for directory depth */
+ __le32 i_xattr_nid; /* nid to save xattr */
+ __le32 i_flags; /* file attributes */
+ __le32 i_pino; /* parent inode number */
+ __le32 i_namelen; /* file name length */
+ __u8 i_name[F2FS_MAX_NAME_LEN]; /* file name for SPOR */
+
+ struct f2fs_extent i_ext; /* caching a largest extent */
+
+ __le32 i_addr[ADDRS_PER_INODE]; /* Pointers to data blocks */
+
+ __le32 i_nid[5]; /* direct(2), indirect(2),
+ double_indirect(1) node id */
+} __packed;
+
+struct direct_node {
+ __le32 addr[ADDRS_PER_BLOCK]; /* array of data block address */
+} __packed;
+
+struct indirect_node {
+ __le32 nid[NIDS_PER_BLOCK]; /* array of data block address */
+} __packed;
+
+enum {
+ COLD_BIT_SHIFT = 0,
+ FSYNC_BIT_SHIFT,
+ DENT_BIT_SHIFT,
+ OFFSET_BIT_SHIFT
+};
+
+struct node_footer {
+ __le32 nid; /* node id */
+ __le32 ino; /* inode nunmber */
+ __le32 flag; /* include cold/fsync/dentry marks and offset */
+ __le64 cp_ver; /* checkpoint version */
+ __le32 next_blkaddr; /* next node page block address */
+} __packed;
+
+struct f2fs_node {
+ /* can be one of three types: inode, direct, and indirect types */
+ union {
+ struct f2fs_inode i;
+ struct direct_node dn;
+ struct indirect_node in;
+ };
+ struct node_footer footer;
+} __packed;
+
+/*
+ * For NAT entries
+ */
+#define NAT_ENTRY_PER_BLOCK (PAGE_CACHE_SIZE / sizeof(struct f2fs_nat_entry))
+
+struct f2fs_nat_entry {
+ __u8 version; /* latest version of cached nat entry */
+ __le32 ino; /* inode number */
+ __le32 block_addr; /* block address */
+} __packed;
+
+struct f2fs_nat_block {
+ struct f2fs_nat_entry entries[NAT_ENTRY_PER_BLOCK];
+} __packed;
+
+/*
+ * For SIT entries
+ *
+ * Each segment is 2MB in size by default so that a bitmap for validity of
+ * there-in blocks should occupy 64 bytes, 512 bits.
+ * Not allow to change this.
+ */
+#define SIT_VBLOCK_MAP_SIZE 64
+#define SIT_ENTRY_PER_BLOCK (PAGE_CACHE_SIZE / sizeof(struct f2fs_sit_entry))
+
+/*
+ * Note that f2fs_sit_entry->vblocks has the following bit-field information.
+ * [15:10] : allocation type such as CURSEG_XXXX_TYPE
+ * [9:0] : valid block count
+ */
+#define SIT_VBLOCKS_SHIFT 10
+#define SIT_VBLOCKS_MASK ((1 << SIT_VBLOCKS_SHIFT) - 1)
+#define GET_SIT_VBLOCKS(raw_sit) \
+ (le16_to_cpu((raw_sit)->vblocks) & SIT_VBLOCKS_MASK)
+#define GET_SIT_TYPE(raw_sit) \
+ ((le16_to_cpu((raw_sit)->vblocks) & ~SIT_VBLOCKS_MASK) \
+ >> SIT_VBLOCKS_SHIFT)
+
+struct f2fs_sit_entry {
+ __le16 vblocks; /* reference above */
+ __u8 valid_map[SIT_VBLOCK_MAP_SIZE]; /* bitmap for valid blocks */
+ __le64 mtime; /* segment age for cleaning */
+} __packed;
+
+struct f2fs_sit_block {
+ struct f2fs_sit_entry entries[SIT_ENTRY_PER_BLOCK];
+} __packed;
+
+/*
+ * For segment summary
+ *
+ * One summary block contains exactly 512 summary entries, which represents
+ * exactly 2MB segment by default. Not allow to change the basic units.
+ *
+ * NOTE: For initializing fields, you must use set_summary
+ *
+ * - If data page, nid represents dnode's nid
+ * - If node page, nid represents the node page's nid.
+ *
+ * The ofs_in_node is used by only data page. It represents offset
+ * from node's page's beginning to get a data block address.
+ * ex) data_blkaddr = (block_t)(nodepage_start_address + ofs_in_node)
+ */
+#define ENTRIES_IN_SUM 512
+#define SUMMARY_SIZE (sizeof(struct f2fs_summary))
+#define SUM_FOOTER_SIZE (sizeof(struct summary_footer))
+#define SUM_ENTRY_SIZE (SUMMARY_SIZE * ENTRIES_IN_SUM)
+
+/* a summary entry for a 4KB-sized block in a segment */
+struct f2fs_summary {
+ __le32 nid; /* parent node id */
+ union {
+ __u8 reserved[3];
+ struct {
+ __u8 version; /* node version number */
+ __le16 ofs_in_node; /* block index in parent node */
+ } __packed;
+ };
+} __packed;
+
+/* summary block type, node or data, is stored to the summary_footer */
+#define SUM_TYPE_NODE (1)
+#define SUM_TYPE_DATA (0)
+
+struct summary_footer {
+ unsigned char entry_type; /* SUM_TYPE_XXX */
+ __u32 check_sum; /* summary checksum */
+} __packed;
+
+#define SUM_JOURNAL_SIZE (PAGE_CACHE_SIZE - SUM_FOOTER_SIZE -\
+ SUM_ENTRY_SIZE)
+#define NAT_JOURNAL_ENTRIES ((SUM_JOURNAL_SIZE - 2) /\
+ sizeof(struct nat_journal_entry))
+#define NAT_JOURNAL_RESERVED ((SUM_JOURNAL_SIZE - 2) %\
+ sizeof(struct nat_journal_entry))
+#define SIT_JOURNAL_ENTRIES ((SUM_JOURNAL_SIZE - 2) /\
+ sizeof(struct sit_journal_entry))
+#define SIT_JOURNAL_RESERVED ((SUM_JOURNAL_SIZE - 2) %\
+ sizeof(struct sit_journal_entry))
+/*
+ * frequently updated NAT/SIT entries can be stored in the spare area in
+ * summary blocks
+ */
+enum {
+ NAT_JOURNAL = 0,
+ SIT_JOURNAL
+};
+
+struct nat_journal_entry {
+ __le32 nid;
+ struct f2fs_nat_entry ne;
+} __packed;
+
+struct nat_journal {
+ struct nat_journal_entry entries[NAT_JOURNAL_ENTRIES];
+ __u8 reserved[NAT_JOURNAL_RESERVED];
+} __packed;
+
+struct sit_journal_entry {
+ __le32 segno;
+ struct f2fs_sit_entry se;
+} __packed;
+
+struct sit_journal {
+ struct sit_journal_entry entries[SIT_JOURNAL_ENTRIES];
+ __u8 reserved[SIT_JOURNAL_RESERVED];
+} __packed;
+
+/* 4KB-sized summary block structure */
+struct f2fs_summary_block {
+ struct f2fs_summary entries[ENTRIES_IN_SUM];
+ union {
+ __le16 n_nats;
+ __le16 n_sits;
+ };
+ /* spare area is used by NAT or SIT journals */
+ union {
+ struct nat_journal nat_j;
+ struct sit_journal sit_j;
+ };
+ struct summary_footer footer;
+} __packed;
+
+/*
+ * For directory operations
+ */
+#define F2FS_DOT_HASH 0
+#define F2FS_DDOT_HASH F2FS_DOT_HASH
+#define F2FS_MAX_HASH (~((0x3ULL) << 62))
+#define F2FS_HASH_COL_BIT ((0x1ULL) << 63)
+
+typedef __le32 f2fs_hash_t;
+
+/* One directory entry slot covers 8bytes-long file name */
+#define F2FS_NAME_LEN 8
+
+/* the number of dentry in a block */
+#define NR_DENTRY_IN_BLOCK 214
+
+/* MAX level for dir lookup */
+#define MAX_DIR_HASH_DEPTH 63
+
+#define SIZE_OF_DIR_ENTRY 11 /* by byte */
+#define SIZE_OF_DENTRY_BITMAP ((NR_DENTRY_IN_BLOCK + BITS_PER_BYTE - 1) / \
+ BITS_PER_BYTE)
+#define SIZE_OF_RESERVED (PAGE_SIZE - ((SIZE_OF_DIR_ENTRY + \
+ F2FS_NAME_LEN) * \
+ NR_DENTRY_IN_BLOCK + SIZE_OF_DENTRY_BITMAP))
+
+/* One directory entry slot representing F2FS_NAME_LEN-sized file name */
+struct f2fs_dir_entry {
+ __le32 hash_code; /* hash code of file name */
+ __le32 ino; /* inode number */
+ __le16 name_len; /* lengh of file name */
+ __u8 file_type; /* file type */
+} __packed;
+
+/* 4KB-sized directory entry block */
+struct f2fs_dentry_block {
+ /* validity bitmap for directory entries in each block */
+ __u8 dentry_bitmap[SIZE_OF_DENTRY_BITMAP];
+ __u8 reserved[SIZE_OF_RESERVED];
+ struct f2fs_dir_entry dentry[NR_DENTRY_IN_BLOCK];
+ __u8 filename[NR_DENTRY_IN_BLOCK][F2FS_NAME_LEN];
+} __packed;
+
+/* file types used in inode_info->flags */
+enum {
+ F2FS_FT_UNKNOWN,
+ F2FS_FT_REG_FILE,
+ F2FS_FT_DIR,
+ F2FS_FT_CHRDEV,
+ F2FS_FT_BLKDEV,
+ F2FS_FT_FIFO,
+ F2FS_FT_SOCK,
+ F2FS_FT_SYMLINK,
+ F2FS_FT_MAX
+};
+
+#endif /* _LINUX_F2FS_FS_H */
--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-10-31 09:41:18

[permalink] [raw]

Subject: [PATCH 03/17] f2fs: add superblock and major in-memory structure

This adds the following major in-memory structures in f2fs.

- f2fs_sb_info:
contains f2fs-specific information, two special inode pointers for node and
meta address spaces, and orphan inode management.

- f2fs_inode_info:
contains vfs_inode and other fs-specific information.

- f2fs_nm_info:
contains node manager information such as NAT entry cache, free nid list,
and NAT page management.

- f2fs_node_info:
represents a node as node id, inode number, block address, and its version.

- f2fs_sm_info:
contains segment manager information such as SIT entry cache, free segment
map, current active logs, dirty segment management, and segment utilization.
The specific structures are sit_info, free_segmap_info, dirty_seglist_info,
curseg_info.

Signed-off-by: Chul Lee <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/f2fs.h | 1062 +++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/f2fs/node.h | 353 ++++++++++++++++++
fs/f2fs/segment.h | 615 +++++++++++++++++++++++++++++++
3 files changed, 2030 insertions(+)
create mode 100644 fs/f2fs/f2fs.h
create mode 100644 fs/f2fs/node.h
create mode 100644 fs/f2fs/segment.h

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
new file mode 100644
index 0000000..7aa70b5
--- /dev/null
+++ b/fs/f2fs/f2fs.h
@@ -0,0 +1,1062 @@
+/**
+ * fs/f2fs/f2fs.h
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef _LINUX_F2FS_H
+#define _LINUX_F2FS_H
+
+#include <linux/types.h>
+#include <linux/page-flags.h>
+#include <linux/buffer_head.h>
+#include <linux/version.h>
+#include <linux/slab.h>
+#include <linux/crc32.h>
+#include <linux/magic.h>
+
+/*
+ * For mount options
+ */
+#define F2FS_MOUNT_BG_GC 0x00000001
+#define F2FS_MOUNT_DISABLE_ROLL_FORWARD 0x00000002
+#define F2FS_MOUNT_DISCARD 0x00000004
+#define F2FS_MOUNT_NOHEAP 0x00000008
+#define F2FS_MOUNT_XATTR_USER 0x00000010
+#define F2FS_MOUNT_POSIX_ACL 0x00000020
+#define F2FS_MOUNT_DISABLE_EXT_IDENTIFY 0x00000040
+
+#define clear_opt(sbi, option) (sbi->mount_opt.opt &= ~F2FS_MOUNT_##option)
+#define set_opt(sbi, option) (sbi->mount_opt.opt |= F2FS_MOUNT_##option)
+#define test_opt(sbi, option) (sbi->mount_opt.opt & F2FS_MOUNT_##option)
+
+#define ver_after(a, b) (typecheck(unsigned long long, a) && \
+ typecheck(unsigned long long, b) && \
+ ((long long)((a) - (b)) > 0))
+
+typedef u64 block_t;
+typedef u32 nid_t;
+
+struct f2fs_mount_info {
+ unsigned int opt;
+};
+
+static inline __u32 f2fs_crc32(void *buff, size_t len)
+{
+ return crc32_le(F2FS_SUPER_MAGIC, buff, len);
+}
+
+static inline bool f2fs_crc_valid(__u32 blk_crc, void *buff, size_t buff_size)
+{
+ return f2fs_crc32(buff, buff_size) == blk_crc;
+}
+
+/*
+ * For checkpoint manager
+ */
+enum {
+ NAT_BITMAP,
+ SIT_BITMAP
+};
+
+/* for the list of orphan inodes */
+struct orphan_inode_entry {
+ struct list_head list; /* list head */
+ nid_t ino; /* inode number */
+};
+
+/* for the list of directory inodes */
+struct dir_inode_entry {
+ struct list_head list; /* list head */
+ struct inode *inode; /* vfs inode pointer */
+};
+
+/* for the list of fsync inodes, used only during recovery */
+struct fsync_inode_entry {
+ struct list_head list; /* list head */
+ struct inode *inode; /* vfs inode pointer */
+ block_t blkaddr; /* block address locating the last inode */
+};
+
+#define nats_in_cursum(sum) (le16_to_cpu(sum->n_nats))
+#define sits_in_cursum(sum) (le16_to_cpu(sum->n_sits))
+
+#define nat_in_journal(sum, i) (sum->nat_j.entries[i].ne)
+#define nid_in_journal(sum, i) (sum->nat_j.entries[i].nid)
+#define sit_in_journal(sum, i) (sum->sit_j.entries[i].se)
+#define segno_in_journal(sum, i) (sum->sit_j.entries[i].segno)
+
+static inline int update_nats_in_cursum(struct f2fs_summary_block *rs, int i)
+{
+ int before = nats_in_cursum(rs);
+ rs->n_nats = cpu_to_le16(before + i);
+ return before;
+}
+
+static inline int update_sits_in_cursum(struct f2fs_summary_block *rs, int i)
+{
+ int before = sits_in_cursum(rs);
+ rs->n_sits = cpu_to_le16(before + i);
+ return before;
+}
+
+/*
+ * For INODE and NODE manager
+ */
+#define XATTR_NODE_OFFSET (-1) /*
+ * store xattrs to one node block per
+ * file keeping -1 as its node offset to
+ * distinguish from index node blocks.
+ */
+#define RDONLY_NODE 1 /*
+ * specify a read-only mode when getting
+ * a node block. 0 is read-write mode.
+ * used by get_dnode_of_data().
+ */
+#define F2FS_LINK_MAX 32000 /* maximum link count per file */
+
+/* for in-memory extent cache entry */
+struct extent_info {
+ rwlock_t ext_lock; /* rwlock for consistency */
+ unsigned int fofs; /* start offset in a file */
+ u32 blk_addr; /* start block address of the extent */
+ unsigned int len; /* lenth of the extent */
+};
+
+/*
+ * i_advise uses FADVISE_XXX_BIT. We can add additional hints later.
+ */
+#define FADVISE_COLD_BIT 0x01
+
+struct f2fs_inode_info {
+ struct inode vfs_inode; /* serve a vfs inode */
+ unsigned long i_flags; /* keep an inode flags for ioctl */
+ unsigned char i_advise; /* use to give file attribute hints */
+ unsigned int i_current_depth; /* use only in directory structure */
+ umode_t i_acl_mode; /* keep file acl mode temporarily */
+
+ /* Use below internally in f2fs*/
+ unsigned long flags; /* use to pass per-file flags */
+ unsigned long long data_version;/* lastes version of data for fsync */
+ atomic_t dirty_dents; /* # of dirty dentry pages */
+ f2fs_hash_t chash; /* hash value of given file name */
+ unsigned int clevel; /* maximum level of given file name */
+ nid_t i_xattr_nid; /* node id that contains xattrs */
+ struct extent_info ext; /* in-memory extent cache entry */
+};
+
+static inline void get_extent_info(struct extent_info *ext,
+ struct f2fs_extent i_ext)
+{
+ write_lock(&ext->ext_lock);
+ ext->fofs = le32_to_cpu(i_ext.fofs);
+ ext->blk_addr = le32_to_cpu(i_ext.blk_addr);
+ ext->len = le32_to_cpu(i_ext.len);
+ write_unlock(&ext->ext_lock);
+}
+
+static inline void set_raw_extent(struct extent_info *ext,
+ struct f2fs_extent *i_ext)
+{
+ read_lock(&ext->ext_lock);
+ i_ext->fofs = cpu_to_le32(ext->fofs);
+ i_ext->blk_addr = cpu_to_le32(ext->blk_addr);
+ i_ext->len = cpu_to_le32(ext->len);
+ read_unlock(&ext->ext_lock);
+}
+
+struct f2fs_nm_info {
+ block_t nat_blkaddr; /* base disk address of NAT */
+ nid_t max_nid; /* maximum possible node ids */
+ nid_t init_scan_nid; /* the first nid to be scanned */
+ nid_t next_scan_nid; /* the next nid to be scanned */
+
+ /* NAT cache management */
+ struct radix_tree_root nat_root;/* root of the nat entry cache */
+ rwlock_t nat_tree_lock; /* protect nat_tree_lock */
+ unsigned int nat_cnt; /* the # of cached nat entries */
+ struct list_head nat_entries; /* cached nat entry list (clean) */
+ struct list_head dirty_nat_entries; /* cached nat entry list (dirty) */
+
+ /* free node ids management */
+ struct list_head free_nid_list; /* a list for free nids */
+ spinlock_t free_nid_list_lock; /* protect free nid list */
+ unsigned int fcnt; /* the number of free node id */
+ struct mutex build_lock; /* lock for build free nids */
+
+ /* for checkpoint */
+ char *nat_bitmap; /* NAT bitmap pointer */
+ int bitmap_size; /* bitmap size */
+};
+
+/*
+ * this structure is used as one of function parameters.
+ * all the information are dedicated to a given direct node block determined
+ * by the data offset in a file.
+ */
+struct dnode_of_data {
+ struct inode *inode; /* vfs inode pointer */
+ struct page *inode_page; /* its inode page, NULL is possible */
+ struct page *node_page; /* cached direct node page */
+ nid_t nid; /* node id of the direct node block */
+ unsigned int ofs_in_node; /* data offset in the node page */
+ bool inode_page_locked; /* inode page is locked or not */
+ block_t data_blkaddr; /* block address of the node block */
+};
+
+static inline void set_new_dnode(struct dnode_of_data *dn, struct inode *inode,
+ struct page *ipage, struct page *npage, nid_t nid)
+{
+ dn->inode = inode;
+ dn->inode_page = ipage;
+ dn->node_page = npage;
+ dn->nid = nid;
+ dn->inode_page_locked = 0;
+}
+
+/*
+ * For SIT manager
+ *
+ * By default, there are 6 active log areas across the whole main area.
+ * When considering hot and cold data separation to reduce cleaning overhead,
+ * we split 3 for data logs and 3 for node logs as hot, warm, and cold types,
+ * respectively.
+ * In the current design, you should not change the numbers intentionally.
+ * Instead, as a mount option such as active_logs=x, you can use 2, 4, and 6
+ * logs individually according to the underlying devices. (default: 6)
+ * Just in case, on-disk layout covers maximum 16 logs that consist of 8 for
+ * data and 8 for node logs.
+ */
+#define NR_CURSEG_DATA_TYPE (3)
+#define NR_CURSEG_NODE_TYPE (3)
+#define NR_CURSEG_TYPE (NR_CURSEG_DATA_TYPE + NR_CURSEG_NODE_TYPE)
+
+enum {
+ CURSEG_HOT_DATA = 0, /* directory entry blocks */
+ CURSEG_WARM_DATA, /* data blocks */
+ CURSEG_COLD_DATA, /* multimedia or GCed data blocks */
+ CURSEG_HOT_NODE, /* direct node blocks of directory files */
+ CURSEG_WARM_NODE, /* direct node blocks of normal files */
+ CURSEG_COLD_NODE, /* indirect node blocks */
+ NO_CHECK_TYPE
+};
+
+struct f2fs_sm_info {
+ struct sit_info *sit_info; /* whole segment information */
+ struct free_segmap_info *free_info; /* free segment information */
+ struct dirty_seglist_info *dirty_info; /* dirty segment information */
+ struct curseg_info *curseg_array; /* active segment information */
+
+ struct list_head wblist_head; /* list of under-writeback pages */
+ spinlock_t wblist_lock; /* lock for checkpoint */
+
+ block_t seg0_blkaddr; /* block address of 0'th segment */
+ block_t main_blkaddr; /* start block address of main area */
+ block_t ssa_blkaddr; /* start block address of SSA area */
+
+ unsigned int segment_count; /* total # of segments */
+ unsigned int main_segments; /* # of segments in main area */
+ unsigned int reserved_segments; /* # of reserved segments */
+ unsigned int ovp_segments; /* # of overprovision segments */
+};
+
+/*
+ * For directory operation
+ */
+#define NODE_DIR1_BLOCK (ADDRS_PER_INODE + 1)
+#define NODE_DIR2_BLOCK (ADDRS_PER_INODE + 2)
+#define NODE_IND1_BLOCK (ADDRS_PER_INODE + 3)
+#define NODE_IND2_BLOCK (ADDRS_PER_INODE + 4)
+#define NODE_DIND_BLOCK (ADDRS_PER_INODE + 5)
+
+/*
+ * For superblock
+ */
+/*
+ * COUNT_TYPE for monitoring
+ *
+ * f2fs monitors the number of several block types such as on-writeback,
+ * dirty dentry blocks, dirty node blocks, and dirty meta blocks.
+ */
+enum count_type {
+ F2FS_WRITEBACK,
+ F2FS_DIRTY_DENTS,
+ F2FS_DIRTY_NODES,
+ F2FS_DIRTY_META,
+ NR_COUNT_TYPE,
+};
+
+/*
+ * FS_LOCK nesting subclasses for the lock validator:
+ *
+ * The locking order between these classes is
+ * RENAME -> DENTRY_OPS -> DATA_WRITE -> DATA_NEW
+ * -> DATA_TRUNC -> NODE_WRITE -> NODE_NEW -> NODE_TRUNC
+ */
+enum lock_type {
+ RENAME, /* for renaming operations */
+ DENTRY_OPS, /* for directory operations */
+ DATA_WRITE, /* for data write */
+ DATA_NEW, /* for data allocation */
+ DATA_TRUNC, /* for data truncate */
+ NODE_NEW, /* for node allocation */
+ NODE_TRUNC, /* for node truncate */
+ NODE_WRITE, /* for node write */
+ NR_LOCK_TYPE,
+};
+
+/*
+ * The below are the page types of bios used in submti_bio().
+ * The available types are:
+ * DATA User data pages. It operates as async mode.
+ * NODE Node pages. It operates as async mode.
+ * META FS metadata pages such as SIT, NAT, CP.
+ * NR_PAGE_TYPE The number of page types.
+ * META_FLUSH Make sure the previous pages are written
+ * with waiting the bio's completion
+ * ... Only can be used with META.
+ */
+enum page_type {
+ DATA,
+ NODE,
+ META,
+ NR_PAGE_TYPE,
+ META_FLUSH,
+};
+
+struct f2fs_sb_info {
+ struct super_block *sb; /* pointer to VFS super block */
+ struct buffer_head *raw_super_buf; /* buffer head of raw sb */
+ struct f2fs_super_block *raw_super; /* raw super block pointer */
+ int s_dirty; /* dirty flag for checkpoint */
+
+ /* for node-related operations */
+ struct f2fs_nm_info *nm_info; /* node manager */
+ struct inode *node_inode; /* cache node blocks */
+
+ /* for segment-related operations */
+ struct f2fs_sm_info *sm_info; /* segment manager */
+ struct bio *bio[NR_PAGE_TYPE]; /* bios to merge */
+ sector_t last_block_in_bio[NR_PAGE_TYPE]; /* last block number */
+ struct rw_semaphore bio_sem; /* IO semaphore */
+
+ /* for checkpoint */
+ struct f2fs_checkpoint *ckpt; /* raw checkpoint pointer */
+ struct inode *meta_inode; /* cache meta blocks */
+ struct mutex cp_mutex; /* for checkpoint procedure */
+ struct mutex fs_lock[NR_LOCK_TYPE]; /* for blocking FS operations */
+ struct mutex write_inode; /* mutex for write inode */
+ struct mutex writepages; /* mutex for writepages() */
+ int por_doing; /* recovery is doing or not */
+
+ /* for orphan inode management */
+ struct list_head orphan_inode_list; /* orphan inode list */
+ struct mutex orphan_inode_mutex; /* for orphan inode list */
+ unsigned int n_orphans; /* # of orphan inodes */
+
+ /* for directory inode management */
+ struct list_head dir_inode_list; /* dir inode list */
+ spinlock_t dir_inode_lock; /* for dir inode list lock */
+ unsigned int n_dirty_dirs; /* # of dir inodes */
+
+ /* basic file system units */
+ unsigned int log_sectors_per_block; /* log2 sectors per block */
+ unsigned int log_blocksize; /* log2 block size */
+ unsigned int blocksize; /* block size */
+ unsigned int root_ino_num; /* root inode number*/
+ unsigned int node_ino_num; /* node inode number*/
+ unsigned int meta_ino_num; /* meta inode number*/
+ unsigned int log_blocks_per_seg; /* log2 blocks per segment */
+ unsigned int blocks_per_seg; /* blocks per segment */
+ unsigned int segs_per_sec; /* segments per section */
+ unsigned int secs_per_zone; /* sections per zone */
+ unsigned int total_sections; /* total section count */
+ unsigned int total_node_count; /* total node block count */
+ unsigned int total_valid_node_count; /* valid node block count */
+ unsigned int total_valid_inode_count; /* valid inode count */
+ int active_logs; /* # of active logs */
+
+ block_t user_block_count; /* # of user blocks */
+ block_t total_valid_block_count; /* # of valid blocks */
+ block_t alloc_valid_block_count; /* # of allocated blocks */
+ block_t last_valid_block_count; /* for recovery */
+ u32 s_next_generation; /* for NFS support */
+ atomic_t nr_pages[NR_COUNT_TYPE]; /* # of pages, see count_type */
+
+ struct f2fs_mount_info mount_opt; /* mount options */
+
+ /* for cleaning operations */
+ struct mutex gc_mutex; /* mutex for GC */
+ struct f2fs_gc_kthread *gc_thread; /* GC thread */
+
+ /*
+ * for stat information.
+ * one is for the LFS mode, and the other is for the SSR mode.
+ */
+ struct f2fs_stat_info *stat_info; /* FS status information */
+ unsigned int segment_count[2]; /* # of allocated segments */
+ unsigned int block_count[2]; /* # of allocated blocks */
+ unsigned int last_victim[2]; /* last victim segment # */
+ int total_hit_ext, read_hit_ext; /* extent cache hit ratio */
+ int bg_gc; /* background gc calls */
+ spinlock_t stat_lock; /* lock for stat operations */
+};
+
+/*
+ * Inline functions
+ */
+static inline struct f2fs_inode_info *F2FS_I(struct inode *inode)
+{
+ return container_of(inode, struct f2fs_inode_info, vfs_inode);
+}
+
+static inline struct f2fs_sb_info *F2FS_SB(struct super_block *sb)
+{
+ return sb->s_fs_info;
+}
+
+static inline struct f2fs_super_block *F2FS_RAW_SUPER(struct f2fs_sb_info *sbi)
+{
+ return (struct f2fs_super_block *)(sbi->raw_super);
+}
+
+static inline struct f2fs_checkpoint *F2FS_CKPT(struct f2fs_sb_info *sbi)
+{
+ return (struct f2fs_checkpoint *)(sbi->ckpt);
+}
+
+static inline struct f2fs_nm_info *NM_I(struct f2fs_sb_info *sbi)
+{
+ return (struct f2fs_nm_info *)(sbi->nm_info);
+}
+
+static inline struct f2fs_sm_info *SM_I(struct f2fs_sb_info *sbi)
+{
+ return (struct f2fs_sm_info *)(sbi->sm_info);
+}
+
+static inline struct sit_info *SIT_I(struct f2fs_sb_info *sbi)
+{
+ return (struct sit_info *)(SM_I(sbi)->sit_info);
+}
+
+static inline struct free_segmap_info *FREE_I(struct f2fs_sb_info *sbi)
+{
+ return (struct free_segmap_info *)(SM_I(sbi)->free_info);
+}
+
+static inline struct dirty_seglist_info *DIRTY_I(struct f2fs_sb_info *sbi)
+{
+ return (struct dirty_seglist_info *)(SM_I(sbi)->dirty_info);
+}
+
+static inline void F2FS_SET_SB_DIRT(struct f2fs_sb_info *sbi)
+{
+ sbi->s_dirty = 1;
+}
+
+static inline void F2FS_RESET_SB_DIRT(struct f2fs_sb_info *sbi)
+{
+ sbi->s_dirty = 0;
+}
+
+static inline void mutex_lock_op(struct f2fs_sb_info *sbi, enum lock_type t)
+{
+ mutex_lock_nested(&sbi->fs_lock[t], t);
+}
+
+static inline void mutex_unlock_op(struct f2fs_sb_info *sbi, enum lock_type t)
+{
+ mutex_unlock(&sbi->fs_lock[t]);
+}
+
+/*
+ * Check whether the given nid is within node id range.
+ */
+static inline void check_nid_range(struct f2fs_sb_info *sbi, nid_t nid)
+{
+ BUG_ON((nid >= NM_I(sbi)->max_nid));
+}
+
+#define F2FS_DEFAULT_ALLOCATED_BLOCKS 1
+
+/*
+ * Check whether the inode has blocks or not
+ */
+static inline int F2FS_HAS_BLOCKS(struct inode *inode)
+{
+ if (F2FS_I(inode)->i_xattr_nid)
+ return (inode->i_blocks > F2FS_DEFAULT_ALLOCATED_BLOCKS + 1);
+ else
+ return (inode->i_blocks > F2FS_DEFAULT_ALLOCATED_BLOCKS);
+}
+
+static inline bool inc_valid_block_count(struct f2fs_sb_info *sbi,
+ struct inode *inode, blkcnt_t count)
+{
+ block_t valid_block_count;
+
+ spin_lock(&sbi->stat_lock);
+ valid_block_count =
+ sbi->total_valid_block_count + (block_t)count;
+ if (valid_block_count > sbi->user_block_count) {
+ spin_unlock(&sbi->stat_lock);
+ return false;
+ }
+ inode->i_blocks += count;
+ sbi->total_valid_block_count = valid_block_count;
+ sbi->alloc_valid_block_count += (block_t)count;
+ spin_unlock(&sbi->stat_lock);
+ return true;
+}
+
+static inline int dec_valid_block_count(struct f2fs_sb_info *sbi,
+ struct inode *inode,
+ blkcnt_t count)
+{
+ spin_lock(&sbi->stat_lock);
+ BUG_ON(sbi->total_valid_block_count < (block_t) count);
+ BUG_ON(inode->i_blocks < count);
+ inode->i_blocks -= count;
+ sbi->total_valid_block_count -= (block_t)count;
+ spin_unlock(&sbi->stat_lock);
+ return 0;
+}
+
+static inline void inc_page_count(struct f2fs_sb_info *sbi, int count_type)
+{
+ atomic_inc(&sbi->nr_pages[count_type]);
+ F2FS_SET_SB_DIRT(sbi);
+}
+
+static inline void inode_inc_dirty_dents(struct inode *inode)
+{
+ atomic_inc(&F2FS_I(inode)->dirty_dents);
+}
+
+static inline void dec_page_count(struct f2fs_sb_info *sbi, int count_type)
+{
+ atomic_dec(&sbi->nr_pages[count_type]);
+}
+
+static inline void inode_dec_dirty_dents(struct inode *inode)
+{
+ atomic_dec(&F2FS_I(inode)->dirty_dents);
+}
+
+static inline int get_pages(struct f2fs_sb_info *sbi, int count_type)
+{
+ return atomic_read(&sbi->nr_pages[count_type]);
+}
+
+static inline block_t valid_user_blocks(struct f2fs_sb_info *sbi)
+{
+ block_t ret;
+ spin_lock(&sbi->stat_lock);
+ ret = sbi->total_valid_block_count;
+ spin_unlock(&sbi->stat_lock);
+ return ret;
+}
+
+static inline unsigned long __bitmap_size(struct f2fs_sb_info *sbi, int flag)
+{
+ struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
+
+ /* return NAT or SIT bitmap */
+ if (flag == NAT_BITMAP)
+ return le32_to_cpu(ckpt->nat_ver_bitmap_bytesize);
+ else if (flag == SIT_BITMAP)
+ return le32_to_cpu(ckpt->sit_ver_bitmap_bytesize);
+
+ return 0;
+}
+
+static inline void *__bitmap_ptr(struct f2fs_sb_info *sbi, int flag)
+{
+ struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
+ int offset = (flag == NAT_BITMAP) ? ckpt->sit_ver_bitmap_bytesize : 0;
+ return &ckpt->sit_nat_version_bitmap + offset;
+}
+
+static inline block_t __start_cp_addr(struct f2fs_sb_info *sbi)
+{
+ block_t start_addr;
+ struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
+ unsigned long long ckpt_version = le64_to_cpu(ckpt->checkpoint_ver);
+
+ start_addr = le64_to_cpu(F2FS_RAW_SUPER(sbi)->cp_blkaddr);
+
+ /*
+ * odd numbered checkpoint should at cp segment 0
+ * and even segent must be at cp segment 1
+ */
+ if (!(ckpt_version & 1))
+ start_addr += sbi->blocks_per_seg;
+
+ return start_addr;
+}
+
+static inline block_t __start_sum_addr(struct f2fs_sb_info *sbi)
+{
+ return le32_to_cpu(F2FS_CKPT(sbi)->cp_pack_start_sum);
+}
+
+static inline bool inc_valid_node_count(struct f2fs_sb_info *sbi,
+ struct inode *inode,
+ unsigned int count)
+{
+ block_t valid_block_count;
+ unsigned int valid_node_count;
+
+ spin_lock(&sbi->stat_lock);
+
+ valid_block_count = sbi->total_valid_block_count + (block_t)count;
+ sbi->alloc_valid_block_count += (block_t)count;
+ valid_node_count = sbi->total_valid_node_count + count;
+
+ if (valid_block_count > sbi->user_block_count) {
+ spin_unlock(&sbi->stat_lock);
+ return false;
+ }
+
+ if (valid_node_count > sbi->total_node_count) {
+ spin_unlock(&sbi->stat_lock);
+ return false;
+ }
+
+ if (inode)
+ inode->i_blocks += count;
+ sbi->total_valid_node_count = valid_node_count;
+ sbi->total_valid_block_count = valid_block_count;
+ spin_unlock(&sbi->stat_lock);
+
+ return true;
+}
+
+static inline void dec_valid_node_count(struct f2fs_sb_info *sbi,
+ struct inode *inode,
+ unsigned int count)
+{
+ spin_lock(&sbi->stat_lock);
+
+ BUG_ON(sbi->total_valid_block_count < count);
+ BUG_ON(sbi->total_valid_node_count < count);
+ BUG_ON(inode->i_blocks < count);
+
+ inode->i_blocks -= count;
+ sbi->total_valid_node_count -= count;
+ sbi->total_valid_block_count -= (block_t)count;
+
+ spin_unlock(&sbi->stat_lock);
+}
+
+static inline unsigned int valid_node_count(struct f2fs_sb_info *sbi)
+{
+ unsigned int ret;
+ spin_lock(&sbi->stat_lock);
+ ret = sbi->total_valid_node_count;
+ spin_unlock(&sbi->stat_lock);
+ return ret;
+}
+
+static inline void inc_valid_inode_count(struct f2fs_sb_info *sbi)
+{
+ spin_lock(&sbi->stat_lock);
+ BUG_ON(sbi->total_valid_inode_count == sbi->total_node_count);
+ sbi->total_valid_inode_count++;
+ spin_unlock(&sbi->stat_lock);
+}
+
+static inline int dec_valid_inode_count(struct f2fs_sb_info *sbi)
+{
+ spin_lock(&sbi->stat_lock);
+ BUG_ON(!sbi->total_valid_inode_count);
+ sbi->total_valid_inode_count--;
+ spin_unlock(&sbi->stat_lock);
+ return 0;
+}
+
+static inline unsigned int valid_inode_count(struct f2fs_sb_info *sbi)
+{
+ unsigned int ret;
+ spin_lock(&sbi->stat_lock);
+ ret = sbi->total_valid_inode_count;
+ spin_unlock(&sbi->stat_lock);
+ return ret;
+}
+
+static inline void f2fs_put_page(struct page *page, int unlock)
+{
+ if (!page || IS_ERR(page))
+ return;
+
+ if (unlock) {
+ BUG_ON(!PageLocked(page));
+ unlock_page(page);
+ }
+ page_cache_release(page);
+}
+
+static inline void f2fs_put_dnode(struct dnode_of_data *dn)
+{
+ if (dn->node_page)
+ f2fs_put_page(dn->node_page, 1);
+ if (dn->inode_page && dn->node_page != dn->inode_page)
+ f2fs_put_page(dn->inode_page, 0);
+ dn->node_page = NULL;
+ dn->inode_page = NULL;
+}
+
+static inline struct kmem_cache *f2fs_kmem_cache_create(const char *name,
+ size_t size, void (*ctor)(void *))
+{
+ return kmem_cache_create(name, size, 0, SLAB_RECLAIM_ACCOUNT, ctor);
+}
+
+#define RAW_IS_INODE(p) ((p)->footer.nid == (p)->footer.ino)
+
+static inline bool IS_INODE(struct page *page)
+{
+ struct f2fs_node *p = (struct f2fs_node *)page_address(page);
+ return RAW_IS_INODE(p);
+}
+
+static inline __le32 *blkaddr_in_node(struct f2fs_node *node)
+{
+ return RAW_IS_INODE(node) ? node->i.i_addr : node->dn.addr;
+}
+
+static inline block_t datablock_addr(struct page *node_page,
+ unsigned int offset)
+{
+ struct f2fs_node *raw_node;
+ __le32 *addr_array;
+ raw_node = (struct f2fs_node *)page_address(node_page);
+ addr_array = blkaddr_in_node(raw_node);
+ return le32_to_cpu(addr_array[offset]);
+}
+
+static inline int f2fs_test_bit(unsigned int nr, char *addr)
+{
+ int mask;
+
+ addr += (nr >> 3);
+ mask = 1 << (7 - (nr & 0x07));
+ return mask & *addr;
+}
+
+static inline int f2fs_set_bit(unsigned int nr, char *addr)
+{
+ int mask;
+ int ret;
+
+ addr += (nr >> 3);
+ mask = 1 << (7 - (nr & 0x07));
+ ret = mask & *addr;
+ *addr |= mask;
+ return ret;
+}
+
+static inline int f2fs_clear_bit(unsigned int nr, char *addr)
+{
+ int mask;
+ int ret;
+
+ addr += (nr >> 3);
+ mask = 1 << (7 - (nr & 0x07));
+ ret = mask & *addr;
+ *addr &= ~mask;
+ return ret;
+}
+
+/* used for f2fs_inode_info->flags */
+enum {
+ FI_NEW_INODE, /* indicate newly allocated inode */
+ FI_NEED_CP, /* need to do checkpoint during fsync */
+ FI_INC_LINK, /* need to increment i_nlink */
+ FI_ACL_MODE, /* indicate acl mode */
+ FI_NO_ALLOC, /* should not allocate any blocks */
+};
+
+static inline void set_inode_flag(struct f2fs_inode_info *fi, int flag)
+{
+ set_bit(flag, &fi->flags);
+}
+
+static inline int is_inode_flag_set(struct f2fs_inode_info *fi, int flag)
+{
+ return test_bit(flag, &fi->flags);
+}
+
+static inline void clear_inode_flag(struct f2fs_inode_info *fi, int flag)
+{
+ clear_bit(flag, &fi->flags);
+}
+
+static inline void set_acl_inode(struct f2fs_inode_info *fi, umode_t mode)
+{
+ fi->i_acl_mode = mode;
+ set_inode_flag(fi, FI_ACL_MODE);
+}
+
+static inline int cond_clear_inode_flag(struct f2fs_inode_info *fi, int flag)
+{
+ if (is_inode_flag_set(fi, FI_ACL_MODE)) {
+ clear_inode_flag(fi, FI_ACL_MODE);
+ return 1;
+ }
+ return 0;
+}
+
+/*
+ * file.c
+ */
+int f2fs_sync_file(struct file *, loff_t, loff_t, int);
+void truncate_data_blocks(struct dnode_of_data *);
+void f2fs_truncate(struct inode *);
+int f2fs_setattr(struct dentry *, struct iattr *);
+int truncate_hole(struct inode *, pgoff_t, pgoff_t);
+long f2fs_ioctl(struct file *, unsigned int, unsigned long);
+
+/*
+ * inode.c
+ */
+void f2fs_set_inode_flags(struct inode *);
+struct inode *f2fs_iget_nowait(struct super_block *, unsigned long);
+struct inode *f2fs_iget(struct super_block *, unsigned long);
+void update_inode(struct inode *, struct page *);
+int f2fs_write_inode(struct inode *, struct writeback_control *);
+void f2fs_evict_inode(struct inode *);
+
+/*
+ * namei.c
+ */
+struct dentry *f2fs_get_parent(struct dentry *child);
+
+/*
+ * dir.c
+ */
+struct f2fs_dir_entry *f2fs_find_entry(struct inode *, struct qstr *,
+ struct page **);
+struct f2fs_dir_entry *f2fs_parent_dir(struct inode *, struct page **);
+ino_t f2fs_inode_by_name(struct inode *, struct qstr *);
+void f2fs_set_link(struct inode *, struct f2fs_dir_entry *,
+ struct page *, struct inode *);
+void init_dent_inode(struct dentry *, struct page *);
+int f2fs_add_link(struct dentry *, struct inode *);
+void f2fs_delete_entry(struct f2fs_dir_entry *, struct page *, struct inode *);
+int f2fs_make_empty(struct inode *, struct inode *);
+bool f2fs_empty_dir(struct inode *);
+
+/*
+ * super.c
+ */
+int f2fs_sync_fs(struct super_block *, int);
+
+/*
+ * hash.c
+ */
+f2fs_hash_t f2fs_dentry_hash(const char *, int);
+
+/*
+ * node.c
+ */
+struct dnode_of_data;
+struct node_info;
+
+int is_checkpointed_node(struct f2fs_sb_info *, nid_t);
+void get_node_info(struct f2fs_sb_info *, nid_t, struct node_info *);
+int get_dnode_of_data(struct dnode_of_data *, pgoff_t, int);
+int truncate_inode_blocks(struct inode *, pgoff_t);
+int remove_inode_page(struct inode *);
+int new_inode_page(struct inode *, struct dentry *);
+struct page *new_node_page(struct dnode_of_data *, unsigned int);
+void ra_node_page(struct f2fs_sb_info *, nid_t);
+struct page *get_node_page(struct f2fs_sb_info *, pgoff_t);
+struct page *get_node_page_ra(struct page *, int);
+void sync_inode_page(struct dnode_of_data *);
+int sync_node_pages(struct f2fs_sb_info *, nid_t, struct writeback_control *);
+bool alloc_nid(struct f2fs_sb_info *, nid_t *);
+void alloc_nid_done(struct f2fs_sb_info *, nid_t);
+void alloc_nid_failed(struct f2fs_sb_info *, nid_t);
+void recover_node_page(struct f2fs_sb_info *, struct page *,
+ struct f2fs_summary *, struct node_info *, block_t);
+int recover_inode_page(struct f2fs_sb_info *, struct page *);
+int restore_node_summary(struct f2fs_sb_info *, unsigned int,
+ struct f2fs_summary_block *);
+void flush_nat_entries(struct f2fs_sb_info *);
+int build_node_manager(struct f2fs_sb_info *);
+void destroy_node_manager(struct f2fs_sb_info *);
+int create_node_manager_caches(void);
+void destroy_node_manager_caches(void);
+
+/*
+ * segment.c
+ */
+void f2fs_balance_fs(struct f2fs_sb_info *);
+void invalidate_blocks(struct f2fs_sb_info *, block_t);
+void locate_dirty_segment(struct f2fs_sb_info *, unsigned int);
+void clear_prefree_segments(struct f2fs_sb_info *);
+int npages_for_summary_flush(struct f2fs_sb_info *);
+void allocate_new_segments(struct f2fs_sb_info *);
+struct page *get_sum_page(struct f2fs_sb_info *, unsigned int);
+struct bio *f2fs_bio_alloc(struct block_device *, sector_t, int, gfp_t);
+void f2fs_submit_bio(struct f2fs_sb_info *, enum page_type, bool sync);
+int write_meta_page(struct f2fs_sb_info *, struct page *,
+ struct writeback_control *);
+void write_node_page(struct f2fs_sb_info *, struct page *, unsigned int,
+ block_t, block_t *);
+void write_data_page(struct inode *, struct page *, struct dnode_of_data*,
+ block_t, block_t *);
+void rewrite_data_page(struct f2fs_sb_info *, struct page *, block_t);
+void recover_data_page(struct f2fs_sb_info *, struct page *,
+ struct f2fs_summary *, block_t, block_t);
+void rewrite_node_page(struct f2fs_sb_info *, struct page *,
+ struct f2fs_summary *, block_t, block_t);
+void write_data_summaries(struct f2fs_sb_info *, block_t);
+void write_node_summaries(struct f2fs_sb_info *, block_t);
+int lookup_journal_in_cursum(struct f2fs_summary_block *,
+ int, unsigned int, int);
+void flush_sit_entries(struct f2fs_sb_info *);
+int build_segment_manager(struct f2fs_sb_info *);
+void reset_victim_segmap(struct f2fs_sb_info *);
+void destroy_segment_manager(struct f2fs_sb_info *);
+
+/*
+ * checkpoint.c
+ */
+struct page *grab_meta_page(struct f2fs_sb_info *, pgoff_t);
+struct page *get_meta_page(struct f2fs_sb_info *, pgoff_t);
+long sync_meta_pages(struct f2fs_sb_info *, enum page_type, long);
+int check_orphan_space(struct f2fs_sb_info *);
+void add_orphan_inode(struct f2fs_sb_info *, nid_t);
+void remove_orphan_inode(struct f2fs_sb_info *, nid_t);
+int recover_orphan_inodes(struct f2fs_sb_info *);
+int get_valid_checkpoint(struct f2fs_sb_info *);
+void set_dirty_dir_page(struct inode *, struct page *);
+void remove_dirty_dir_inode(struct inode *);
+void sync_dirty_dir_inodes(struct f2fs_sb_info *);
+void block_operations(struct f2fs_sb_info *);
+void write_checkpoint(struct f2fs_sb_info *, bool, bool);
+void init_orphan_info(struct f2fs_sb_info *);
+int create_checkpoint_caches(void);
+void destroy_checkpoint_caches(void);
+
+/*
+ * data.c
+ */
+int reserve_new_block(struct dnode_of_data *);
+void update_extent_cache(block_t, struct dnode_of_data *);
+struct page *find_data_page(struct inode *, pgoff_t);
+struct page *get_lock_data_page(struct inode *, pgoff_t);
+struct page *get_new_data_page(struct inode *, pgoff_t, bool);
+int f2fs_readpage(struct f2fs_sb_info *, struct page *, block_t, int);
+int do_write_data_page(struct page *);
+
+/*
+ * gc.c
+ */
+int start_gc_thread(struct f2fs_sb_info *);
+void stop_gc_thread(struct f2fs_sb_info *);
+block_t start_bidx_of_node(unsigned int);
+int f2fs_gc(struct f2fs_sb_info *, int);
+void build_gc_manager(struct f2fs_sb_info *);
+int create_gc_caches(void);
+void destroy_gc_caches(void);
+
+/*
+ * recovery.c
+ */
+void recover_fsync_data(struct f2fs_sb_info *);
+bool space_for_roll_forward(struct f2fs_sb_info *);
+
+/*
+ * debug.c
+ */
+#ifdef CONFIG_F2FS_STAT_FS
+struct f2fs_stat_info {
+ struct list_head stat_list;
+ struct f2fs_sb_info *sbi;
+ struct mutex stat_lock;
+ int all_area_segs, sit_area_segs, nat_area_segs, ssa_area_segs;
+ int main_area_segs, main_area_sections, main_area_zones;
+ int hit_ext, total_ext;
+ int ndirty_node, ndirty_dent, ndirty_dirs, ndirty_meta;
+ int nats, sits, fnids;
+ int total_count, utilization;
+ int bg_gc;
+ unsigned int valid_count, valid_node_count, valid_inode_count;
+ unsigned int bimodal, avg_vblocks;
+ int util_free, util_valid, util_invalid;
+ int rsvd_segs, overp_segs;
+ int dirty_count, node_pages, meta_pages;
+ int prefree_count, call_count;
+ int tot_segs, node_segs, data_segs, free_segs, free_secs;
+ int tot_blks, data_blks, node_blks;
+ int curseg[NR_CURSEG_TYPE];
+ int cursec[NR_CURSEG_TYPE];
+ int curzone[NR_CURSEG_TYPE];
+
+ unsigned int segment_count[2];
+ unsigned int block_count[2];
+ unsigned base_mem, cache_mem;
+};
+
+#define stat_inc_call_count(si) ((si)->call_count++)
+
+#define stat_inc_seg_count(sbi, type) \
+ do { \
+ struct f2fs_stat_info *si = sbi->stat_info; \
+ (si)->tot_segs++; \
+ if (type == SUM_TYPE_DATA) \
+ si->data_segs++; \
+ else \
+ si->node_segs++; \
+ } while (0)
+
+#define stat_inc_tot_blk_count(si, blks) \
+ (si->tot_blks += (blks))
+
+#define stat_inc_data_blk_count(sbi, blks) \
+ do { \
+ struct f2fs_stat_info *si = sbi->stat_info; \
+ stat_inc_tot_blk_count(si, blks); \
+ si->data_blks += (blks); \
+ } while (0)
+
+#define stat_inc_node_blk_count(sbi, blks) \
+ do { \
+ struct f2fs_stat_info *si = sbi->stat_info; \
+ stat_inc_tot_blk_count(si, blks); \
+ si->node_blks += (blks); \
+ } while (0)
+
+int f2fs_build_stats(struct f2fs_sb_info *);
+void f2fs_destroy_stats(struct f2fs_sb_info *);
+void destroy_root_stats(void);
+#else
+#define stat_inc_call_count(si)
+#define stat_inc_seg_count(si, type)
+#define stat_inc_tot_blk_count(si, blks)
+#define stat_inc_data_blk_count(si, blks)
+#define stat_inc_node_blk_count(sbi, blks)
+
+static inline int f2fs_build_stats(struct f2fs_sb_info *sbi) { return 0; }
+static inline void f2fs_destroy_stats(struct f2fs_sb_info *sbi) { }
+static inline void destroy_root_stats(void) { }
+#endif
+
+extern const struct file_operations f2fs_dir_operations;
+extern const struct file_operations f2fs_file_operations;
+extern const struct inode_operations f2fs_file_inode_operations;
+extern const struct address_space_operations f2fs_dblock_aops;
+extern const struct address_space_operations f2fs_node_aops;
+extern const struct address_space_operations f2fs_meta_aops;
+extern const struct inode_operations f2fs_dir_inode_operations;
+extern const struct inode_operations f2fs_symlink_inode_operations;
+extern const struct inode_operations f2fs_special_inode_operations;
+#endif
diff --git a/fs/f2fs/node.h b/fs/f2fs/node.h
new file mode 100644
index 0000000..5d525ed
--- /dev/null
+++ b/fs/f2fs/node.h
@@ -0,0 +1,353 @@
+/**
+ * fs/f2fs/node.h
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+/* start node id of a node block dedicated to the given node id */
+#define START_NID(nid) ((nid / NAT_ENTRY_PER_BLOCK) * NAT_ENTRY_PER_BLOCK)
+
+/* node block offset on the NAT area dedicated to the given start node id */
+#define NAT_BLOCK_OFFSET(start_nid) (start_nid / NAT_ENTRY_PER_BLOCK)
+
+/* # of pages to perform readahead before building free nids */
+#define FREE_NID_PAGES 4
+
+/* maximum # of free node ids to produce during build_free_nids */
+#define MAX_FREE_NIDS (NAT_ENTRY_PER_BLOCK * FREE_NID_PAGES)
+
+/* maximum readahead size for node during getting data blocks */
+#define MAX_RA_NODE 128
+
+/* maximum cached nat entries to manage memory footprint */
+#define NM_WOUT_THRESHOLD (64 * NAT_ENTRY_PER_BLOCK)
+
+/* vector size for gang look-up from nat cache that consists of radix tree */
+#define NATVEC_SIZE 64
+
+/*
+ * For node information
+ */
+struct node_info {
+ nid_t nid; /* node id */
+ nid_t ino; /* inode number of the node's owner */
+ block_t blk_addr; /* block address of the node */
+ unsigned char version; /* version of the node */
+};
+
+struct nat_entry {
+ struct list_head list; /* for clean or dirty nat list */
+ bool checkpointed; /* whether it is checkpointed or not */
+ struct node_info ni; /* in-memory node information */
+};
+
+#define nat_get_nid(nat) (nat->ni.nid)
+#define nat_set_nid(nat, n) (nat->ni.nid = n)
+#define nat_get_blkaddr(nat) (nat->ni.blk_addr)
+#define nat_set_blkaddr(nat, b) (nat->ni.blk_addr = b)
+#define nat_get_ino(nat) (nat->ni.ino)
+#define nat_set_ino(nat, i) (nat->ni.ino = i)
+#define nat_get_version(nat) (nat->ni.version)
+#define nat_set_version(nat, v) (nat->ni.version = v)
+
+#define __set_nat_cache_dirty(nm_i, ne) \
+ list_move_tail(&ne->list, &nm_i->dirty_nat_entries);
+#define __clear_nat_cache_dirty(nm_i, ne) \
+ list_move_tail(&ne->list, &nm_i->nat_entries);
+#define inc_node_version(version) (++version)
+
+static inline void node_info_from_raw_nat(struct node_info *ni,
+ struct f2fs_nat_entry *raw_ne)
+{
+ ni->ino = le32_to_cpu(raw_ne->ino);
+ ni->blk_addr = le32_to_cpu(raw_ne->block_addr);
+ ni->version = raw_ne->version;
+}
+
+/*
+ * For free nid mangement
+ */
+enum nid_state {
+ NID_NEW, /* newly added to free nid list */
+ NID_ALLOC /* it is allocated */
+};
+
+struct free_nid {
+ struct list_head list; /* for free node id list */
+ nid_t nid; /* node id */
+ int state; /* in use or not: NID_NEW or NID_ALLOC */
+};
+
+static inline int next_free_nid(struct f2fs_sb_info *sbi, nid_t *nid)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+ struct free_nid *fnid;
+
+ if (nm_i->fcnt <= 0)
+ return -1;
+ spin_lock(&nm_i->free_nid_list_lock);
+ fnid = list_entry(nm_i->free_nid_list.next, struct free_nid, list);
+ *nid = fnid->nid;
+ spin_unlock(&nm_i->free_nid_list_lock);
+ return 0;
+}
+
+/*
+ * inline functions
+ */
+static inline void get_nat_bitmap(struct f2fs_sb_info *sbi, void *addr)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+ memcpy(addr, nm_i->nat_bitmap, nm_i->bitmap_size);
+}
+
+static inline pgoff_t current_nat_addr(struct f2fs_sb_info *sbi, nid_t start)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+ pgoff_t block_off;
+ pgoff_t block_addr;
+ int seg_off;
+
+ block_off = NAT_BLOCK_OFFSET(start);
+ seg_off = block_off >> sbi->log_blocks_per_seg;
+
+ block_addr = (pgoff_t)(nm_i->nat_blkaddr +
+ (seg_off << sbi->log_blocks_per_seg << 1) +
+ (block_off & ((1 << sbi->log_blocks_per_seg) - 1)));
+
+ if (f2fs_test_bit(block_off, nm_i->nat_bitmap))
+ block_addr += sbi->blocks_per_seg;
+
+ return block_addr;
+}
+
+static inline pgoff_t next_nat_addr(struct f2fs_sb_info *sbi,
+ pgoff_t block_addr)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+
+ block_addr -= nm_i->nat_blkaddr;
+ if ((block_addr >> sbi->log_blocks_per_seg) % 2)
+ block_addr -= sbi->blocks_per_seg;
+ else
+ block_addr += sbi->blocks_per_seg;
+
+ return block_addr + nm_i->nat_blkaddr;
+}
+
+static inline void set_to_next_nat(struct f2fs_nm_info *nm_i, nid_t start_nid)
+{
+ unsigned int block_off = NAT_BLOCK_OFFSET(start_nid);
+
+ if (f2fs_test_bit(block_off, nm_i->nat_bitmap))
+ f2fs_clear_bit(block_off, nm_i->nat_bitmap);
+ else
+ f2fs_set_bit(block_off, nm_i->nat_bitmap);
+}
+
+static inline void fill_node_footer(struct page *page, nid_t nid,
+ nid_t ino, unsigned int ofs, bool reset)
+{
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ if (reset)
+ memset(rn, 0, sizeof(*rn));
+ rn->footer.nid = cpu_to_le32(nid);
+ rn->footer.ino = cpu_to_le32(ino);
+ rn->footer.flag = cpu_to_le32(ofs << OFFSET_BIT_SHIFT);
+}
+
+static inline void copy_node_footer(struct page *dst, struct page *src)
+{
+ void *src_addr = page_address(src);
+ void *dst_addr = page_address(dst);
+ struct f2fs_node *src_rn = (struct f2fs_node *)src_addr;
+ struct f2fs_node *dst_rn = (struct f2fs_node *)dst_addr;
+ memcpy(&dst_rn->footer, &src_rn->footer, sizeof(struct node_footer));
+}
+
+static inline void fill_node_footer_blkaddr(struct page *page, block_t blkaddr)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(page->mapping->host->i_sb);
+ struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ rn->footer.cp_ver = ckpt->checkpoint_ver;
+ rn->footer.next_blkaddr = blkaddr;
+}
+
+static inline nid_t ino_of_node(struct page *node_page)
+{
+ void *kaddr = page_address(node_page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ return le32_to_cpu(rn->footer.ino);
+}
+
+static inline nid_t nid_of_node(struct page *node_page)
+{
+ void *kaddr = page_address(node_page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ return le32_to_cpu(rn->footer.nid);
+}
+
+static inline unsigned int ofs_of_node(struct page *node_page)
+{
+ void *kaddr = page_address(node_page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ unsigned flag = le32_to_cpu(rn->footer.flag);
+ return flag >> OFFSET_BIT_SHIFT;
+}
+
+static inline unsigned long long cpver_of_node(struct page *node_page)
+{
+ void *kaddr = page_address(node_page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ return le64_to_cpu(rn->footer.cp_ver);
+}
+
+static inline block_t next_blkaddr_of_node(struct page *node_page)
+{
+ void *kaddr = page_address(node_page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ return le32_to_cpu(rn->footer.next_blkaddr);
+}
+
+/*
+ * f2fs assigns the following node offsets described as (num).
+ * N = NIDS_PER_BLOCK
+ *
+ * Inode block (0)
+ * |- direct node (1)
+ * |- direct node (2)
+ * |- indirect node (3)
+ * | `- direct node (4 => 4 + N - 1)
+ * |- indirect node (4 + N)
+ * | `- direct node (5 + N => 5 + 2N - 1)
+ * `- double indirect node (5 + 2N)
+ * `- indirect node (6 + 2N)
+ * `- direct node (x(N + 1))
+ */
+static inline bool IS_DNODE(struct page *node_page)
+{
+ unsigned int ofs = ofs_of_node(node_page);
+ if (ofs == 3 || ofs == 4 + NIDS_PER_BLOCK ||
+ ofs == 5 + 2 * NIDS_PER_BLOCK)
+ return false;
+ if (ofs >= 6 + 2 * NIDS_PER_BLOCK) {
+ ofs -= 6 + 2 * NIDS_PER_BLOCK;
+ if ((long int)ofs % (NIDS_PER_BLOCK + 1))
+ return false;
+ }
+ return true;
+}
+
+static inline void set_nid(struct page *p, int off, nid_t nid, bool i)
+{
+ struct f2fs_node *rn = (struct f2fs_node *)page_address(p);
+
+ wait_on_page_writeback(p);
+
+ if (i)
+ rn->i.i_nid[off - NODE_DIR1_BLOCK] = cpu_to_le32(nid);
+ else
+ rn->in.nid[off] = cpu_to_le32(nid);
+ set_page_dirty(p);
+}
+
+static inline nid_t get_nid(struct page *p, int off, bool i)
+{
+ struct f2fs_node *rn = (struct f2fs_node *)page_address(p);
+ if (i)
+ return le32_to_cpu(rn->i.i_nid[off - NODE_DIR1_BLOCK]);
+ return le32_to_cpu(rn->in.nid[off]);
+}
+
+/*
+ * Coldness identification:
+ * - Mark cold files in f2fs_inode_info
+ * - Mark cold node blocks in their node footer
+ * - Mark cold data pages in page cache
+ */
+static inline int is_cold_file(struct inode *inode)
+{
+ return F2FS_I(inode)->i_advise & FADVISE_COLD_BIT;
+}
+
+static inline int is_cold_data(struct page *page)
+{
+ return PageChecked(page);
+}
+
+static inline void set_cold_data(struct page *page)
+{
+ SetPageChecked(page);
+}
+
+static inline void clear_cold_data(struct page *page)
+{
+ ClearPageChecked(page);
+}
+
+static inline int is_cold_node(struct page *page)
+{
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ unsigned int flag = le32_to_cpu(rn->footer.flag);
+ return flag & (0x1 << COLD_BIT_SHIFT);
+}
+
+static inline unsigned char is_fsync_dnode(struct page *page)
+{
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ unsigned int flag = le32_to_cpu(rn->footer.flag);
+ return flag & (0x1 << FSYNC_BIT_SHIFT);
+}
+
+static inline unsigned char is_dent_dnode(struct page *page)
+{
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ unsigned int flag = le32_to_cpu(rn->footer.flag);
+ return flag & (0x1 << DENT_BIT_SHIFT);
+}
+
+static inline void set_cold_node(struct inode *inode, struct page *page)
+{
+ struct f2fs_node *rn = (struct f2fs_node *)page_address(page);
+ unsigned int flag = le32_to_cpu(rn->footer.flag);
+
+ if (S_ISDIR(inode->i_mode))
+ flag &= ~(0x1 << COLD_BIT_SHIFT);
+ else
+ flag |= (0x1 << COLD_BIT_SHIFT);
+ rn->footer.flag = cpu_to_le32(flag);
+}
+
+static inline void set_fsync_mark(struct page *page, int mark)
+{
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ unsigned int flag = le32_to_cpu(rn->footer.flag);
+ if (mark)
+ flag |= (0x1 << FSYNC_BIT_SHIFT);
+ else
+ flag &= ~(0x1 << FSYNC_BIT_SHIFT);
+ rn->footer.flag = cpu_to_le32(flag);
+}
+
+static inline void set_dentry_mark(struct page *page, int mark)
+{
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ unsigned int flag = le32_to_cpu(rn->footer.flag);
+ if (mark)
+ flag |= (0x1 << DENT_BIT_SHIFT);
+ else
+ flag &= ~(0x1 << DENT_BIT_SHIFT);
+ rn->footer.flag = cpu_to_le32(flag);
+}
diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h
new file mode 100644
index 0000000..6d19f65
--- /dev/null
+++ b/fs/f2fs/segment.h
@@ -0,0 +1,615 @@
+/**
+ * fs/f2fs/segment.h
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+/* constant macro */
+#define NULL_SEGNO ((unsigned int)(~0))
+
+/* V: Logical segment # in volume, R: Relative segment # in main area */
+#define GET_L2R_SEGNO(free_i, segno) (segno - free_i->start_segno)
+#define GET_R2L_SEGNO(free_i, segno) (segno + free_i->start_segno)
+
+#define IS_DATASEG(t) \
+ ((t == CURSEG_HOT_DATA) || (t == CURSEG_COLD_DATA) || \
+ (t == CURSEG_WARM_DATA))
+
+#define IS_NODESEG(t) \
+ ((t == CURSEG_HOT_NODE) || (t == CURSEG_COLD_NODE) || \
+ (t == CURSEG_WARM_NODE))
+
+#define IS_CURSEG(sbi, segno) \
+ ((segno == CURSEG_I(sbi, CURSEG_HOT_DATA)->segno) || \
+ (segno == CURSEG_I(sbi, CURSEG_WARM_DATA)->segno) || \
+ (segno == CURSEG_I(sbi, CURSEG_COLD_DATA)->segno) || \
+ (segno == CURSEG_I(sbi, CURSEG_HOT_NODE)->segno) || \
+ (segno == CURSEG_I(sbi, CURSEG_WARM_NODE)->segno) || \
+ (segno == CURSEG_I(sbi, CURSEG_COLD_NODE)->segno))
+
+#define IS_CURSEC(sbi, secno) \
+ ((secno == CURSEG_I(sbi, CURSEG_HOT_DATA)->segno / \
+ sbi->segs_per_sec) || \
+ (secno == CURSEG_I(sbi, CURSEG_WARM_DATA)->segno / \
+ sbi->segs_per_sec) || \
+ (secno == CURSEG_I(sbi, CURSEG_COLD_DATA)->segno / \
+ sbi->segs_per_sec) || \
+ (secno == CURSEG_I(sbi, CURSEG_HOT_NODE)->segno / \
+ sbi->segs_per_sec) || \
+ (secno == CURSEG_I(sbi, CURSEG_WARM_NODE)->segno / \
+ sbi->segs_per_sec) || \
+ (secno == CURSEG_I(sbi, CURSEG_COLD_NODE)->segno / \
+ sbi->segs_per_sec)) \
+
+#define START_BLOCK(sbi, segno) \
+ (SM_I(sbi)->seg0_blkaddr + \
+ (GET_R2L_SEGNO(FREE_I(sbi), segno) << sbi->log_blocks_per_seg))
+#define NEXT_FREE_BLKADDR(sbi, curseg) \
+ (START_BLOCK(sbi, curseg->segno) + curseg->next_blkoff)
+
+#define MAIN_BASE_BLOCK(sbi) (SM_I(sbi)->main_blkaddr)
+
+#define GET_SEGOFF_FROM_SEG0(sbi, blk_addr) \
+ ((blk_addr) - SM_I(sbi)->seg0_blkaddr)
+#define GET_SEGNO_FROM_SEG0(sbi, blk_addr) \
+ (GET_SEGOFF_FROM_SEG0(sbi, blk_addr) >> sbi->log_blocks_per_seg)
+#define GET_SEGNO(sbi, blk_addr) \
+ (((blk_addr == NULL_ADDR) || (blk_addr == NEW_ADDR)) ? \
+ NULL_SEGNO : GET_L2R_SEGNO(FREE_I(sbi), \
+ GET_SEGNO_FROM_SEG0(sbi, blk_addr)))
+#define GET_SECNO(sbi, segno) \
+ ((segno) / sbi->segs_per_sec)
+#define GET_ZONENO_FROM_SEGNO(sbi, segno) \
+ ((segno / sbi->segs_per_sec) / sbi->secs_per_zone)
+
+#define GET_SUM_BLOCK(sbi, segno) \
+ ((sbi->sm_info->ssa_blkaddr) + segno)
+
+#define GET_SUM_TYPE(footer) ((footer)->entry_type)
+#define SET_SUM_TYPE(footer, type) ((footer)->entry_type = type)
+
+#define SIT_ENTRY_OFFSET(sit_i, segno) \
+ (segno % sit_i->sents_per_block)
+#define SIT_BLOCK_OFFSET(sit_i, segno) \
+ (segno / SIT_ENTRY_PER_BLOCK)
+#define START_SEGNO(sit_i, segno) \
+ (SIT_BLOCK_OFFSET(sit_i, segno) * SIT_ENTRY_PER_BLOCK)
+#define f2fs_bitmap_size(nr) \
+ (BITS_TO_LONGS(nr) * sizeof(unsigned long))
+#define TOTAL_SEGS(sbi) (SM_I(sbi)->main_segments)
+
+/* during checkpoint, bio_private is used to synchronize the last bio */
+struct bio_private {
+ struct f2fs_sb_info *sbi;
+ bool is_sync;
+ void *wait;
+};
+
+/*
+ * indicate a block allocation direction: RIGHT and LEFT.
+ * RIGHT means allocating new sections towards the end of volume.
+ * LEFT means the opposite direction.
+ */
+enum {
+ ALLOC_RIGHT = 0,
+ ALLOC_LEFT
+};
+
+/*
+ * In the victim_sel_policy->alloc_mode, there are two block allocation modes.
+ * LFS writes data sequentially with cleaning operations.
+ * SSR (Slack Space Recycle) reuses obsolete space without cleaning operations.
+ */
+enum {
+ LFS = 0,
+ SSR
+};
+
+/*
+ * In the victim_sel_policy->gc_mode, there are two gc, aka cleaning, modes.
+ * GC_CB is based on cost-benefit algorithm.
+ * GC_GREEDY is based on greedy algorithm.
+ */
+enum {
+ GC_CB = 0,
+ GC_GREEDY
+};
+
+/*
+ * BG_GC means the background cleaning job.
+ * FG_GC means the on-demand cleaning job.
+ */
+enum {
+ BG_GC = 0,
+ FG_GC
+};
+
+/* for a function parameter to select a victim segment */
+struct victim_sel_policy {
+ int alloc_mode; /* LFS or SSR */
+ int gc_mode; /* GC_CB or GC_GREEDY */
+ unsigned long *dirty_segmap; /* dirty segment bitmap */
+ unsigned int offset; /* last scanned bitmap offset */
+ unsigned int ofs_unit; /* bitmap search unit */
+ unsigned int min_cost; /* minimum cost */
+ unsigned int min_segno; /* segment # having min. cost */
+};
+
+struct seg_entry {
+ unsigned short valid_blocks; /* # of valid blocks */
+ unsigned char *cur_valid_map; /* validity bitmap of blocks */
+ /*
+ * # of valid blocks and the validity bitmap stored in the the last
+ * checkpoint pack. This information is used by the SSR mode.
+ */
+ unsigned short ckpt_valid_blocks;
+ unsigned char *ckpt_valid_map;
+ unsigned char type; /* segment type such as CURSEG_XXX_TYPE */
+ unsigned long long mtime; /* modification time of the segment */
+};
+
+struct sec_entry {
+ unsigned int valid_blocks; /* # of valid blocks in a section */
+};
+
+struct segment_allocation {
+ void (*allocate_segment)(struct f2fs_sb_info *, int, bool);
+};
+
+struct sit_info {
+ const struct segment_allocation *s_ops;
+
+ block_t sit_base_addr; /* start block address of SIT area */
+ block_t sit_blocks; /* # of blocks used by SIT area */
+ block_t written_valid_blocks; /* # of valid blocks in main area */
+ char *sit_bitmap; /* SIT bitmap pointer */
+ unsigned int bitmap_size; /* SIT bitmap size */
+
+ unsigned long *dirty_sentries_bitmap; /* bitmap for dirty sentries */
+ unsigned int dirty_sentries; /* # of dirty sentries */
+ unsigned int sents_per_block; /* # of SIT entries per block */
+ struct mutex sentry_lock; /* to protect SIT cache */
+ struct seg_entry *sentries; /* SIT segment-level cache */
+ struct sec_entry *sec_entries; /* SIT section-level cache */
+
+ /* for cost-benefit algorithm in cleaning procedure */
+ unsigned long long elapsed_time; /* elapsed time after mount */
+ unsigned long long mounted_time; /* mount time */
+ unsigned long long min_mtime; /* min. modification time */
+ unsigned long long max_mtime; /* max. modification time */
+};
+
+struct free_segmap_info {
+ unsigned int start_segno; /* start segment number logically */
+ unsigned int free_segments; /* # of free segments */
+ unsigned int free_sections; /* # of free sections */
+ rwlock_t segmap_lock; /* free segmap lock */
+ unsigned long *free_segmap; /* free segment bitmap */
+ unsigned long *free_secmap; /* free section bitmap */
+};
+
+/* Notice: The order of dirty type is same with CURSEG_XXX in f2fs.h */
+enum dirty_type {
+ DIRTY_HOT_DATA, /* dirty segments assigned as hot data logs */
+ DIRTY_WARM_DATA, /* dirty segments assigned as warm data logs */
+ DIRTY_COLD_DATA, /* dirty segments assigned as cold data logs */
+ DIRTY_HOT_NODE, /* dirty segments assigned as hot node logs */
+ DIRTY_WARM_NODE, /* dirty segments assigned as warm node logs */
+ DIRTY_COLD_NODE, /* dirty segments assigned as cold node logs */
+ DIRTY, /* to count # of dirty segments */
+ PRE, /* to count # of entirely obsolete segments */
+ NR_DIRTY_TYPE
+};
+
+struct dirty_seglist_info {
+ const struct victim_selection *v_ops; /* victim selction operation */
+ unsigned long *dirty_segmap[NR_DIRTY_TYPE];
+ struct mutex seglist_lock; /* lock for segment bitmaps */
+ int nr_dirty[NR_DIRTY_TYPE]; /* # of dirty segments */
+ unsigned long *victim_segmap[2]; /* BG_GC, FG_GC */
+};
+
+/* victim selection function for cleaning and SSR */
+struct victim_selection {
+ int (*get_victim)(struct f2fs_sb_info *, unsigned int *,
+ int, int, char);
+};
+
+/* for active log information */
+struct curseg_info {
+ struct mutex curseg_mutex; /* lock for consistency */
+ struct f2fs_summary_block *sum_blk; /* cached summary block */
+ unsigned char alloc_type; /* current allocation type */
+ unsigned int segno; /* current segment number */
+ unsigned short next_blkoff; /* next block offset to write */
+ unsigned int zone; /* current zone number */
+ unsigned int next_segno; /* preallocated segment */
+};
+
+/*
+ * inline functions
+ */
+static inline struct curseg_info *CURSEG_I(struct f2fs_sb_info *sbi, int type)
+{
+ return (struct curseg_info *)(SM_I(sbi)->curseg_array + type);
+}
+
+static inline struct seg_entry *get_seg_entry(struct f2fs_sb_info *sbi,
+ unsigned int segno)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ return &sit_i->sentries[segno];
+}
+
+static inline struct sec_entry *get_sec_entry(struct f2fs_sb_info *sbi,
+ unsigned int segno)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ return &sit_i->sec_entries[GET_SECNO(sbi, segno)];
+}
+
+static inline unsigned int get_valid_blocks(struct f2fs_sb_info *sbi,
+ unsigned int segno, int section)
+{
+ /*
+ * In order to get # of valid blocks in a section instantly from many
+ * segments, f2fs manages two counting structures separately.
+ */
+ if (section > 1)
+ return get_sec_entry(sbi, segno)->valid_blocks;
+ else
+ return get_seg_entry(sbi, segno)->valid_blocks;
+}
+
+static inline void seg_info_from_raw_sit(struct seg_entry *se,
+ struct f2fs_sit_entry *rs)
+{
+ se->valid_blocks = GET_SIT_VBLOCKS(rs);
+ se->ckpt_valid_blocks = GET_SIT_VBLOCKS(rs);
+ memcpy(se->cur_valid_map, rs->valid_map, SIT_VBLOCK_MAP_SIZE);
+ memcpy(se->ckpt_valid_map, rs->valid_map, SIT_VBLOCK_MAP_SIZE);
+ se->type = GET_SIT_TYPE(rs);
+ se->mtime = le64_to_cpu(rs->mtime);
+}
+
+static inline void seg_info_to_raw_sit(struct seg_entry *se,
+ struct f2fs_sit_entry *rs)
+{
+ unsigned short raw_vblocks = (se->type << SIT_VBLOCKS_SHIFT) |
+ se->valid_blocks;
+ rs->vblocks = cpu_to_le16(raw_vblocks);
+ memcpy(rs->valid_map, se->cur_valid_map, SIT_VBLOCK_MAP_SIZE);
+ memcpy(se->ckpt_valid_map, rs->valid_map, SIT_VBLOCK_MAP_SIZE);
+ se->ckpt_valid_blocks = se->valid_blocks;
+ rs->mtime = cpu_to_le64(se->mtime);
+}
+
+static inline unsigned int find_next_inuse(struct free_segmap_info *free_i,
+ unsigned int max, unsigned int segno)
+{
+ unsigned int ret;
+ read_lock(&free_i->segmap_lock);
+ ret = find_next_bit(free_i->free_segmap, max, segno);
+ read_unlock(&free_i->segmap_lock);
+ return ret;
+}
+
+static inline void __set_free(struct f2fs_sb_info *sbi, unsigned int segno)
+{
+ struct free_segmap_info *free_i = FREE_I(sbi);
+ unsigned int secno = segno / sbi->segs_per_sec;
+ unsigned int start_segno = secno * sbi->segs_per_sec;
+ unsigned int next;
+
+ write_lock(&free_i->segmap_lock);
+ clear_bit(segno, free_i->free_segmap);
+ free_i->free_segments++;
+
+ next = find_next_bit(free_i->free_segmap, TOTAL_SEGS(sbi), start_segno);
+ if (next >= start_segno + sbi->segs_per_sec) {
+ clear_bit(secno, free_i->free_secmap);
+ free_i->free_sections++;
+ }
+ write_unlock(&free_i->segmap_lock);
+}
+
+static inline void __set_inuse(struct f2fs_sb_info *sbi,
+ unsigned int segno)
+{
+ struct free_segmap_info *free_i = FREE_I(sbi);
+ unsigned int secno = segno / sbi->segs_per_sec;
+ set_bit(segno, free_i->free_segmap);
+ free_i->free_segments--;
+ if (!test_and_set_bit(secno, free_i->free_secmap))
+ free_i->free_sections--;
+}
+
+static inline void __set_test_and_free(struct f2fs_sb_info *sbi,
+ unsigned int segno)
+{
+ struct free_segmap_info *free_i = FREE_I(sbi);
+ unsigned int secno = segno / sbi->segs_per_sec;
+ unsigned int start_segno = secno * sbi->segs_per_sec;
+ unsigned int next;
+
+ write_lock(&free_i->segmap_lock);
+ if (test_and_clear_bit(segno, free_i->free_segmap)) {
+ free_i->free_segments++;
+
+ next = find_next_bit(free_i->free_segmap, TOTAL_SEGS(sbi),
+ start_segno);
+ if (next >= start_segno + sbi->segs_per_sec) {
+ if (test_and_clear_bit(secno, free_i->free_secmap))
+ free_i->free_sections++;
+ }
+ }
+ write_unlock(&free_i->segmap_lock);
+}
+
+static inline void __set_test_and_inuse(struct f2fs_sb_info *sbi,
+ unsigned int segno)
+{
+ struct free_segmap_info *free_i = FREE_I(sbi);
+ unsigned int secno = segno / sbi->segs_per_sec;
+ write_lock(&free_i->segmap_lock);
+ if (!test_and_set_bit(segno, free_i->free_segmap)) {
+ free_i->free_segments--;
+ if (!test_and_set_bit(secno, free_i->free_secmap))
+ free_i->free_sections--;
+ }
+ write_unlock(&free_i->segmap_lock);
+}
+
+static inline void get_sit_bitmap(struct f2fs_sb_info *sbi,
+ void *dst_addr)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ memcpy(dst_addr, sit_i->sit_bitmap, sit_i->bitmap_size);
+}
+
+static inline block_t written_block_count(struct f2fs_sb_info *sbi)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ block_t vblocks;
+
+ mutex_lock(&sit_i->sentry_lock);
+ vblocks = sit_i->written_valid_blocks;
+ mutex_unlock(&sit_i->sentry_lock);
+
+ return vblocks;
+}
+
+static inline unsigned int free_segments(struct f2fs_sb_info *sbi)
+{
+ struct free_segmap_info *free_i = FREE_I(sbi);
+ unsigned int free_segs;
+
+ read_lock(&free_i->segmap_lock);
+ free_segs = free_i->free_segments;
+ read_unlock(&free_i->segmap_lock);
+
+ return free_segs;
+}
+
+static inline int reserved_segments(struct f2fs_sb_info *sbi)
+{
+ return SM_I(sbi)->reserved_segments;
+}
+
+static inline unsigned int free_sections(struct f2fs_sb_info *sbi)
+{
+ struct free_segmap_info *free_i = FREE_I(sbi);
+ unsigned int free_secs;
+
+ read_lock(&free_i->segmap_lock);
+ free_secs = free_i->free_sections;
+ read_unlock(&free_i->segmap_lock);
+
+ return free_secs;
+}
+
+static inline unsigned int prefree_segments(struct f2fs_sb_info *sbi)
+{
+ return DIRTY_I(sbi)->nr_dirty[PRE];
+}
+
+static inline unsigned int dirty_segments(struct f2fs_sb_info *sbi)
+{
+ return DIRTY_I(sbi)->nr_dirty[DIRTY_HOT_DATA] +
+ DIRTY_I(sbi)->nr_dirty[DIRTY_WARM_DATA] +
+ DIRTY_I(sbi)->nr_dirty[DIRTY_COLD_DATA] +
+ DIRTY_I(sbi)->nr_dirty[DIRTY_HOT_NODE] +
+ DIRTY_I(sbi)->nr_dirty[DIRTY_WARM_NODE] +
+ DIRTY_I(sbi)->nr_dirty[DIRTY_COLD_NODE];
+}
+
+static inline int overprovision_segments(struct f2fs_sb_info *sbi)
+{
+ return SM_I(sbi)->ovp_segments;
+}
+
+static inline int overprovision_sections(struct f2fs_sb_info *sbi)
+{
+ return ((unsigned int) overprovision_segments(sbi)) / sbi->segs_per_sec;
+}
+
+static inline int reserved_sections(struct f2fs_sb_info *sbi)
+{
+ return ((unsigned int) reserved_segments(sbi)) / sbi->segs_per_sec;
+}
+
+static inline bool need_SSR(struct f2fs_sb_info *sbi)
+{
+ return (free_sections(sbi) < overprovision_sections(sbi));
+}
+
+static inline int get_ssr_segment(struct f2fs_sb_info *sbi, int type)
+{
+ struct curseg_info *curseg = CURSEG_I(sbi, type);
+ return DIRTY_I(sbi)->v_ops->get_victim(sbi,
+ &(curseg)->next_segno, BG_GC, type, SSR);
+}
+
+static inline bool has_not_enough_free_secs(struct f2fs_sb_info *sbi)
+{
+ return free_sections(sbi) <= reserved_sections(sbi);
+}
+
+static inline int utilization(struct f2fs_sb_info *sbi)
+{
+ return (long int)valid_user_blocks(sbi) * 100 /
+ (long int)sbi->user_block_count;
+}
+
+/*
+ * Sometimes f2fs may be better to drop out-of-place update policy.
+ * So, if fs utilization is over MIN_IPU_UTIL, then f2fs tries to write
+ * data in the original place likewise other traditional file systems.
+ * But, currently set 100 in percentage, which means it is disabled.
+ * See below need_inplace_update().
+ */
+#define MIN_IPU_UTIL 100
+static inline bool need_inplace_update(struct inode *inode)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ if (S_ISDIR(inode->i_mode))
+ return false;
+ if (need_SSR(sbi) && utilization(sbi) > MIN_IPU_UTIL)
+ return true;
+ return false;
+}
+
+static inline unsigned int curseg_segno(struct f2fs_sb_info *sbi,
+ int type)
+{
+ struct curseg_info *curseg = CURSEG_I(sbi, type);
+ return curseg->segno;
+}
+
+static inline unsigned char curseg_alloc_type(struct f2fs_sb_info *sbi,
+ int type)
+{
+ struct curseg_info *curseg = CURSEG_I(sbi, type);
+ return curseg->alloc_type;
+}
+
+static inline unsigned short curseg_blkoff(struct f2fs_sb_info *sbi, int type)
+{
+ struct curseg_info *curseg = CURSEG_I(sbi, type);
+ return curseg->next_blkoff;
+}
+
+static inline void check_seg_range(struct f2fs_sb_info *sbi, unsigned int segno)
+{
+ unsigned int end_segno = SM_I(sbi)->segment_count - 1;
+ BUG_ON(segno > end_segno);
+}
+
+/*
+ * This function is used for only debugging.
+ * NOTE: In future, we have to remove this function.
+ */
+static inline void verify_block_addr(struct f2fs_sb_info *sbi, block_t blk_addr)
+{
+ struct f2fs_sm_info *sm_info = SM_I(sbi);
+ block_t total_blks = sm_info->segment_count << sbi->log_blocks_per_seg;
+ block_t start_addr = sm_info->seg0_blkaddr;
+ block_t end_addr = start_addr + total_blks - 1;
+ BUG_ON(blk_addr < start_addr);
+ BUG_ON(blk_addr > end_addr);
+}
+
+/*
+ * Summary block is always treated as invalid block
+ */
+static inline void check_block_count(struct f2fs_sb_info *sbi,
+ int segno, struct f2fs_sit_entry *raw_sit)
+{
+ struct f2fs_sm_info *sm_info = SM_I(sbi);
+ unsigned int end_segno = sm_info->segment_count - 1;
+ int valid_blocks = 0;
+ int i;
+
+ /* check segment usage */
+ BUG_ON(GET_SIT_VBLOCKS(raw_sit) > sbi->blocks_per_seg);
+
+ /* check boundary of a given segment number */
+ BUG_ON(segno > end_segno);
+
+ /* check bitmap with valid block count */
+ for (i = 0; i < sbi->blocks_per_seg; i++)
+ if (f2fs_test_bit(i, raw_sit->valid_map))
+ valid_blocks++;
+ BUG_ON(GET_SIT_VBLOCKS(raw_sit) != valid_blocks);
+}
+
+static inline pgoff_t current_sit_addr(struct f2fs_sb_info *sbi,
+ unsigned int start)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ unsigned int offset = SIT_BLOCK_OFFSET(sit_i, start);
+ block_t blk_addr = sit_i->sit_base_addr + offset;
+
+ check_seg_range(sbi, start);
+
+ /* calculate sit block address */
+ if (f2fs_test_bit(offset, sit_i->sit_bitmap))
+ blk_addr += sit_i->sit_blocks;
+
+ return blk_addr;
+}
+
+static inline pgoff_t next_sit_addr(struct f2fs_sb_info *sbi,
+ pgoff_t block_addr)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ block_addr -= sit_i->sit_base_addr;
+ if (block_addr < sit_i->sit_blocks)
+ block_addr += sit_i->sit_blocks;
+ else
+ block_addr -= sit_i->sit_blocks;
+
+ return block_addr + sit_i->sit_base_addr;
+}
+
+static inline void set_to_next_sit(struct sit_info *sit_i, unsigned int start)
+{
+ unsigned int block_off = SIT_BLOCK_OFFSET(sit_i, start);
+
+ if (f2fs_test_bit(block_off, sit_i->sit_bitmap))
+ f2fs_clear_bit(block_off, sit_i->sit_bitmap);
+ else
+ f2fs_set_bit(block_off, sit_i->sit_bitmap);
+}
+
+static inline unsigned long long get_mtime(struct f2fs_sb_info *sbi)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ return sit_i->elapsed_time + CURRENT_TIME_SEC.tv_sec -
+ sit_i->mounted_time;
+}
+
+static inline void set_summary(struct f2fs_summary *sum, nid_t nid,
+ unsigned int ofs_in_node, unsigned char version)
+{
+ sum->nid = cpu_to_le32(nid);
+ sum->ofs_in_node = cpu_to_le16(ofs_in_node);
+ sum->version = version;
+}
+
+static inline block_t start_sum_block(struct f2fs_sb_info *sbi)
+{
+ return __start_cp_addr(sbi) +
+ le32_to_cpu(F2FS_CKPT(sbi)->cp_pack_start_sum);
+}
+
+static inline block_t sum_blk_addr(struct f2fs_sb_info *sbi, int base, int type)
+{
+ return __start_cp_addr(sbi) +
+ le32_to_cpu(F2FS_CKPT(sbi)->cp_pack_total_block_count)
+ - (base + 1) + type;
+}
--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-10-31 09:42:45

[permalink] [raw]

Subject: [PATCH 04/17] f2fs: add super block operations

This adds the implementation of superblock operations for f2fs, which includes
- init_f2fs_fs/exit_f2fs_fs
- f2fs_mount
- super_operations of f2fs

Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/super.c | 656 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 656 insertions(+)
create mode 100644 fs/f2fs/super.c

diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
new file mode 100644
index 0000000..8661c93
--- /dev/null
+++ b/fs/f2fs/super.c
@@ -0,0 +1,656 @@
+/**
+ * fs/f2fs/super.c
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/statfs.h>
+#include <linux/proc_fs.h>
+#include <linux/buffer_head.h>
+#include <linux/backing-dev.h>
+#include <linux/kthread.h>
+#include <linux/parser.h>
+#include <linux/mount.h>
+#include <linux/seq_file.h>
+#include <linux/random.h>
+#include <linux/exportfs.h>
+#include <linux/f2fs_fs.h>
+
+#include "f2fs.h"
+#include "node.h"
+#include "xattr.h"
+
+static struct kmem_cache *f2fs_inode_cachep;
+
+enum {
+ Opt_gc_background_off,
+ Opt_disable_roll_forward,
+ Opt_discard,
+ Opt_noheap,
+ Opt_nouser_xattr,
+ Opt_noacl,
+ Opt_active_logs,
+ Opt_disable_ext_identify,
+ Opt_err,
+};
+
+static match_table_t f2fs_tokens = {
+ {Opt_gc_background_off, "background_gc_off"},
+ {Opt_disable_roll_forward, "disable_roll_forward"},
+ {Opt_discard, "discard"},
+ {Opt_noheap, "no_heap"},
+ {Opt_nouser_xattr, "nouser_xattr"},
+ {Opt_noacl, "noacl"},
+ {Opt_active_logs, "active_logs=%u"},
+ {Opt_disable_ext_identify, "disable_ext_identify"},
+ {Opt_err, NULL},
+};
+
+static void init_once(void *foo)
+{
+ struct f2fs_inode_info *fi = (struct f2fs_inode_info *) foo;
+
+ memset(fi, 0, sizeof(*fi));
+ inode_init_once(&fi->vfs_inode);
+}
+
+static struct inode *f2fs_alloc_inode(struct super_block *sb)
+{
+ struct f2fs_inode_info *fi;
+
+ fi = kmem_cache_alloc(f2fs_inode_cachep, GFP_NOFS | __GFP_ZERO);
+ if (!fi)
+ return NULL;
+
+ init_once((void *) fi);
+
+ /* Initilize f2fs-specific inode info */
+ fi->vfs_inode.i_version = 1;
+ atomic_set(&fi->dirty_dents, 0);
+ fi->i_current_depth = 1;
+ fi->i_advise = 0;
+ rwlock_init(&fi->ext.ext_lock);
+
+ set_inode_flag(fi, FI_NEW_INODE);
+
+ return &fi->vfs_inode;
+}
+
+static void f2fs_i_callback(struct rcu_head *head)
+{
+ struct inode *inode = container_of(head, struct inode, i_rcu);
+ kmem_cache_free(f2fs_inode_cachep, F2FS_I(inode));
+}
+
+void f2fs_destroy_inode(struct inode *inode)
+{
+ call_rcu(&inode->i_rcu, f2fs_i_callback);
+}
+
+static void f2fs_put_super(struct super_block *sb)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(sb);
+
+ f2fs_destroy_stats(sbi);
+ stop_gc_thread(sbi);
+
+ write_checkpoint(sbi, false, true);
+
+ iput(sbi->node_inode);
+ iput(sbi->meta_inode);
+
+ /* destroy f2fs internal modules */
+ destroy_node_manager(sbi);
+ destroy_segment_manager(sbi);
+
+ kfree(sbi->ckpt);
+
+ sb->s_fs_info = NULL;
+ brelse(sbi->raw_super_buf);
+ kfree(sbi);
+}
+
+int f2fs_sync_fs(struct super_block *sb, int sync)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(sb);
+ int ret = 0;
+
+ if (!sbi->s_dirty && !get_pages(sbi, F2FS_DIRTY_NODES))
+ return 0;
+
+ if (sync)
+ write_checkpoint(sbi, false, false);
+
+ return ret;
+}
+
+static int f2fs_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+ struct super_block *sb = dentry->d_sb;
+ struct f2fs_sb_info *sbi = F2FS_SB(sb);
+ u64 id = huge_encode_dev(sb->s_bdev->bd_dev);
+ block_t total_count, user_block_count, start_count, ovp_count;
+
+ total_count = le64_to_cpu(sbi->raw_super->block_count);
+ user_block_count = sbi->user_block_count;
+ start_count = le32_to_cpu(sbi->raw_super->segment0_blkaddr);
+ ovp_count = SM_I(sbi)->ovp_segments << sbi->log_blocks_per_seg;
+ buf->f_type = F2FS_SUPER_MAGIC;
+ buf->f_bsize = sbi->blocksize;
+
+ buf->f_blocks = total_count - start_count;
+ buf->f_bfree = buf->f_blocks - valid_user_blocks(sbi) - ovp_count;
+ buf->f_bavail = user_block_count - valid_user_blocks(sbi);
+
+ buf->f_files = valid_inode_count(sbi);
+ buf->f_ffree = sbi->total_node_count - valid_node_count(sbi);
+
+ buf->f_namelen = F2FS_MAX_NAME_LEN;
+ buf->f_fsid.val[0] = (u32)id;
+ buf->f_fsid.val[1] = (u32)(id >> 32);
+
+ return 0;
+}
+
+static int f2fs_show_options(struct seq_file *seq, struct dentry *root)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(root->d_sb);
+
+ if (test_opt(sbi, BG_GC))
+ seq_puts(seq, ",background_gc_on");
+ else
+ seq_puts(seq, ",background_gc_off");
+ if (test_opt(sbi, DISABLE_ROLL_FORWARD))
+ seq_puts(seq, ",disable_roll_forward");
+ if (test_opt(sbi, DISCARD))
+ seq_puts(seq, ",discard");
+ if (test_opt(sbi, NOHEAP))
+ seq_puts(seq, ",no_heap_alloc");
+#ifdef CONFIG_F2FS_FS_XATTR
+ if (test_opt(sbi, XATTR_USER))
+ seq_puts(seq, ",user_xattr");
+ else
+ seq_puts(seq, ",nouser_xattr");
+#endif
+#ifdef CONFIG_F2FS_FS_POSIX_ACL
+ if (test_opt(sbi, POSIX_ACL))
+ seq_puts(seq, ",acl");
+ else
+ seq_puts(seq, ",noacl");
+#endif
+ if (test_opt(sbi, DISABLE_EXT_IDENTIFY))
+ seq_puts(seq, ",disable_ext_indentify");
+
+ seq_printf(seq, ",active_logs=%u", sbi->active_logs);
+
+ return 0;
+}
+
+static struct super_operations f2fs_sops = {
+ .alloc_inode = f2fs_alloc_inode,
+ .destroy_inode = f2fs_destroy_inode,
+ .write_inode = f2fs_write_inode,
+ .show_options = f2fs_show_options,
+ .evict_inode = f2fs_evict_inode,
+ .put_super = f2fs_put_super,
+ .sync_fs = f2fs_sync_fs,
+ .statfs = f2fs_statfs,
+};
+
+static struct inode *f2fs_nfs_get_inode(struct super_block *sb,
+ u64 ino, u32 generation)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(sb);
+ struct inode *inode;
+
+ if (ino < F2FS_ROOT_INO(sbi))
+ return ERR_PTR(-ESTALE);
+
+ /*
+ * f2fs_iget isn't quite right if the inode is currently unallocated!
+ * However f2fs_iget currently does appropriate checks to handle stale
+ * inodes so everything is OK.
+ */
+ inode = f2fs_iget(sb, ino);
+ if (IS_ERR(inode))
+ return ERR_CAST(inode);
+ if (generation && inode->i_generation != generation) {
+ /* we didn't find the right inode.. */
+ iput(inode);
+ return ERR_PTR(-ESTALE);
+ }
+ return inode;
+}
+
+static struct dentry *f2fs_fh_to_dentry(struct super_block *sb, struct fid *fid,
+ int fh_len, int fh_type)
+{
+ return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+ f2fs_nfs_get_inode);
+}
+
+static struct dentry *f2fs_fh_to_parent(struct super_block *sb, struct fid *fid,
+ int fh_len, int fh_type)
+{
+ return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+ f2fs_nfs_get_inode);
+}
+
+static const struct export_operations f2fs_export_ops = {
+ .fh_to_dentry = f2fs_fh_to_dentry,
+ .fh_to_parent = f2fs_fh_to_parent,
+ .get_parent = f2fs_get_parent,
+};
+
+static int parse_options(struct f2fs_sb_info *sbi, char *options)
+{
+ substring_t args[MAX_OPT_ARGS];
+ char *p;
+ int arg = 0;
+
+ if (!options)
+ return 0;
+
+ while ((p = strsep(&options, ",")) != NULL) {
+ int token;
+ if (!*p)
+ continue;
+ /*
+ * Initialize args struct so we know whether arg was
+ * found; some options take optional arguments.
+ */
+ args[0].to = args[0].from = NULL;
+ token = match_token(p, f2fs_tokens, args);
+
+ switch (token) {
+ case Opt_gc_background_off:
+ clear_opt(sbi, BG_GC);
+ break;
+ case Opt_disable_roll_forward:
+ set_opt(sbi, DISABLE_ROLL_FORWARD);
+ break;
+ case Opt_discard:
+ set_opt(sbi, DISCARD);
+ break;
+ case Opt_noheap:
+ set_opt(sbi, NOHEAP);
+ break;
+#ifdef CONFIG_F2FS_FS_XATTR
+ case Opt_nouser_xattr:
+ clear_opt(sbi, XATTR_USER);
+ break;
+#else
+ case Opt_nouser_xattr:
+ pr_info("nouser_xattr options not supported\n");
+ break;
+#endif
+#ifdef CONFIG_F2FS_FS_POSIX_ACL
+ case Opt_noacl:
+ clear_opt(sbi, POSIX_ACL);
+ break;
+#else
+ case Opt_noacl:
+ pr_info("noacl options not supported\n");
+ break;
+#endif
+ case Opt_active_logs:
+ if (args->from && match_int(args, &arg))
+ return -EINVAL;
+ if (arg != 2 && arg != 4 && arg != 6)
+ return -EINVAL;
+ sbi->active_logs = arg;
+ break;
+ case Opt_disable_ext_identify:
+ set_opt(sbi, DISABLE_EXT_IDENTIFY);
+ break;
+ default:
+ return -EINVAL;
+ }
+ }
+ return 0;
+}
+
+static loff_t max_file_size(unsigned bits)
+{
+ loff_t result = ADDRS_PER_INODE;
+ loff_t leaf_count = ADDRS_PER_BLOCK;
+
+ /* two direct node blocks */
+ result += (leaf_count * 2);
+
+ /* two indirect node blocks */
+ leaf_count *= NIDS_PER_BLOCK;
+ result += (leaf_count * 2);
+
+ /* one double indirect node block */
+ leaf_count *= NIDS_PER_BLOCK;
+ result += leaf_count;
+
+ result <<= bits;
+ return result;
+}
+
+static int sanity_check_raw_super(struct f2fs_super_block *raw_super)
+{
+ unsigned int blocksize;
+
+ if (F2FS_SUPER_MAGIC != le32_to_cpu(raw_super->magic))
+ return 1;
+
+ /* Currently, support only 4KB block size */
+ blocksize = 1 << le32_to_cpu(raw_super->log_blocksize);
+ if (blocksize != PAGE_CACHE_SIZE)
+ return 1;
+ if (le32_to_cpu(raw_super->log_sectorsize) !=
+ F2FS_LOG_SECTOR_SIZE)
+ return 1;
+ if (le32_to_cpu(raw_super->log_sectors_per_block) !=
+ F2FS_LOG_SECTORS_PER_BLOCK)
+ return 1;
+ return 0;
+}
+
+static int sanity_check_ckpt(struct f2fs_super_block *raw_super,
+ struct f2fs_checkpoint *ckpt)
+{
+ unsigned int total, fsmeta;
+
+ total = le32_to_cpu(raw_super->segment_count);
+ fsmeta = le32_to_cpu(raw_super->segment_count_ckpt);
+ fsmeta += le32_to_cpu(raw_super->segment_count_sit);
+ fsmeta += le32_to_cpu(raw_super->segment_count_nat);
+ fsmeta += le32_to_cpu(ckpt->rsvd_segment_count);
+ fsmeta += le32_to_cpu(raw_super->segment_count_ssa);
+
+ if (fsmeta >= total)
+ return 1;
+ return 0;
+}
+
+static void init_sb_info(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_super_block *raw_super = sbi->raw_super;
+ int i;
+
+ sbi->log_sectors_per_block =
+ le32_to_cpu(raw_super->log_sectors_per_block);
+ sbi->log_blocksize = le32_to_cpu(raw_super->log_blocksize);
+ sbi->blocksize = 1 << sbi->log_blocksize;
+ sbi->log_blocks_per_seg = le32_to_cpu(raw_super->log_blocks_per_seg);
+ sbi->blocks_per_seg = 1 << sbi->log_blocks_per_seg;
+ sbi->segs_per_sec = le32_to_cpu(raw_super->segs_per_sec);
+ sbi->secs_per_zone = le32_to_cpu(raw_super->secs_per_zone);
+ sbi->total_sections = le32_to_cpu(raw_super->section_count);
+ sbi->total_node_count =
+ (le32_to_cpu(raw_super->segment_count_nat) / 2)
+ * sbi->blocks_per_seg * NAT_ENTRY_PER_BLOCK;
+ sbi->root_ino_num = le32_to_cpu(raw_super->root_ino);
+ sbi->node_ino_num = le32_to_cpu(raw_super->node_ino);
+ sbi->meta_ino_num = le32_to_cpu(raw_super->meta_ino);
+
+ for (i = 0; i < NR_COUNT_TYPE; i++)
+ atomic_set(&sbi->nr_pages[i], 0);
+}
+
+static int f2fs_fill_super(struct super_block *sb, void *data, int silent)
+{
+ struct f2fs_sb_info *sbi;
+ struct f2fs_super_block *raw_super;
+ struct buffer_head *raw_super_buf;
+ struct inode *root;
+ long err = -EINVAL;
+ int i;
+
+ /* allocate memory for f2fs-specific super block info */
+ sbi = kzalloc(sizeof(struct f2fs_sb_info), GFP_KERNEL);
+ if (!sbi)
+ return -ENOMEM;
+
+ /* set a temporary block size */
+ if (!sb_set_blocksize(sb, F2FS_BLKSIZE))
+ goto free_sbi;
+
+ /* read f2fs raw super block */
+ raw_super_buf = sb_bread(sb, 0);
+ if (!raw_super_buf) {
+ err = -EIO;
+ goto free_sbi;
+ }
+ raw_super = (struct f2fs_super_block *)
+ ((char *)raw_super_buf->b_data + F2FS_SUPER_OFFSET);
+
+ /* init some FS parameters */
+ sbi->active_logs = NR_CURSEG_TYPE;
+
+ set_opt(sbi, BG_GC);
+
+#ifdef CONFIG_F2FS_FS_XATTR
+ set_opt(sbi, XATTR_USER);
+#endif
+#ifdef CONFIG_F2FS_FS_POSIX_ACL
+ set_opt(sbi, POSIX_ACL);
+#endif
+ /* parse mount options */
+ if (parse_options(sbi, (char *)data))
+ goto free_sb_buf;
+
+ /* sanity checking of raw super */
+ if (sanity_check_raw_super(raw_super))
+ goto free_sb_buf;
+
+ sb->s_maxbytes = max_file_size(raw_super->log_blocksize);
+ sb->s_max_links = F2FS_LINK_MAX;
+ get_random_bytes(&sbi->s_next_generation, sizeof(u32));
+
+ sb->s_op = &f2fs_sops;
+ sb->s_xattr = f2fs_xattr_handlers;
+ sb->s_export_op = &f2fs_export_ops;
+ sb->s_magic = F2FS_SUPER_MAGIC;
+ sb->s_fs_info = sbi;
+ sb->s_time_gran = 1;
+ sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
+ (test_opt(sbi, POSIX_ACL) ? MS_POSIXACL : 0);
+ memcpy(sb->s_uuid, raw_super->uuid, sizeof(raw_super->uuid));
+
+ /* init f2fs-specific super block info */
+ sbi->sb = sb;
+ sbi->raw_super = raw_super;
+ sbi->raw_super_buf = raw_super_buf;
+ mutex_init(&sbi->gc_mutex);
+ mutex_init(&sbi->write_inode);
+ mutex_init(&sbi->writepages);
+ mutex_init(&sbi->cp_mutex);
+ for (i = 0; i < NR_LOCK_TYPE; i++)
+ mutex_init(&sbi->fs_lock[i]);
+ sbi->por_doing = 0;
+ spin_lock_init(&sbi->stat_lock);
+ init_rwsem(&sbi->bio_sem);
+ init_sb_info(sbi);
+
+ /* get an inode for meta space */
+ sbi->meta_inode = f2fs_iget(sb, F2FS_META_INO(sbi));
+ if (IS_ERR(sbi->meta_inode)) {
+ err = PTR_ERR(sbi->meta_inode);
+ goto free_sb_buf;
+ }
+
+ err = get_valid_checkpoint(sbi);
+ if (err)
+ goto free_meta_inode;
+
+ /* sanity checking of checkpoint */
+ err = -EINVAL;
+ if (sanity_check_ckpt(raw_super, sbi->ckpt))
+ goto free_cp;
+
+ sbi->total_valid_node_count =
+ le32_to_cpu(sbi->ckpt->valid_node_count);
+ sbi->total_valid_inode_count =
+ le32_to_cpu(sbi->ckpt->valid_inode_count);
+ sbi->user_block_count = le64_to_cpu(sbi->ckpt->user_block_count);
+ sbi->total_valid_block_count =
+ le64_to_cpu(sbi->ckpt->valid_block_count);
+ sbi->last_valid_block_count = sbi->total_valid_block_count;
+ sbi->alloc_valid_block_count = 0;
+ INIT_LIST_HEAD(&sbi->dir_inode_list);
+ spin_lock_init(&sbi->dir_inode_lock);
+
+ /* init super block */
+ if (!sb_set_blocksize(sb, sbi->blocksize))
+ goto free_cp;
+
+ init_orphan_info(sbi);
+
+ /* setup f2fs internal modules */
+ err = build_segment_manager(sbi);
+ if (err)
+ goto free_sm;
+ err = build_node_manager(sbi);
+ if (err)
+ goto free_nm;
+
+ build_gc_manager(sbi);
+
+ /* get an inode for node space */
+ sbi->node_inode = f2fs_iget(sb, F2FS_NODE_INO(sbi));
+ if (IS_ERR(sbi->node_inode)) {
+ err = PTR_ERR(sbi->node_inode);
+ goto free_nm;
+ }
+
+ /* if there are nt orphan nodes free them */
+ err = -EINVAL;
+ if (!(sbi->ckpt->ckpt_flags & CP_UMOUNT_FLAG) &&
+ recover_orphan_inodes(sbi))
+ goto free_node_inode;
+
+ /* read root inode and dentry */
+ root = f2fs_iget(sb, F2FS_ROOT_INO(sbi));
+ if (IS_ERR(root)) {
+ err = PTR_ERR(root);
+ goto free_node_inode;
+ }
+ if (!S_ISDIR(root->i_mode) || !root->i_blocks || !root->i_size)
+ goto free_root_inode;
+
+ sb->s_root = d_make_root(root); /* allocate root dentry */
+ if (!sb->s_root) {
+ err = -ENOMEM;
+ goto free_root_inode;
+ }
+
+ /* recover fsynced data */
+ if (!(sbi->ckpt->ckpt_flags & CP_UMOUNT_FLAG) &&
+ !test_opt(sbi, DISABLE_ROLL_FORWARD))
+ recover_fsync_data(sbi);
+
+ /* After POR, we can run background GC thread */
+ err = start_gc_thread(sbi);
+ if (err)
+ goto fail;
+
+ err = f2fs_build_stats(sbi);
+ if (err)
+ goto fail;
+
+ return 0;
+fail:
+ stop_gc_thread(sbi);
+free_root_inode:
+ dput(sb->s_root);
+ sb->s_root = NULL;
+free_node_inode:
+ iput(sbi->node_inode);
+free_nm:
+ destroy_node_manager(sbi);
+free_sm:
+ destroy_segment_manager(sbi);
+free_cp:
+ kfree(sbi->ckpt);
+free_meta_inode:
+ make_bad_inode(sbi->meta_inode);
+ iput(sbi->meta_inode);
+free_sb_buf:
+ brelse(raw_super_buf);
+free_sbi:
+ kfree(sbi);
+ return err;
+}
+
+static struct dentry *f2fs_mount(struct file_system_type *fs_type, int flags,
+ const char *dev_name, void *data)
+{
+ return mount_bdev(fs_type, flags, dev_name, data, f2fs_fill_super);
+}
+
+static struct file_system_type f2fs_fs_type = {
+ .owner = THIS_MODULE,
+ .name = "f2fs",
+ .mount = f2fs_mount,
+ .kill_sb = kill_block_super,
+ .fs_flags = FS_REQUIRES_DEV,
+};
+
+static int init_inodecache(void)
+{
+ f2fs_inode_cachep = f2fs_kmem_cache_create("f2fs_inode_cache",
+ sizeof(struct f2fs_inode_info), NULL);
+ if (f2fs_inode_cachep == NULL)
+ return -ENOMEM;
+ return 0;
+}
+
+static void destroy_inodecache(void)
+{
+ /*
+ * Make sure all delayed rcu free inodes are flushed before we
+ * destroy cache.
+ */
+ rcu_barrier();
+ kmem_cache_destroy(f2fs_inode_cachep);
+}
+
+static int __init init_f2fs_fs(void)
+{
+ int err;
+
+ err = init_inodecache();
+ if (err)
+ goto fail;
+ err = create_node_manager_caches();
+ if (err)
+ goto fail;
+ err = create_gc_caches();
+ if (err)
+ goto fail;
+ err = create_checkpoint_caches();
+ if (err)
+ goto fail;
+ return register_filesystem(&f2fs_fs_type);
+fail:
+ return err;
+}
+
+static void __exit exit_f2fs_fs(void)
+{
+ destroy_root_stats();
+ unregister_filesystem(&f2fs_fs_type);
+ destroy_checkpoint_caches();
+ destroy_gc_caches();
+ destroy_node_manager_caches();
+ destroy_inodecache();
+}
+
+module_init(init_f2fs_fs)
+module_exit(exit_f2fs_fs)
+
+MODULE_AUTHOR("Samsung Electronics's Praesto Team");
+MODULE_DESCRIPTION("Flash Friendly File System");
+MODULE_LICENSE("GPL");
--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-10-31 09:43:51

[permalink] [raw]

Subject: [PATCH 05/17] f2fs: add checkpoint operations

This adds functions required by the checkpoint operations.

Basically, f2fs adopts a roll-back model with checkpoint blocks written in the
CP area. The checkpoint procedure includes as follows.

- write_checkpoint()
1. block_operations() freezes VFS calls.
2. submit cached bios.
3. flush_nat_entries() writes NAT pages updated by dirty NAT entries.
4. flush_sit_entries() writes SIT pages updated by dirty SIT entries.
5. do_checkpoint() writes,
- checkpoint block (#0)
- orphan inode blocks
- summary blocks made by active logs
- checkpoint block (copy of #0)
6. unblock_opeations()

In order to provide an address space for meta pages, f2fs_sb_info has a special
inode, namely meta_inode. This patch also adds the address space operations for
meta_inode.

Signed-off-by: Chul Lee <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/checkpoint.c | 792 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 792 insertions(+)
create mode 100644 fs/f2fs/checkpoint.c

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
new file mode 100644
index 0000000..ab743f9
--- /dev/null
+++ b/fs/f2fs/checkpoint.c
@@ -0,0 +1,792 @@
+/**
+ * fs/f2fs/checkpoint.c
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/fs.h>
+#include <linux/bio.h>
+#include <linux/mpage.h>
+#include <linux/writeback.h>
+#include <linux/blkdev.h>
+#include <linux/f2fs_fs.h>
+#include <linux/pagevec.h>
+#include <linux/swap.h>
+
+#include "f2fs.h"
+#include "node.h"
+#include "segment.h"
+
+static struct kmem_cache *orphan_entry_slab;
+static struct kmem_cache *inode_entry_slab;
+
+/**
+ * We guarantee no failure on the returned page.
+ */
+struct page *grab_meta_page(struct f2fs_sb_info *sbi, pgoff_t index)
+{
+ struct address_space *mapping = sbi->meta_inode->i_mapping;
+ struct page *page = NULL;
+repeat:
+ page = grab_cache_page(mapping, index);
+ if (!page) {
+ cond_resched();
+ goto repeat;
+ }
+
+ /* We wait writeback only inside grab_meta_page() */
+ wait_on_page_writeback(page);
+ SetPageUptodate(page);
+ return page;
+}
+
+/**
+ * We guarantee no failure on the returned page.
+ */
+struct page *get_meta_page(struct f2fs_sb_info *sbi, pgoff_t index)
+{
+ struct address_space *mapping = sbi->meta_inode->i_mapping;
+ struct page *page;
+repeat:
+ page = grab_cache_page(mapping, index);
+ if (!page) {
+ cond_resched();
+ goto repeat;
+ }
+ if (f2fs_readpage(sbi, page, index, READ_SYNC)) {
+ f2fs_put_page(page, 1);
+ goto repeat;
+ }
+ mark_page_accessed(page);
+
+ /* We do not allow returning an errorneous page */
+ return page;
+}
+
+static int f2fs_write_meta_page(struct page *page,
+ struct writeback_control *wbc)
+{
+ struct inode *inode = page->mapping->host;
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ int err;
+
+ wait_on_page_writeback(page);
+
+ err = write_meta_page(sbi, page, wbc);
+ if (err) {
+ wbc->pages_skipped++;
+ set_page_dirty(page);
+ }
+
+ dec_page_count(sbi, F2FS_DIRTY_META);
+
+ /* In this case, we should not unlock this page */
+ if (err != AOP_WRITEPAGE_ACTIVATE)
+ unlock_page(page);
+ return err;
+}
+
+static int f2fs_write_meta_pages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb);
+ struct block_device *bdev = sbi->sb->s_bdev;
+ long written;
+
+ if (wbc->for_kupdate)
+ return 0;
+
+ if (get_pages(sbi, F2FS_DIRTY_META) == 0)
+ return 0;
+
+ /* if mounting is failed, skip writing node pages */
+ mutex_lock(&sbi->cp_mutex);
+ written = sync_meta_pages(sbi, META, bio_get_nr_vecs(bdev));
+ mutex_unlock(&sbi->cp_mutex);
+ wbc->nr_to_write -= written;
+ return 0;
+}
+
+long sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type,
+ long nr_to_write)
+{
+ struct address_space *mapping = sbi->meta_inode->i_mapping;
+ pgoff_t index = 0, end = LONG_MAX;
+ struct pagevec pvec;
+ long nwritten = 0;
+ struct writeback_control wbc = {
+ .for_reclaim = 0,
+ };
+
+ pagevec_init(&pvec, 0);
+
+ while (index <= end) {
+ int i, nr_pages;
+ nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+ PAGECACHE_TAG_DIRTY,
+ min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
+ if (nr_pages == 0)
+ break;
+
+ for (i = 0; i < nr_pages; i++) {
+ struct page *page = pvec.pages[i];
+ lock_page(page);
+ BUG_ON(page->mapping != mapping);
+ BUG_ON(!PageDirty(page));
+ clear_page_dirty_for_io(page);
+ f2fs_write_meta_page(page, &wbc);
+ if (nwritten++ >= nr_to_write)
+ break;
+ }
+ pagevec_release(&pvec);
+ cond_resched();
+ }
+
+ if (nwritten)
+ f2fs_submit_bio(sbi, type, nr_to_write == LONG_MAX);
+
+ return nwritten;
+}
+
+static int f2fs_set_meta_page_dirty(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb);
+
+ SetPageUptodate(page);
+ if (!PageDirty(page)) {
+ __set_page_dirty_nobuffers(page);
+ inc_page_count(sbi, F2FS_DIRTY_META);
+ F2FS_SET_SB_DIRT(sbi);
+ return 1;
+ }
+ return 0;
+}
+
+const struct address_space_operations f2fs_meta_aops = {
+ .writepage = f2fs_write_meta_page,
+ .writepages = f2fs_write_meta_pages,
+ .set_page_dirty = f2fs_set_meta_page_dirty,
+};
+
+int check_orphan_space(struct f2fs_sb_info *sbi)
+{
+ unsigned int max_orphans;
+ int err = 0;
+
+ /*
+ * considering 512 blocks in a segment 5 blocks are needed for cp
+ * and log segment summaries. Remaining blocks are used to keep
+ * orphan entries with the limitation one reserved segment
+ * for cp pack we can have max 1020*507 orphan entries
+ */
+ max_orphans = (sbi->blocks_per_seg - 5) * F2FS_ORPHANS_PER_BLOCK;
+ mutex_lock(&sbi->orphan_inode_mutex);
+ if (sbi->n_orphans >= max_orphans)
+ err = -ENOSPC;
+ mutex_unlock(&sbi->orphan_inode_mutex);
+ return err;
+}
+
+void add_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino)
+{
+ struct list_head *head, *this;
+ struct orphan_inode_entry *new = NULL, *orphan = NULL;
+
+ mutex_lock(&sbi->orphan_inode_mutex);
+ head = &sbi->orphan_inode_list;
+ list_for_each(this, head) {
+ orphan = list_entry(this, struct orphan_inode_entry, list);
+ if (orphan->ino == ino)
+ goto out;
+ if (orphan->ino > ino)
+ break;
+ orphan = NULL;
+ }
+retry:
+ new = kmem_cache_alloc(orphan_entry_slab, GFP_ATOMIC);
+ if (!new) {
+ cond_resched();
+ goto retry;
+ }
+ new->ino = ino;
+ INIT_LIST_HEAD(&new->list);
+
+ /* add new_oentry into list which is sorted by inode number */
+ if (orphan) {
+ struct orphan_inode_entry *prev;
+
+ /* get previous entry */
+ prev = list_entry(orphan->list.prev, typeof(*prev), list);
+ if (&prev->list != head)
+ /* insert new orphan inode entry */
+ list_add(&new->list, &prev->list);
+ else
+ list_add(&new->list, head);
+ } else {
+ list_add_tail(&new->list, head);
+ }
+ sbi->n_orphans++;
+out:
+ mutex_unlock(&sbi->orphan_inode_mutex);
+}
+
+void remove_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino)
+{
+ struct list_head *this, *next, *head;
+ struct orphan_inode_entry *orphan;
+
+ mutex_lock(&sbi->orphan_inode_mutex);
+ head = &sbi->orphan_inode_list;
+ list_for_each_safe(this, next, head) {
+ orphan = list_entry(this, struct orphan_inode_entry, list);
+ if (orphan->ino == ino) {
+ list_del(&orphan->list);
+ kmem_cache_free(orphan_entry_slab, orphan);
+ sbi->n_orphans--;
+ break;
+ }
+ }
+ mutex_unlock(&sbi->orphan_inode_mutex);
+}
+
+static void recover_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino)
+{
+ struct inode *inode = f2fs_iget(sbi->sb, ino);
+ BUG_ON(IS_ERR(inode));
+ clear_nlink(inode);
+
+ /* truncate all the data during iput */
+ iput(inode);
+}
+
+int recover_orphan_inodes(struct f2fs_sb_info *sbi)
+{
+ block_t start_blk, orphan_blkaddr, i, j;
+
+ if (!(F2FS_CKPT(sbi)->ckpt_flags & CP_ORPHAN_PRESENT_FLAG))
+ return 0;
+
+ sbi->por_doing = 1;
+ start_blk = __start_cp_addr(sbi) + 1;
+ orphan_blkaddr = __start_sum_addr(sbi) - 1;
+
+ for (i = 0; i < orphan_blkaddr; i++) {
+ struct page *page = get_meta_page(sbi, start_blk + i);
+ struct f2fs_orphan_block *orphan_blk;
+
+ orphan_blk = (struct f2fs_orphan_block *)page_address(page);
+ for (j = 0; j < le32_to_cpu(orphan_blk->entry_count); j++) {
+ nid_t ino = le32_to_cpu(orphan_blk->ino[j]);
+ recover_orphan_inode(sbi, ino);
+ }
+ f2fs_put_page(page, 1);
+ }
+ /* clear Orphan Flag */
+ F2FS_CKPT(sbi)->ckpt_flags &= (~CP_ORPHAN_PRESENT_FLAG);
+ sbi->por_doing = 0;
+ return 0;
+}
+
+static void write_orphan_inodes(struct f2fs_sb_info *sbi, block_t start_blk)
+{
+ struct list_head *head, *this, *next;
+ struct f2fs_orphan_block *orphan_blk = NULL;
+ struct page *page = NULL;
+ unsigned int nentries = 0;
+ unsigned short index = 1;
+ unsigned short orphan_blocks;
+
+ orphan_blocks = (unsigned short)((sbi->n_orphans +
+ (F2FS_ORPHANS_PER_BLOCK - 1)) / F2FS_ORPHANS_PER_BLOCK);
+
+ mutex_lock(&sbi->orphan_inode_mutex);
+ head = &sbi->orphan_inode_list;
+
+ /* loop for each orphan inode entry and write them in Jornal block */
+ list_for_each_safe(this, next, head) {
+ struct orphan_inode_entry *orphan;
+
+ orphan = list_entry(this, struct orphan_inode_entry, list);
+
+ if (nentries == F2FS_ORPHANS_PER_BLOCK) {
+ /*
+ * an orphan block is full of 1020 entries,
+ * then we need to flush current orphan blocks
+ * and bring another one in memory
+ */
+ orphan_blk->blk_addr = cpu_to_le16(index);
+ orphan_blk->blk_count = cpu_to_le16(orphan_blocks);
+ orphan_blk->entry_count = cpu_to_le32(nentries);
+ set_page_dirty(page);
+ f2fs_put_page(page, 1);
+ index++;
+ start_blk++;
+ nentries = 0;
+ page = NULL;
+ }
+ if (page)
+ goto page_exist;
+
+ page = grab_meta_page(sbi, start_blk);
+ orphan_blk = (struct f2fs_orphan_block *)page_address(page);
+ memset(orphan_blk, 0, sizeof(*orphan_blk));
+page_exist:
+ orphan_blk->ino[nentries++] = cpu_to_le32(orphan->ino);
+ }
+ if (!page)
+ goto end;
+
+ orphan_blk->blk_addr = cpu_to_le16(index);
+ orphan_blk->blk_count = cpu_to_le16(orphan_blocks);
+ orphan_blk->entry_count = cpu_to_le32(nentries);
+ set_page_dirty(page);
+ f2fs_put_page(page, 1);
+end:
+ mutex_unlock(&sbi->orphan_inode_mutex);
+}
+
+static struct page *validate_checkpoint(struct f2fs_sb_info *sbi,
+ block_t cp_addr, unsigned long long *version)
+{
+ struct page *cp_page_1, *cp_page_2 = NULL;
+ unsigned long blk_size = sbi->blocksize;
+ struct f2fs_checkpoint *cp_block;
+ unsigned long long cur_version = 0, pre_version = 0;
+ unsigned int crc = 0;
+ size_t crc_offset;
+
+ /* Read the 1st cp block in this CP pack */
+ cp_page_1 = get_meta_page(sbi, cp_addr);
+
+ /* get the version number */
+ cp_block = (struct f2fs_checkpoint *)page_address(cp_page_1);
+ crc_offset = le32_to_cpu(cp_block->checksum_offset);
+ if (crc_offset >= blk_size)
+ goto invalid_cp1;
+
+ crc = *(unsigned int *)((unsigned char *)cp_block + crc_offset);
+ if (!f2fs_crc_valid(crc, cp_block, crc_offset))
+ goto invalid_cp1;
+
+ pre_version = le64_to_cpu(cp_block->checkpoint_ver);
+
+ /* Read the 2nd cp block in this CP pack */
+ cp_addr += le64_to_cpu(cp_block->cp_pack_total_block_count) - 1;
+ cp_page_2 = get_meta_page(sbi, cp_addr);
+
+ cp_block = (struct f2fs_checkpoint *)page_address(cp_page_2);
+ crc_offset = le32_to_cpu(cp_block->checksum_offset);
+ if (crc_offset >= blk_size)
+ goto invalid_cp2;
+
+ crc = *(unsigned int *)((unsigned char *)cp_block + crc_offset);
+ if (!f2fs_crc_valid(crc, cp_block, crc_offset))
+ goto invalid_cp2;
+
+ cur_version = le64_to_cpu(cp_block->checkpoint_ver);
+
+ if (cur_version == pre_version) {
+ *version = cur_version;
+ f2fs_put_page(cp_page_2, 1);
+ return cp_page_1;
+ }
+invalid_cp2:
+ f2fs_put_page(cp_page_2, 1);
+invalid_cp1:
+ f2fs_put_page(cp_page_1, 1);
+ return NULL;
+}
+
+int get_valid_checkpoint(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_checkpoint *cp_block;
+ struct f2fs_super_block *fsb = sbi->raw_super;
+ struct page *cp1, *cp2, *cur_page;
+ unsigned long blk_size = sbi->blocksize;
+ unsigned long long cp1_version = 0, cp2_version = 0;
+ unsigned long long cp_start_blk_no;
+
+ sbi->ckpt = kzalloc(blk_size, GFP_KERNEL);
+ if (!sbi->ckpt)
+ return -ENOMEM;
+ /*
+ * Finding out valid cp block involves read both
+ * sets( cp pack1 and cp pack 2)
+ */
+ cp_start_blk_no = le32_to_cpu(fsb->cp_blkaddr);
+ cp1 = validate_checkpoint(sbi, cp_start_blk_no, &cp1_version);
+
+ /* The second checkpoint pack should start at the next segment */
+ cp_start_blk_no += 1 << le32_to_cpu(fsb->log_blocks_per_seg);
+ cp2 = validate_checkpoint(sbi, cp_start_blk_no, &cp2_version);
+
+ if (cp1 && cp2) {
+ if (ver_after(cp2_version, cp1_version))
+ cur_page = cp2;
+ else
+ cur_page = cp1;
+ } else if (cp1) {
+ cur_page = cp1;
+ } else if (cp2) {
+ cur_page = cp2;
+ } else {
+ goto fail_no_cp;
+ }
+
+ cp_block = (struct f2fs_checkpoint *)page_address(cur_page);
+ memcpy(sbi->ckpt, cp_block, blk_size);
+
+ f2fs_put_page(cp1, 1);
+ f2fs_put_page(cp2, 1);
+ return 0;
+
+fail_no_cp:
+ kfree(sbi->ckpt);
+ return -EINVAL;
+}
+
+void set_dirty_dir_page(struct inode *inode, struct page *page)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ struct list_head *head = &sbi->dir_inode_list;
+ struct dir_inode_entry *new;
+ struct list_head *this;
+
+ if (!S_ISDIR(inode->i_mode))
+ return;
+retry:
+ new = kmem_cache_alloc(inode_entry_slab, GFP_NOFS);
+ if (!new) {
+ cond_resched();
+ goto retry;
+ }
+ new->inode = inode;
+ INIT_LIST_HEAD(&new->list);
+
+ spin_lock(&sbi->dir_inode_lock);
+ list_for_each(this, head) {
+ struct dir_inode_entry *entry;
+ entry = list_entry(this, struct dir_inode_entry, list);
+ if (entry->inode == inode) {
+ kmem_cache_free(inode_entry_slab, new);
+ goto out;
+ }
+ }
+ list_add_tail(&new->list, head);
+ sbi->n_dirty_dirs++;
+
+ BUG_ON(!S_ISDIR(inode->i_mode));
+out:
+ inc_page_count(sbi, F2FS_DIRTY_DENTS);
+ inode_inc_dirty_dents(inode);
+ SetPagePrivate(page);
+
+ spin_unlock(&sbi->dir_inode_lock);
+}
+
+void remove_dirty_dir_inode(struct inode *inode)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ struct list_head *head = &sbi->dir_inode_list;
+ struct list_head *this;
+
+ if (!S_ISDIR(inode->i_mode))
+ return;
+
+ spin_lock(&sbi->dir_inode_lock);
+ if (atomic_read(&F2FS_I(inode)->dirty_dents))
+ goto out;
+
+ list_for_each(this, head) {
+ struct dir_inode_entry *entry;
+ entry = list_entry(this, struct dir_inode_entry, list);
+ if (entry->inode == inode) {
+ list_del(&entry->list);
+ kmem_cache_free(inode_entry_slab, entry);
+ sbi->n_dirty_dirs--;
+ break;
+ }
+ }
+out:
+ spin_unlock(&sbi->dir_inode_lock);
+}
+
+void sync_dirty_dir_inodes(struct f2fs_sb_info *sbi)
+{
+ struct list_head *head = &sbi->dir_inode_list;
+ struct dir_inode_entry *entry;
+ struct inode *inode;
+retry:
+ spin_lock(&sbi->dir_inode_lock);
+ if (list_empty(head)) {
+ spin_unlock(&sbi->dir_inode_lock);
+ return;
+ }
+ entry = list_entry(head->next, struct dir_inode_entry, list);
+ inode = igrab(entry->inode);
+ spin_unlock(&sbi->dir_inode_lock);
+ if (inode) {
+ filemap_flush(inode->i_mapping);
+ iput(inode);
+ } else {
+ /*
+ * We should submit bio, since it exists several
+ * wribacking dentry pages in the freeing inode.
+ */
+ f2fs_submit_bio(sbi, DATA, true);
+ }
+ goto retry;
+}
+
+/**
+ * Freeze all the FS-operations for checkpoint.
+ */
+void block_operations(struct f2fs_sb_info *sbi)
+{
+ int t;
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_ALL,
+ .nr_to_write = LONG_MAX,
+ .for_reclaim = 0,
+ };
+
+ /* Stop renaming operation */
+ mutex_lock_op(sbi, RENAME);
+ mutex_lock_op(sbi, DENTRY_OPS);
+
+retry_dents:
+ /* write all the dirty dentry pages */
+ sync_dirty_dir_inodes(sbi);
+
+ mutex_lock_op(sbi, DATA_WRITE);
+ if (get_pages(sbi, F2FS_DIRTY_DENTS)) {
+ mutex_unlock_op(sbi, DATA_WRITE);
+ goto retry_dents;
+ }
+
+ /* block all the operations */
+ for (t = DATA_NEW; t <= NODE_TRUNC; t++)
+ mutex_lock_op(sbi, t);
+
+ mutex_lock(&sbi->write_inode);
+
+ /*
+ * POR: we should ensure that there is no dirty node pages
+ * until finishing nat/sit flush.
+ */
+retry:
+ sync_node_pages(sbi, 0, &wbc);
+
+ mutex_lock_op(sbi, NODE_WRITE);
+
+ if (get_pages(sbi, F2FS_DIRTY_NODES)) {
+ mutex_unlock_op(sbi, NODE_WRITE);
+ goto retry;
+ }
+ mutex_unlock(&sbi->write_inode);
+}
+
+static void unblock_operations(struct f2fs_sb_info *sbi)
+{
+ int t;
+ for (t = NODE_WRITE; t >= RENAME; t--)
+ mutex_unlock_op(sbi, t);
+}
+
+static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount)
+{
+ struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
+ nid_t last_nid = 0;
+ block_t start_blk;
+ struct page *cp_page;
+ unsigned int data_sum_blocks, orphan_blocks;
+ void *kaddr;
+ __u32 crc32 = 0;
+ int i;
+
+ /* Flush all the NAT/SIT pages */
+ while (get_pages(sbi, F2FS_DIRTY_META))
+ sync_meta_pages(sbi, META, LONG_MAX);
+
+ next_free_nid(sbi, &last_nid);
+
+ /*
+ * modify checkpoint
+ * version number is already updated
+ */
+ ckpt->elapsed_time = cpu_to_le64(get_mtime(sbi));
+ ckpt->valid_block_count = cpu_to_le64(valid_user_blocks(sbi));
+ ckpt->free_segment_count = cpu_to_le32(free_segments(sbi));
+ for (i = 0; i < 3; i++) {
+ ckpt->cur_node_segno[i] =
+ cpu_to_le32(curseg_segno(sbi, i + CURSEG_HOT_NODE));
+ ckpt->cur_node_blkoff[i] =
+ cpu_to_le16(curseg_blkoff(sbi, i + CURSEG_HOT_NODE));
+ ckpt->alloc_type[i + CURSEG_HOT_NODE] =
+ curseg_alloc_type(sbi, i + CURSEG_HOT_NODE);
+ }
+ for (i = 0; i < 3; i++) {
+ ckpt->cur_data_segno[i] =
+ cpu_to_le32(curseg_segno(sbi, i + CURSEG_HOT_DATA));
+ ckpt->cur_data_blkoff[i] =
+ cpu_to_le16(curseg_blkoff(sbi, i + CURSEG_HOT_DATA));
+ ckpt->alloc_type[i + CURSEG_HOT_DATA] =
+ curseg_alloc_type(sbi, i + CURSEG_HOT_DATA);
+ }
+
+ ckpt->valid_node_count = cpu_to_le32(valid_node_count(sbi));
+ ckpt->valid_inode_count = cpu_to_le32(valid_inode_count(sbi));
+ ckpt->next_free_nid = cpu_to_le32(last_nid);
+
+ /* 2 cp + n data seg summary + orphan inode blocks */
+ data_sum_blocks = npages_for_summary_flush(sbi);
+ if (data_sum_blocks < 3)
+ ckpt->ckpt_flags |= CP_COMPACT_SUM_FLAG;
+ else
+ ckpt->ckpt_flags &= (~CP_COMPACT_SUM_FLAG);
+
+ orphan_blocks = (sbi->n_orphans + F2FS_ORPHANS_PER_BLOCK - 1)
+ / F2FS_ORPHANS_PER_BLOCK;
+ ckpt->cp_pack_start_sum = 1 + orphan_blocks;
+ ckpt->cp_pack_total_block_count = 2 + data_sum_blocks + orphan_blocks;
+
+ if (is_umount) {
+ ckpt->ckpt_flags |= CP_UMOUNT_FLAG;
+ ckpt->cp_pack_total_block_count += NR_CURSEG_NODE_TYPE;
+ } else {
+ ckpt->ckpt_flags &= (~CP_UMOUNT_FLAG);
+ }
+
+ if (sbi->n_orphans)
+ ckpt->ckpt_flags |= CP_ORPHAN_PRESENT_FLAG;
+ else
+ ckpt->ckpt_flags &= (~CP_ORPHAN_PRESENT_FLAG);
+
+ /* update SIT/NAT bitmap */
+ get_sit_bitmap(sbi, __bitmap_ptr(sbi, SIT_BITMAP));
+ get_nat_bitmap(sbi, __bitmap_ptr(sbi, NAT_BITMAP));
+
+ crc32 = f2fs_crc32(ckpt, le32_to_cpu(ckpt->checksum_offset));
+ *(__u32 *)((unsigned char *)ckpt +
+ le32_to_cpu(ckpt->checksum_offset))
+ = cpu_to_le32(crc32);
+
+ start_blk = __start_cp_addr(sbi);
+
+ /* write out checkpoint buffer at block 0 */
+ cp_page = grab_meta_page(sbi, start_blk++);
+ kaddr = page_address(cp_page);
+ memcpy(kaddr, ckpt, (1 << sbi->log_blocksize));
+ set_page_dirty(cp_page);
+ f2fs_put_page(cp_page, 1);
+
+ if (sbi->n_orphans) {
+ write_orphan_inodes(sbi, start_blk);
+ start_blk += orphan_blocks;
+ }
+
+ write_data_summaries(sbi, start_blk);
+ start_blk += data_sum_blocks;
+ if (is_umount) {
+ write_node_summaries(sbi, start_blk);
+ start_blk += NR_CURSEG_NODE_TYPE;
+ }
+
+ /* writeout checkpoint block */
+ cp_page = grab_meta_page(sbi, start_blk);
+ kaddr = page_address(cp_page);
+ memcpy(kaddr, ckpt, (1 << sbi->log_blocksize));
+ set_page_dirty(cp_page);
+ f2fs_put_page(cp_page, 1);
+
+ /* wait for previous submitted node/meta pages writeback */
+ while (get_pages(sbi, F2FS_WRITEBACK))
+ congestion_wait(BLK_RW_ASYNC, HZ / 50);
+
+ filemap_fdatawait_range(sbi->node_inode->i_mapping, 0, LONG_MAX);
+ filemap_fdatawait_range(sbi->meta_inode->i_mapping, 0, LONG_MAX);
+
+ /* update user_block_counts */
+ sbi->last_valid_block_count = sbi->total_valid_block_count;
+ sbi->alloc_valid_block_count = 0;
+
+ /* Here, we only have one bio having CP pack */
+ if (sbi->ckpt->ckpt_flags & CP_ERROR_FLAG)
+ sbi->sb->s_flags |= MS_RDONLY;
+ else
+ sync_meta_pages(sbi, META_FLUSH, LONG_MAX);
+
+ clear_prefree_segments(sbi);
+ F2FS_RESET_SB_DIRT(sbi);
+}
+
+/**
+ * We guarantee that this checkpoint procedure should not fail.
+ */
+void write_checkpoint(struct f2fs_sb_info *sbi, bool blocked, bool is_umount)
+{
+ struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
+ unsigned long long ckpt_ver;
+
+ if (!blocked) {
+ mutex_lock(&sbi->cp_mutex);
+ block_operations(sbi);
+ }
+
+ f2fs_submit_bio(sbi, DATA, true);
+ f2fs_submit_bio(sbi, NODE, true);
+ f2fs_submit_bio(sbi, META, true);
+
+ /*
+ * update checkpoint pack index
+ * Increase the version number so that
+ * SIT entries and seg summaries are written at correct place
+ */
+ ckpt_ver = le64_to_cpu(ckpt->checkpoint_ver);
+ ckpt->checkpoint_ver = cpu_to_le64(++ckpt_ver);
+
+ /* write cached NAT/SIT entries to NAT/SIT area */
+ flush_nat_entries(sbi);
+ flush_sit_entries(sbi);
+
+ reset_victim_segmap(sbi);
+
+ /* unlock all the fs_lock[] in do_checkpoint() */
+ do_checkpoint(sbi, is_umount);
+
+ unblock_operations(sbi);
+ mutex_unlock(&sbi->cp_mutex);
+}
+
+void init_orphan_info(struct f2fs_sb_info *sbi)
+{
+ mutex_init(&sbi->orphan_inode_mutex);
+ INIT_LIST_HEAD(&sbi->orphan_inode_list);
+ sbi->n_orphans = 0;
+}
+
+int create_checkpoint_caches(void)
+{
+ orphan_entry_slab = f2fs_kmem_cache_create("f2fs_orphan_entry",
+ sizeof(struct orphan_inode_entry), NULL);
+ if (unlikely(!orphan_entry_slab))
+ return -ENOMEM;
+ inode_entry_slab = f2fs_kmem_cache_create("f2fs_dirty_dir_entry",
+ sizeof(struct dir_inode_entry), NULL);
+ if (unlikely(!inode_entry_slab)) {
+ kmem_cache_destroy(orphan_entry_slab);
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+void destroy_checkpoint_caches(void)
+{
+ kmem_cache_destroy(orphan_entry_slab);
+ kmem_cache_destroy(inode_entry_slab);
+}
--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-10-31 09:44:23

[permalink] [raw]

Subject: [PATCH 06/17] f2fs: add node operations

This adds specific functions to manage NAT pages, a cache for NAT entries, free
nids, direct/indirect node blocks for indexing data, and address space for node
pages.

- The key information of an NAT entry consists of a node id and a block address.

- An NAT page is composed of block addresses covered by a certain range of NAT
entries, which is maintained by the address space of meta_inode.

- A radix tree structure is used to cache NAT entries. The index for the tree
is a node id.

- When there is no free nid, F2FS should scan NAT entries to find new one. In
order to avoid scanning frequently, F2FS manages a list containing a number of
free nids in memory. Only when free nids in the list are exhausted, scanning
process, build_free_nids(), is triggered.

- F2FS has direct and indirect node blocks for indexing data. This patch adds
fuctions related to the node block management such as getting, allocating, and
truncating node blocks to index data.

- In order to cache node blocks in memory, F2FS has a node_inode with an address
space for node pages. This patch also adds the address space operations for
node_inode.

Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/node.c | 1763 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 1763 insertions(+)
create mode 100644 fs/f2fs/node.c

diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
new file mode 100644
index 0000000..216f04d
--- /dev/null
+++ b/fs/f2fs/node.c
@@ -0,0 +1,1763 @@
+/**
+ * fs/f2fs/node.c
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/fs.h>
+#include <linux/f2fs_fs.h>
+#include <linux/mpage.h>
+#include <linux/backing-dev.h>
+#include <linux/blkdev.h>
+#include <linux/pagevec.h>
+#include <linux/swap.h>
+
+#include "f2fs.h"
+#include "node.h"
+#include "segment.h"
+
+static struct kmem_cache *nat_entry_slab;
+static struct kmem_cache *free_nid_slab;
+
+static void clear_node_page_dirty(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb);
+ unsigned int long flags;
+
+ if (PageDirty(page)) {
+ spin_lock_irqsave(&mapping->tree_lock, flags);
+ radix_tree_tag_clear(&mapping->page_tree,
+ page_index(page),
+ PAGECACHE_TAG_DIRTY);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
+
+ clear_page_dirty_for_io(page);
+ dec_page_count(sbi, F2FS_DIRTY_NODES);
+ }
+ ClearPageUptodate(page);
+}
+
+static struct page *get_current_nat_page(struct f2fs_sb_info *sbi, nid_t nid)
+{
+ pgoff_t index = current_nat_addr(sbi, nid);
+ return get_meta_page(sbi, index);
+}
+
+static struct page *get_next_nat_page(struct f2fs_sb_info *sbi, nid_t nid)
+{
+ struct page *src_page;
+ struct page *dst_page;
+ pgoff_t src_off;
+ pgoff_t dst_off;
+ void *src_addr;
+ void *dst_addr;
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+
+ src_off = current_nat_addr(sbi, nid);
+ dst_off = next_nat_addr(sbi, src_off);
+
+ /* get current nat block page with lock */
+ src_page = get_meta_page(sbi, src_off);
+
+ /* Dirty src_page means that it is already the new target NAT page. */
+ if (PageDirty(src_page))
+ return src_page;
+
+ dst_page = grab_meta_page(sbi, dst_off);
+
+ src_addr = page_address(src_page);
+ dst_addr = page_address(dst_page);
+ memcpy(dst_addr, src_addr, PAGE_CACHE_SIZE);
+ set_page_dirty(dst_page);
+ f2fs_put_page(src_page, 1);
+
+ set_to_next_nat(nm_i, nid);
+
+ return dst_page;
+}
+
+/**
+ * Readahead NAT pages
+ */
+static void ra_nat_pages(struct f2fs_sb_info *sbi, int nid)
+{
+ struct address_space *mapping = sbi->meta_inode->i_mapping;
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+ struct page *page;
+ pgoff_t index;
+ int i;
+
+ for (i = 0; i < FREE_NID_PAGES; i++, nid += NAT_ENTRY_PER_BLOCK) {
+ if (nid >= nm_i->max_nid)
+ nid = 0;
+ index = current_nat_addr(sbi, nid);
+
+ page = grab_cache_page(mapping, index);
+ if (!page)
+ continue;
+ if (f2fs_readpage(sbi, page, index, READ)) {
+ f2fs_put_page(page, 1);
+ continue;
+ }
+ page_cache_release(page);
+ }
+}
+
+static struct nat_entry *__lookup_nat_cache(struct f2fs_nm_info *nm_i, nid_t n)
+{
+ return radix_tree_lookup(&nm_i->nat_root, n);
+}
+
+static unsigned int __gang_lookup_nat_cache(struct f2fs_nm_info *nm_i,
+ nid_t start, unsigned int nr, struct nat_entry **ep)
+{
+ return radix_tree_gang_lookup(&nm_i->nat_root, (void **)ep, start, nr);
+}
+
+static void __del_from_nat_cache(struct f2fs_nm_info *nm_i, struct nat_entry *e)
+{
+ list_del(&e->list);
+ radix_tree_delete(&nm_i->nat_root, nat_get_nid(e));
+ nm_i->nat_cnt--;
+ kmem_cache_free(nat_entry_slab, e);
+}
+
+int is_checkpointed_node(struct f2fs_sb_info *sbi, nid_t nid)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+ struct nat_entry *e;
+ int is_cp = 1;
+
+ read_lock(&nm_i->nat_tree_lock);
+ e = __lookup_nat_cache(nm_i, nid);
+ if (e && !e->checkpointed)
+ is_cp = 0;
+ read_unlock(&nm_i->nat_tree_lock);
+ return is_cp;
+}
+
+static struct nat_entry *grab_nat_entry(struct f2fs_nm_info *nm_i, nid_t nid)
+{
+ struct nat_entry *new;
+
+ new = kmem_cache_alloc(nat_entry_slab, GFP_ATOMIC);
+ if (!new)
+ return NULL;
+ if (radix_tree_insert(&nm_i->nat_root, nid, new)) {
+ kmem_cache_free(nat_entry_slab, new);
+ return NULL;
+ }
+ memset(new, 0, sizeof(struct nat_entry));
+ nat_set_nid(new, nid);
+ list_add_tail(&new->list, &nm_i->nat_entries);
+ nm_i->nat_cnt++;
+ return new;
+}
+
+static void cache_nat_entry(struct f2fs_nm_info *nm_i, nid_t nid,
+ struct f2fs_nat_entry *ne)
+{
+ struct nat_entry *e;
+retry:
+ write_lock(&nm_i->nat_tree_lock);
+ e = __lookup_nat_cache(nm_i, nid);
+ if (!e) {
+ e = grab_nat_entry(nm_i, nid);
+ if (!e) {
+ write_unlock(&nm_i->nat_tree_lock);
+ goto retry;
+ }
+ nat_set_blkaddr(e, le32_to_cpu(ne->block_addr));
+ nat_set_ino(e, le32_to_cpu(ne->ino));
+ nat_set_version(e, ne->version);
+ e->checkpointed = true;
+ }
+ write_unlock(&nm_i->nat_tree_lock);
+}
+
+static void set_node_addr(struct f2fs_sb_info *sbi, struct node_info *ni,
+ block_t new_blkaddr)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+ struct nat_entry *e;
+retry:
+ write_lock(&nm_i->nat_tree_lock);
+ e = __lookup_nat_cache(nm_i, ni->nid);
+ if (!e) {
+ e = grab_nat_entry(nm_i, ni->nid);
+ if (!e) {
+ write_unlock(&nm_i->nat_tree_lock);
+ goto retry;
+ }
+ e->ni = *ni;
+ e->checkpointed = true;
+ BUG_ON(ni->blk_addr == NEW_ADDR);
+ } else if (new_blkaddr == NEW_ADDR) {
+ /*
+ * when nid is reallocated,
+ * previous nat entry can be remained in nat cache.
+ * So, reinitialize it with new information.
+ */
+ e->ni = *ni;
+ BUG_ON(ni->blk_addr != NULL_ADDR);
+ }
+
+ if (new_blkaddr == NEW_ADDR)
+ e->checkpointed = false;
+
+ /* sanity check */
+ BUG_ON(nat_get_blkaddr(e) != ni->blk_addr);
+ BUG_ON(nat_get_blkaddr(e) == NULL_ADDR &&
+ new_blkaddr == NULL_ADDR);
+ BUG_ON(nat_get_blkaddr(e) == NEW_ADDR &&
+ new_blkaddr == NEW_ADDR);
+ BUG_ON(nat_get_blkaddr(e) != NEW_ADDR &&
+ nat_get_blkaddr(e) != NULL_ADDR &&
+ new_blkaddr == NEW_ADDR);
+
+ /* increament version no as node is removed */
+ if (nat_get_blkaddr(e) != NEW_ADDR && new_blkaddr == NULL_ADDR) {
+ unsigned char version = nat_get_version(e);
+ nat_set_version(e, inc_node_version(version));
+ }
+
+ /* change address */
+ nat_set_blkaddr(e, new_blkaddr);
+ __set_nat_cache_dirty(nm_i, e);
+ write_unlock(&nm_i->nat_tree_lock);
+}
+
+static int try_to_free_nats(struct f2fs_sb_info *sbi, int nr_shrink)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+
+ if (nm_i->nat_cnt < 2 * NM_WOUT_THRESHOLD)
+ return 0;
+
+ write_lock(&nm_i->nat_tree_lock);
+ while (nr_shrink && !list_empty(&nm_i->nat_entries)) {
+ struct nat_entry *ne;
+ ne = list_first_entry(&nm_i->nat_entries,
+ struct nat_entry, list);
+ __del_from_nat_cache(nm_i, ne);
+ nr_shrink--;
+ }
+ write_unlock(&nm_i->nat_tree_lock);
+ return nr_shrink;
+}
+
+/**
+ * This function returns always success
+ */
+void get_node_info(struct f2fs_sb_info *sbi, nid_t nid, struct node_info *ni)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+ struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA);
+ struct f2fs_summary_block *sum = curseg->sum_blk;
+ nid_t start_nid = START_NID(nid);
+ struct f2fs_nat_block *nat_blk;
+ struct page *page = NULL;
+ struct f2fs_nat_entry ne;
+ struct nat_entry *e;
+ int i;
+
+ ni->nid = nid;
+
+ /* Check nat cache */
+ read_lock(&nm_i->nat_tree_lock);
+ e = __lookup_nat_cache(nm_i, nid);
+ if (e) {
+ ni->ino = nat_get_ino(e);
+ ni->blk_addr = nat_get_blkaddr(e);
+ ni->version = nat_get_version(e);
+ }
+ read_unlock(&nm_i->nat_tree_lock);
+ if (e)
+ return;
+
+ /* Check current segment summary */
+ mutex_lock(&curseg->curseg_mutex);
+ i = lookup_journal_in_cursum(sum, NAT_JOURNAL, nid, 0);
+ if (i >= 0) {
+ ne = nat_in_journal(sum, i);
+ node_info_from_raw_nat(ni, &ne);
+ }
+ mutex_unlock(&curseg->curseg_mutex);
+ if (i >= 0)
+ goto cache;
+
+ /* Fill node_info from nat page */
+ page = get_current_nat_page(sbi, start_nid);
+ nat_blk = (struct f2fs_nat_block *)page_address(page);
+ ne = nat_blk->entries[nid - start_nid];
+ node_info_from_raw_nat(ni, &ne);
+ f2fs_put_page(page, 1);
+cache:
+ /* cache nat entry */
+ cache_nat_entry(NM_I(sbi), nid, &ne);
+}
+
+/**
+ * The maximum depth is four.
+ * Offset[0] will have raw inode offset.
+ */
+static int get_node_path(long block, int offset[4], unsigned int noffset[4])
+{
+ const long direct_index = ADDRS_PER_INODE;
+ const long direct_blks = ADDRS_PER_BLOCK;
+ const long dptrs_per_blk = NIDS_PER_BLOCK;
+ const long indirect_blks = ADDRS_PER_BLOCK * NIDS_PER_BLOCK;
+ const long dindirect_blks = indirect_blks * NIDS_PER_BLOCK;
+ int n = 0;
+ int level = 0;
+
+ noffset[0] = 0;
+
+ if (block < direct_index) {
+ offset[n++] = block;
+ level = 0;
+ goto got;
+ }
+ block -= direct_index;
+ if (block < direct_blks) {
+ offset[n++] = NODE_DIR1_BLOCK;
+ noffset[n] = 1;
+ offset[n++] = block;
+ level = 1;
+ goto got;
+ }
+ block -= direct_blks;
+ if (block < direct_blks) {
+ offset[n++] = NODE_DIR2_BLOCK;
+ noffset[n] = 2;
+ offset[n++] = block;
+ level = 1;
+ goto got;
+ }
+ block -= direct_blks;
+ if (block < indirect_blks) {
+ offset[n++] = NODE_IND1_BLOCK;
+ noffset[n] = 3;
+ offset[n++] = block / direct_blks;
+ noffset[n] = 4 + offset[n - 1];
+ offset[n++] = block % direct_blks;
+ level = 2;
+ goto got;
+ }
+ block -= indirect_blks;
+ if (block < indirect_blks) {
+ offset[n++] = NODE_IND2_BLOCK;
+ noffset[n] = 4 + dptrs_per_blk;
+ offset[n++] = block / direct_blks;
+ noffset[n] = 5 + dptrs_per_blk + offset[n - 1];
+ offset[n++] = block % direct_blks;
+ level = 2;
+ goto got;
+ }
+ block -= indirect_blks;
+ if (block < dindirect_blks) {
+ offset[n++] = NODE_DIND_BLOCK;
+ noffset[n] = 5 + (dptrs_per_blk * 2);
+ offset[n++] = block / indirect_blks;
+ noffset[n] = 6 + (dptrs_per_blk * 2) +
+ offset[n - 1] * (dptrs_per_blk + 1);
+ offset[n++] = (block / direct_blks) % dptrs_per_blk;
+ noffset[n] = 7 + (dptrs_per_blk * 2) +
+ offset[n - 2] * (dptrs_per_blk + 1) +
+ offset[n - 1];
+ offset[n++] = block % direct_blks;
+ level = 3;
+ goto got;
+ } else {
+ BUG();
+ }
+got:
+ return level;
+}
+
+/*
+ * Caller should call f2fs_put_dnode(dn).
+ */
+int get_dnode_of_data(struct dnode_of_data *dn, pgoff_t index, int ro)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb);
+ struct page *npage[4];
+ struct page *parent;
+ int offset[4];
+ unsigned int noffset[4];
+ nid_t nids[4];
+ int level, i;
+ int err = 0;
+
+ level = get_node_path(index, offset, noffset);
+
+ nids[0] = dn->inode->i_ino;
+ npage[0] = get_node_page(sbi, nids[0]);
+ if (IS_ERR(npage[0]))
+ return PTR_ERR(npage[0]);
+
+ parent = npage[0];
+ nids[1] = get_nid(parent, offset[0], true);
+ dn->inode_page = npage[0];
+ dn->inode_page_locked = true;
+
+ /* get indirect or direct nodes */
+ for (i = 1; i <= level; i++) {
+ bool done = false;
+
+ if (!nids[i] && !ro) {
+ mutex_lock_op(sbi, NODE_NEW);
+
+ /* alloc new node */
+ if (!alloc_nid(sbi, &(nids[i]))) {
+ mutex_unlock_op(sbi, NODE_NEW);
+ err = -ENOSPC;
+ goto release_pages;
+ }
+
+ dn->nid = nids[i];
+ npage[i] = new_node_page(dn, noffset[i]);
+ if (IS_ERR(npage[i])) {
+ alloc_nid_failed(sbi, nids[i]);
+ mutex_unlock_op(sbi, NODE_NEW);
+ err = PTR_ERR(npage[i]);
+ goto release_pages;
+ }
+
+ set_nid(parent, offset[i - 1], nids[i], i == 1);
+ alloc_nid_done(sbi, nids[i]);
+ mutex_unlock_op(sbi, NODE_NEW);
+ done = true;
+ } else if (ro && i == level && level > 1) {
+ npage[i] = get_node_page_ra(parent, offset[i - 1]);
+ if (IS_ERR(npage[i])) {
+ err = PTR_ERR(npage[i]);
+ goto release_pages;
+ }
+ done = true;
+ }
+ if (i == 1) {
+ dn->inode_page_locked = false;
+ unlock_page(parent);
+ } else {
+ f2fs_put_page(parent, 1);
+ }
+
+ if (!done) {
+ npage[i] = get_node_page(sbi, nids[i]);
+ if (IS_ERR(npage[i])) {
+ err = PTR_ERR(npage[i]);
+ f2fs_put_page(npage[0], 0);
+ goto release_out;
+ }
+ }
+ if (i < level) {
+ parent = npage[i];
+ nids[i + 1] = get_nid(parent, offset[i], false);
+ }
+ }
+ dn->nid = nids[level];
+ dn->ofs_in_node = offset[level];
+ dn->node_page = npage[level];
+ dn->data_blkaddr = datablock_addr(dn->node_page, dn->ofs_in_node);
+ return 0;
+
+release_pages:
+ f2fs_put_page(parent, 1);
+ if (i > 1)
+ f2fs_put_page(npage[0], 0);
+release_out:
+ dn->inode_page = NULL;
+ dn->node_page = NULL;
+ return err;
+}
+
+static void truncate_node(struct dnode_of_data *dn)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb);
+ struct node_info ni;
+
+ get_node_info(sbi, dn->nid, &ni);
+ BUG_ON(ni.blk_addr == NULL_ADDR);
+
+ if (ni.blk_addr != NULL_ADDR)
+ invalidate_blocks(sbi, ni.blk_addr);
+
+ /* Deallocate node address */
+ dec_valid_node_count(sbi, dn->inode, 1);
+ set_node_addr(sbi, &ni, NULL_ADDR);
+
+ if (dn->nid == dn->inode->i_ino) {
+ remove_orphan_inode(sbi, dn->nid);
+ dec_valid_inode_count(sbi);
+ } else {
+ sync_inode_page(dn);
+ }
+
+ clear_node_page_dirty(dn->node_page);
+ F2FS_SET_SB_DIRT(sbi);
+
+ f2fs_put_page(dn->node_page, 1);
+ dn->node_page = NULL;
+}
+
+static int truncate_dnode(struct dnode_of_data *dn)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb);
+ struct page *page;
+
+ if (dn->nid == 0)
+ return 1;
+
+ /* get direct node */
+ page = get_node_page(sbi, dn->nid);
+ if (IS_ERR(page) && PTR_ERR(page) == -ENOENT)
+ return 1;
+ else if (IS_ERR(page))
+ return PTR_ERR(page);
+
+ /* Make dnode_of_data for parameter */
+ dn->node_page = page;
+ dn->ofs_in_node = 0;
+ truncate_data_blocks(dn);
+ truncate_node(dn);
+ return 1;
+}
+
+static int truncate_nodes(struct dnode_of_data *dn, unsigned int nofs,
+ int ofs, int depth)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb);
+ struct dnode_of_data rdn = *dn;
+ struct page *page;
+ struct f2fs_node *rn;
+ nid_t child_nid;
+ unsigned int child_nofs;
+ int freed = 0;
+ int i, ret;
+
+ if (dn->nid == 0)
+ return NIDS_PER_BLOCK + 1;
+
+ page = get_node_page(sbi, dn->nid);
+ if (IS_ERR(page))
+ return PTR_ERR(page);
+
+ rn = (struct f2fs_node *)page_address(page);
+ if (depth < 3) {
+ for (i = ofs; i < NIDS_PER_BLOCK; i++, freed++) {
+ child_nid = le32_to_cpu(rn->in.nid[i]);
+ if (child_nid == 0)
+ continue;
+ rdn.nid = child_nid;
+ ret = truncate_dnode(&rdn);
+ if (ret < 0)
+ goto out_err;
+ set_nid(page, i, 0, false);
+ }
+ } else {
+ child_nofs = nofs + ofs * (NIDS_PER_BLOCK + 1) + 1;
+ for (i = ofs; i < NIDS_PER_BLOCK; i++) {
+ child_nid = le32_to_cpu(rn->in.nid[i]);
+ if (child_nid == 0) {
+ child_nofs += NIDS_PER_BLOCK + 1;
+ continue;
+ }
+ rdn.nid = child_nid;
+ ret = truncate_nodes(&rdn, child_nofs, 0, depth - 1);
+ if (ret == (NIDS_PER_BLOCK + 1)) {
+ set_nid(page, i, 0, false);
+ child_nofs += ret;
+ } else if (ret < 0 && ret != -ENOENT) {
+ goto out_err;
+ }
+ }
+ freed = child_nofs;
+ }
+
+ if (!ofs) {
+ /* remove current indirect node */
+ dn->node_page = page;
+ truncate_node(dn);
+ freed++;
+ } else {
+ f2fs_put_page(page, 1);
+ }
+ return freed;
+
+out_err:
+ f2fs_put_page(page, 1);
+ return ret;
+}
+
+static int truncate_partial_nodes(struct dnode_of_data *dn,
+ struct f2fs_inode *ri, int *offset, int depth)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb);
+ struct page *pages[2];
+ nid_t nid[3];
+ nid_t child_nid;
+ int err = 0;
+ int i;
+ int idx = depth - 2;
+
+ nid[0] = le32_to_cpu(ri->i_nid[offset[0] - NODE_DIR1_BLOCK]);
+ if (!nid[0])
+ return 0;
+
+ /* get indirect nodes in the path */
+ for (i = 0; i < depth - 1; i++) {
+ /* refernece count'll be increased */
+ pages[i] = get_node_page(sbi, nid[i]);
+ if (IS_ERR(pages[i])) {
+ depth = i + 1;
+ err = PTR_ERR(pages[i]);
+ goto fail;
+ }
+ nid[i + 1] = get_nid(pages[i], offset[i + 1], false);
+ }
+
+ /* free direct nodes linked to a partial indirect node */
+ for (i = offset[depth - 1]; i < NIDS_PER_BLOCK; i++) {
+ child_nid = get_nid(pages[idx], i, false);
+ if (!child_nid)
+ continue;
+ dn->nid = child_nid;
+ err = truncate_dnode(dn);
+ if (err < 0)
+ goto fail;
+ set_nid(pages[idx], i, 0, false);
+ }
+
+ if (offset[depth - 1] == 0) {
+ dn->node_page = pages[idx];
+ dn->nid = nid[idx];
+ truncate_node(dn);
+ } else {
+ f2fs_put_page(pages[idx], 1);
+ }
+ offset[idx]++;
+ offset[depth - 1] = 0;
+fail:
+ for (i = depth - 3; i >= 0; i--)
+ f2fs_put_page(pages[i], 1);
+ return err;
+}
+
+/**
+ * All the block addresses of data and nodes should be nullified.
+ */
+int truncate_inode_blocks(struct inode *inode, pgoff_t from)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ int err = 0, cont = 1;
+ int level, offset[4], noffset[4];
+ unsigned int nofs;
+ struct f2fs_node *rn;
+ struct dnode_of_data dn;
+ struct page *page;
+
+ level = get_node_path(from, offset, noffset);
+
+ page = get_node_page(sbi, inode->i_ino);
+ if (IS_ERR(page))
+ return PTR_ERR(page);
+
+ set_new_dnode(&dn, inode, page, NULL, 0);
+ unlock_page(page);
+
+ rn = page_address(page);
+ switch (level) {
+ case 0:
+ case 1:
+ nofs = noffset[1];
+ break;
+ case 2:
+ nofs = noffset[1];
+ if (!offset[level - 1])
+ goto skip_partial;
+ err = truncate_partial_nodes(&dn, &rn->i, offset, level);
+ if (err < 0 && err != -ENOENT)
+ goto fail;
+ nofs += 1 + NIDS_PER_BLOCK;
+ break;
+ case 3:
+ nofs = 5 + 2 * NIDS_PER_BLOCK;
+ if (!offset[level - 1])
+ goto skip_partial;
+ err = truncate_partial_nodes(&dn, &rn->i, offset, level);
+ if (err < 0 && err != -ENOENT)
+ goto fail;
+ break;
+ default:
+ BUG();
+ }
+
+skip_partial:
+ while (cont) {
+ dn.nid = le32_to_cpu(rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK]);
+ switch (offset[0]) {
+ case NODE_DIR1_BLOCK:
+ case NODE_DIR2_BLOCK:
+ err = truncate_dnode(&dn);
+ break;
+
+ case NODE_IND1_BLOCK:
+ case NODE_IND2_BLOCK:
+ err = truncate_nodes(&dn, nofs, offset[1], 2);
+ break;
+
+ case NODE_DIND_BLOCK:
+ err = truncate_nodes(&dn, nofs, offset[1], 3);
+ cont = 0;
+ break;
+
+ default:
+ BUG();
+ }
+ if (err < 0 && err != -ENOENT)
+ goto fail;
+ if (offset[1] == 0 &&
+ rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK]) {
+ lock_page(page);
+ wait_on_page_writeback(page);
+ rn->i.i_nid[offset[0] - NODE_DIR1_BLOCK] = 0;
+ set_page_dirty(page);
+ unlock_page(page);
+ }
+ offset[1] = 0;
+ offset[0]++;
+ nofs += err;
+ }
+fail:
+ f2fs_put_page(page, 0);
+ return err > 0 ? 0 : err;
+}
+
+int remove_inode_page(struct inode *inode)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ struct page *page;
+ nid_t ino = inode->i_ino;
+ struct dnode_of_data dn;
+
+ mutex_lock_op(sbi, NODE_TRUNC);
+ page = get_node_page(sbi, ino);
+ if (IS_ERR(page)) {
+ mutex_unlock_op(sbi, NODE_TRUNC);
+ return PTR_ERR(page);
+ }
+
+ if (F2FS_I(inode)->i_xattr_nid) {
+ nid_t nid = F2FS_I(inode)->i_xattr_nid;
+ struct page *npage = get_node_page(sbi, nid);
+
+ if (IS_ERR(npage)) {
+ mutex_unlock_op(sbi, NODE_TRUNC);
+ return PTR_ERR(npage);
+ }
+
+ F2FS_I(inode)->i_xattr_nid = 0;
+ set_new_dnode(&dn, inode, page, npage, nid);
+ dn.inode_page_locked = 1;
+ truncate_node(&dn);
+ }
+ if (inode->i_blocks == 1) {
+ /* inernally call f2fs_put_page() */
+ set_new_dnode(&dn, inode, page, page, ino);
+ truncate_node(&dn);
+ } else if (inode->i_blocks == 0) {
+ struct node_info ni;
+ get_node_info(sbi, inode->i_ino, &ni);
+
+ /* called after f2fs_new_inode() is failed */
+ BUG_ON(ni.blk_addr != NULL_ADDR);
+ f2fs_put_page(page, 1);
+ } else {
+ BUG();
+ }
+ mutex_unlock_op(sbi, NODE_TRUNC);
+ return 0;
+}
+
+int new_inode_page(struct inode *inode, struct dentry *dentry)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ struct page *page;
+ struct dnode_of_data dn;
+
+ /* allocate inode page for new inode */
+ set_new_dnode(&dn, inode, NULL, NULL, inode->i_ino);
+ mutex_lock_op(sbi, NODE_NEW);
+ page = new_node_page(&dn, 0);
+ init_dent_inode(dentry, page);
+ mutex_unlock_op(sbi, NODE_NEW);
+ if (IS_ERR(page))
+ return PTR_ERR(page);
+ f2fs_put_page(page, 1);
+ return 0;
+}
+
+struct page *new_node_page(struct dnode_of_data *dn, unsigned int ofs)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb);
+ struct address_space *mapping = sbi->node_inode->i_mapping;
+ struct node_info old_ni, new_ni;
+ struct page *page;
+ int err;
+
+ if (is_inode_flag_set(F2FS_I(dn->inode), FI_NO_ALLOC))
+ return ERR_PTR(-EPERM);
+
+ page = grab_cache_page(mapping, dn->nid);
+ if (!page)
+ return ERR_PTR(-ENOMEM);
+
+ get_node_info(sbi, dn->nid, &old_ni);
+
+ SetPageUptodate(page);
+ fill_node_footer(page, dn->nid, dn->inode->i_ino, ofs, true);
+
+ /* Reinitialize old_ni with new node page */
+ BUG_ON(old_ni.blk_addr != NULL_ADDR);
+ new_ni = old_ni;
+ new_ni.ino = dn->inode->i_ino;
+
+ if (!inc_valid_node_count(sbi, dn->inode, 1)) {
+ err = -ENOSPC;
+ goto fail;
+ }
+ set_node_addr(sbi, &new_ni, NEW_ADDR);
+
+ dn->node_page = page;
+ sync_inode_page(dn);
+ set_page_dirty(page);
+ set_cold_node(dn->inode, page);
+ if (ofs == 0)
+ inc_valid_inode_count(sbi);
+
+ return page;
+
+fail:
+ f2fs_put_page(page, 1);
+ return ERR_PTR(err);
+}
+
+static int read_node_page(struct page *page, int type)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(page->mapping->host->i_sb);
+ struct node_info ni;
+
+ get_node_info(sbi, page->index, &ni);
+
+ if (ni.blk_addr == NULL_ADDR)
+ return -ENOENT;
+ return f2fs_readpage(sbi, page, ni.blk_addr, type);
+}
+
+/**
+ * Readahead a node page
+ */
+void ra_node_page(struct f2fs_sb_info *sbi, nid_t nid)
+{
+ struct address_space *mapping = sbi->node_inode->i_mapping;
+ struct page *apage;
+
+ apage = find_get_page(mapping, nid);
+ if (apage && PageUptodate(apage))
+ goto release_out;
+ f2fs_put_page(apage, 0);
+
+ apage = grab_cache_page(mapping, nid);
+ if (!apage)
+ return;
+
+ if (read_node_page(apage, READA))
+ goto unlock_out;
+
+ page_cache_release(apage);
+ return;
+
+unlock_out:
+ unlock_page(apage);
+release_out:
+ page_cache_release(apage);
+}
+
+struct page *get_node_page(struct f2fs_sb_info *sbi, pgoff_t nid)
+{
+ int err;
+ struct page *page;
+ struct address_space *mapping = sbi->node_inode->i_mapping;
+
+ page = grab_cache_page(mapping, nid);
+ if (!page)
+ return ERR_PTR(-ENOMEM);
+
+ err = read_node_page(page, READ_SYNC);
+ if (err) {
+ f2fs_put_page(page, 1);
+ return ERR_PTR(err);
+ }
+
+ BUG_ON(nid != nid_of_node(page));
+ mark_page_accessed(page);
+ return page;
+}
+
+/**
+ * Return a locked page for the desired node page.
+ * And, readahead MAX_RA_NODE number of node pages.
+ */
+struct page *get_node_page_ra(struct page *parent, int start)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(parent->mapping->host->i_sb);
+ struct address_space *mapping = sbi->node_inode->i_mapping;
+ int i, end;
+ int err = 0;
+ nid_t nid;
+ struct page *page;
+
+ /* First, try getting the desired direct node. */
+ nid = get_nid(parent, start, false);
+ if (!nid)
+ return ERR_PTR(-ENOENT);
+
+ page = find_get_page(mapping, nid);
+ if (page && PageUptodate(page))
+ goto page_hit;
+ f2fs_put_page(page, 0);
+
+repeat:
+ page = grab_cache_page(mapping, nid);
+ if (!page)
+ return ERR_PTR(-ENOMEM);
+
+ err = read_node_page(page, READA);
+ if (err) {
+ f2fs_put_page(page, 1);
+ return ERR_PTR(err);
+ }
+
+ /* Then, try readahead for siblings of the desired node */
+ end = start + MAX_RA_NODE;
+ end = min(end, NIDS_PER_BLOCK);
+ for (i = start + 1; i < end; i++) {
+ nid = get_nid(parent, i, false);
+ if (!nid)
+ continue;
+ ra_node_page(sbi, nid);
+ }
+
+page_hit:
+ lock_page(page);
+ if (PageError(page)) {
+ f2fs_put_page(page, 1);
+ return ERR_PTR(-EIO);
+ }
+
+ /* Has the page been truncated? */
+ if (page->mapping != mapping) {
+ f2fs_put_page(page, 1);
+ goto repeat;
+ }
+ return page;
+}
+
+void sync_inode_page(struct dnode_of_data *dn)
+{
+ if (IS_INODE(dn->node_page) || dn->inode_page == dn->node_page) {
+ update_inode(dn->inode, dn->node_page);
+ } else if (dn->inode_page) {
+ if (!dn->inode_page_locked)
+ lock_page(dn->inode_page);
+ update_inode(dn->inode, dn->inode_page);
+ if (!dn->inode_page_locked)
+ unlock_page(dn->inode_page);
+ } else {
+ f2fs_write_inode(dn->inode, NULL);
+ }
+}
+
+int sync_node_pages(struct f2fs_sb_info *sbi, nid_t ino,
+ struct writeback_control *wbc)
+{
+ struct address_space *mapping = sbi->node_inode->i_mapping;
+ pgoff_t index, end;
+ struct pagevec pvec;
+ int step = ino ? 2 : 0;
+ int nwritten = 0, wrote = 0;
+
+ pagevec_init(&pvec, 0);
+
+next_step:
+ index = 0;
+ end = LONG_MAX;
+
+ while (index <= end) {
+ int i, nr_pages;
+ nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+ PAGECACHE_TAG_DIRTY,
+ min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
+ if (nr_pages == 0)
+ break;
+
+ for (i = 0; i < nr_pages; i++) {
+ struct page *page = pvec.pages[i];
+
+ /*
+ * flushing sequence with step:
+ * 0. indirect nodes
+ * 1. dentry dnodes
+ * 2. file dnodes
+ */
+ if (step == 0 && IS_DNODE(page))
+ continue;
+ if (step == 1 && (!IS_DNODE(page) ||
+ is_cold_node(page)))
+ continue;
+ if (step == 2 && (!IS_DNODE(page) ||
+ !is_cold_node(page)))
+ continue;
+
+ /*
+ * If an fsync mode,
+ * we should not skip writing node pages.
+ */
+ if (ino && ino_of_node(page) == ino)
+ lock_page(page);
+ else if (!trylock_page(page))
+ continue;
+
+ if (unlikely(page->mapping != mapping)) {
+continue_unlock:
+ unlock_page(page);
+ continue;
+ }
+ if (ino && ino_of_node(page) != ino)
+ goto continue_unlock;
+
+ if (!PageDirty(page)) {
+ /* someone wrote it for us */
+ goto continue_unlock;
+ }
+
+ if (!clear_page_dirty_for_io(page))
+ goto continue_unlock;
+
+ /* called by fsync() */
+ if (ino && IS_DNODE(page)) {
+ int mark = !is_checkpointed_node(sbi, ino);
+ set_fsync_mark(page, 1);
+ if (IS_INODE(page))
+ set_dentry_mark(page, mark);
+ nwritten++;
+ } else {
+ set_fsync_mark(page, 0);
+ set_dentry_mark(page, 0);
+ }
+ mapping->a_ops->writepage(page, wbc);
+ wrote++;
+
+ if (--wbc->nr_to_write == 0)
+ break;
+ }
+ pagevec_release(&pvec);
+ cond_resched();
+
+ if (wbc->nr_to_write == 0) {
+ step = 2;
+ break;
+ }
+ }
+
+ if (step < 2) {
+ step++;
+ goto next_step;
+ }
+
+ if (wrote)
+ f2fs_submit_bio(sbi, NODE, wbc->sync_mode == WB_SYNC_ALL);
+
+ return nwritten;
+}
+
+static int f2fs_write_node_page(struct page *page,
+ struct writeback_control *wbc)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(page->mapping->host->i_sb);
+ nid_t nid;
+ unsigned int nofs;
+ block_t new_addr;
+ struct node_info ni;
+
+ if (wbc->for_reclaim) {
+ dec_page_count(sbi, F2FS_DIRTY_NODES);
+ wbc->pages_skipped++;
+ set_page_dirty(page);
+ return AOP_WRITEPAGE_ACTIVATE;
+ }
+
+ wait_on_page_writeback(page);
+
+ mutex_lock_op(sbi, NODE_WRITE);
+
+ /* get old block addr of this node page */
+ nid = nid_of_node(page);
+ nofs = ofs_of_node(page);
+ BUG_ON(page->index != nid);
+
+ get_node_info(sbi, nid, &ni);
+
+ /* This page is already truncated */
+ if (ni.blk_addr == NULL_ADDR)
+ return 0;
+
+ set_page_writeback(page);
+
+ /* insert node offset */
+ write_node_page(sbi, page, nid, ni.blk_addr, &new_addr);
+ set_node_addr(sbi, &ni, new_addr);
+ dec_page_count(sbi, F2FS_DIRTY_NODES);
+
+ mutex_unlock_op(sbi, NODE_WRITE);
+ unlock_page(page);
+ return 0;
+}
+
+static int f2fs_write_node_pages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb);
+ struct block_device *bdev = sbi->sb->s_bdev;
+ long nr_to_write = wbc->nr_to_write;
+
+ if (wbc->for_kupdate)
+ return 0;
+
+ if (get_pages(sbi, F2FS_DIRTY_NODES) == 0)
+ return 0;
+
+ if (try_to_free_nats(sbi, NAT_ENTRY_PER_BLOCK)) {
+ write_checkpoint(sbi, false, false);
+ return 0;
+ }
+
+ /* if mounting is failed, skip writing node pages */
+ wbc->nr_to_write = bio_get_nr_vecs(bdev);
+ sync_node_pages(sbi, 0, wbc);
+ wbc->nr_to_write = nr_to_write -
+ (bio_get_nr_vecs(bdev) - wbc->nr_to_write);
+ return 0;
+}
+
+static int f2fs_set_node_page_dirty(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct f2fs_sb_info *sbi = F2FS_SB(mapping->host->i_sb);
+
+ SetPageUptodate(page);
+ if (!PageDirty(page)) {
+ __set_page_dirty_nobuffers(page);
+ inc_page_count(sbi, F2FS_DIRTY_NODES);
+ SetPagePrivate(page);
+ return 1;
+ }
+ return 0;
+}
+
+static void f2fs_invalidate_node_page(struct page *page, unsigned long offset)
+{
+ struct inode *inode = page->mapping->host;
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ if (PageDirty(page))
+ dec_page_count(sbi, F2FS_DIRTY_NODES);
+ ClearPagePrivate(page);
+}
+
+static int f2fs_release_node_page(struct page *page, gfp_t wait)
+{
+ ClearPagePrivate(page);
+ return 0;
+}
+
+/**
+ * Structure of the f2fs node operations
+ */
+const struct address_space_operations f2fs_node_aops = {
+ .writepage = f2fs_write_node_page,
+ .writepages = f2fs_write_node_pages,
+ .set_page_dirty = f2fs_set_node_page_dirty,
+ .invalidatepage = f2fs_invalidate_node_page,
+ .releasepage = f2fs_release_node_page,
+};
+
+static struct free_nid *__lookup_free_nid_list(nid_t n, struct list_head *head)
+{
+ struct list_head *this;
+ struct free_nid *i = NULL;
+ list_for_each(this, head) {
+ i = list_entry(this, struct free_nid, list);
+ if (i->nid == n)
+ break;
+ i = NULL;
+ }
+ return i;
+}
+
+static void __del_from_free_nid_list(struct free_nid *i)
+{
+ list_del(&i->list);
+ kmem_cache_free(free_nid_slab, i);
+}
+
+static int add_free_nid(struct f2fs_nm_info *nm_i, nid_t nid)
+{
+ struct free_nid *i;
+
+ if (nm_i->fcnt > 2 * MAX_FREE_NIDS)
+ return 0;
+retry:
+ i = kmem_cache_alloc(free_nid_slab, GFP_NOFS);
+ if (!i) {
+ cond_resched();
+ goto retry;
+ }
+ i->nid = nid;
+ i->state = NID_NEW;
+
+ spin_lock(&nm_i->free_nid_list_lock);
+ if (__lookup_free_nid_list(nid, &nm_i->free_nid_list)) {
+ spin_unlock(&nm_i->free_nid_list_lock);
+ kmem_cache_free(free_nid_slab, i);
+ return 0;
+ }
+ list_add_tail(&i->list, &nm_i->free_nid_list);
+ nm_i->fcnt++;
+ spin_unlock(&nm_i->free_nid_list_lock);
+ return 1;
+}
+
+static void remove_free_nid(struct f2fs_nm_info *nm_i, nid_t nid)
+{
+ struct free_nid *i;
+ spin_lock(&nm_i->free_nid_list_lock);
+ i = __lookup_free_nid_list(nid, &nm_i->free_nid_list);
+ if (i && i->state == NID_NEW) {
+ __del_from_free_nid_list(i);
+ nm_i->fcnt--;
+ }
+ spin_unlock(&nm_i->free_nid_list_lock);
+}
+
+static int scan_nat_page(struct f2fs_nm_info *nm_i,
+ struct page *nat_page, nid_t start_nid)
+{
+ struct f2fs_nat_block *nat_blk = page_address(nat_page);
+ block_t blk_addr;
+ int fcnt = 0;
+ int i;
+
+ /* 0 nid should not be used */
+ if (start_nid == 0)
+ ++start_nid;
+
+ i = start_nid % NAT_ENTRY_PER_BLOCK;
+
+ for (; i < NAT_ENTRY_PER_BLOCK; i++, start_nid++) {
+ blk_addr = le32_to_cpu(nat_blk->entries[i].block_addr);
+ BUG_ON(blk_addr == NEW_ADDR);
+ if (blk_addr == NULL_ADDR)
+ fcnt += add_free_nid(nm_i, start_nid);
+ }
+ return fcnt;
+}
+
+static void build_free_nids(struct f2fs_sb_info *sbi)
+{
+ struct free_nid *fnid, *next_fnid;
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+ struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA);
+ struct f2fs_summary_block *sum = curseg->sum_blk;
+ nid_t nid = 0;
+ bool is_cycled = false;
+ int fcnt = 0;
+ int i;
+
+ nid = nm_i->next_scan_nid;
+ nm_i->init_scan_nid = nid;
+
+ ra_nat_pages(sbi, nid);
+
+ while (1) {
+ struct page *page = get_current_nat_page(sbi, nid);
+
+ fcnt += scan_nat_page(nm_i, page, nid);
+ f2fs_put_page(page, 1);
+
+ nid += (NAT_ENTRY_PER_BLOCK - (nid % NAT_ENTRY_PER_BLOCK));
+
+ if (nid >= nm_i->max_nid) {
+ nid = 0;
+ is_cycled = true;
+ }
+ if (fcnt > MAX_FREE_NIDS)
+ break;
+ if (is_cycled && nm_i->init_scan_nid <= nid)
+ break;
+ }
+
+ nm_i->next_scan_nid = nid;
+
+ /* find free nids from current sum_pages */
+ mutex_lock(&curseg->curseg_mutex);
+ for (i = 0; i < nats_in_cursum(sum); i++) {
+ block_t addr = le32_to_cpu(nat_in_journal(sum, i).block_addr);
+ nid = le32_to_cpu(nid_in_journal(sum, i));
+ if (addr == NULL_ADDR)
+ add_free_nid(nm_i, nid);
+ else
+ remove_free_nid(nm_i, nid);
+ }
+ mutex_unlock(&curseg->curseg_mutex);
+
+ /* remove the free nids from current allocated nids */
+ list_for_each_entry_safe(fnid, next_fnid, &nm_i->free_nid_list, list) {
+ struct nat_entry *ne;
+
+ read_lock(&nm_i->nat_tree_lock);
+ ne = __lookup_nat_cache(nm_i, fnid->nid);
+ if (ne && nat_get_blkaddr(ne) != NULL_ADDR)
+ remove_free_nid(nm_i, fnid->nid);
+ read_unlock(&nm_i->nat_tree_lock);
+ }
+}
+
+/*
+ * If this function returns success, caller can obtain a new nid
+ * from second parameter of this function.
+ * The returned nid could be used ino as well as nid when inode is created.
+ */
+bool alloc_nid(struct f2fs_sb_info *sbi, nid_t *nid)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+ struct free_nid *i = NULL;
+ struct list_head *this;
+retry:
+ mutex_lock(&nm_i->build_lock);
+ if (!nm_i->fcnt) {
+ /* scan NAT in order to build free nid list */
+ build_free_nids(sbi);
+ if (!nm_i->fcnt) {
+ mutex_unlock(&nm_i->build_lock);
+ return false;
+ }
+ }
+ mutex_unlock(&nm_i->build_lock);
+
+ /*
+ * We check fcnt again since previous check is racy as
+ * we didn't hold free_nid_list_lock. So other thread
+ * could consume all of free nids.
+ */
+ spin_lock(&nm_i->free_nid_list_lock);
+ if (!nm_i->fcnt) {
+ spin_unlock(&nm_i->free_nid_list_lock);
+ goto retry;
+ }
+
+ BUG_ON(list_empty(&nm_i->free_nid_list));
+ list_for_each(this, &nm_i->free_nid_list) {
+ i = list_entry(this, struct free_nid, list);
+ if (i->state == NID_NEW)
+ break;
+ }
+
+ BUG_ON(i->state != NID_NEW);
+ *nid = i->nid;
+ i->state = NID_ALLOC;
+ nm_i->fcnt--;
+ spin_unlock(&nm_i->free_nid_list_lock);
+ return true;
+}
+
+/**
+ * alloc_nid() should be called prior to this function.
+ */
+void alloc_nid_done(struct f2fs_sb_info *sbi, nid_t nid)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+ struct free_nid *i;
+
+ spin_lock(&nm_i->free_nid_list_lock);
+ i = __lookup_free_nid_list(nid, &nm_i->free_nid_list);
+ if (i) {
+ BUG_ON(i->state != NID_ALLOC);
+ __del_from_free_nid_list(i);
+ }
+ spin_unlock(&nm_i->free_nid_list_lock);
+}
+
+/**
+ * alloc_nid() should be called prior to this function.
+ */
+void alloc_nid_failed(struct f2fs_sb_info *sbi, nid_t nid)
+{
+ alloc_nid_done(sbi, nid);
+ add_free_nid(NM_I(sbi), nid);
+}
+
+void recover_node_page(struct f2fs_sb_info *sbi, struct page *page,
+ struct f2fs_summary *sum, struct node_info *ni,
+ block_t new_blkaddr)
+{
+ rewrite_node_page(sbi, page, sum, ni->blk_addr, new_blkaddr);
+ set_node_addr(sbi, ni, new_blkaddr);
+ clear_node_page_dirty(page);
+}
+
+int recover_inode_page(struct f2fs_sb_info *sbi, struct page *page)
+{
+ struct address_space *mapping = sbi->node_inode->i_mapping;
+ struct f2fs_node *src, *dst;
+ nid_t ino = ino_of_node(page);
+ struct node_info old_ni, new_ni;
+ struct page *ipage;
+
+ ipage = grab_cache_page(mapping, ino);
+ if (!ipage)
+ return -ENOMEM;
+
+ /* Should not use this inode from free nid list */
+ remove_free_nid(NM_I(sbi), ino);
+
+ get_node_info(sbi, ino, &old_ni);
+ SetPageUptodate(ipage);
+ fill_node_footer(ipage, ino, ino, 0, true);
+
+ src = (struct f2fs_node *)page_address(page);
+ dst = (struct f2fs_node *)page_address(ipage);
+
+ memcpy(dst, src, (unsigned long)&src->i.i_ext - (unsigned long)&src->i);
+ dst->i.i_size = 0;
+ dst->i.i_blocks = 1;
+ dst->i.i_links = 1;
+ dst->i.i_xattr_nid = 0;
+
+ new_ni = old_ni;
+ new_ni.ino = ino;
+
+ set_node_addr(sbi, &new_ni, NEW_ADDR);
+ inc_valid_inode_count(sbi);
+
+ f2fs_put_page(ipage, 1);
+ return 0;
+}
+
+int restore_node_summary(struct f2fs_sb_info *sbi,
+ unsigned int segno, struct f2fs_summary_block *sum)
+{
+ struct f2fs_node *rn;
+ struct f2fs_summary *sum_entry;
+ struct page *page;
+ block_t addr;
+ int i, last_offset;
+
+ /* alloc temporal page for read node */
+ page = alloc_page(GFP_NOFS | __GFP_ZERO);
+ if (IS_ERR(page))
+ return PTR_ERR(page);
+ lock_page(page);
+
+ /* scan the node segment */
+ last_offset = sbi->blocks_per_seg;
+ addr = START_BLOCK(sbi, segno);
+ sum_entry = &sum->entries[0];
+
+ for (i = 0; i < last_offset; i++, sum_entry++) {
+ if (f2fs_readpage(sbi, page, addr, READ_SYNC))
+ goto out;
+
+ rn = (struct f2fs_node *)page_address(page);
+ sum_entry->nid = rn->footer.nid;
+ sum_entry->version = 0;
+ sum_entry->ofs_in_node = 0;
+ addr++;
+
+ /*
+ * In order to read next node page,
+ * we must clear PageUptodate flag.
+ */
+ ClearPageUptodate(page);
+ }
+out:
+ unlock_page(page);
+ __free_pages(page, 0);
+ return 0;
+}
+
+static bool flush_nats_in_journal(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+ struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA);
+ struct f2fs_summary_block *sum = curseg->sum_blk;
+ int i;
+
+ mutex_lock(&curseg->curseg_mutex);
+
+ if (nats_in_cursum(sum) < NAT_JOURNAL_ENTRIES) {
+ mutex_unlock(&curseg->curseg_mutex);
+ return false;
+ }
+
+ for (i = 0; i < nats_in_cursum(sum); i++) {
+ struct nat_entry *ne;
+ struct f2fs_nat_entry raw_ne;
+ nid_t nid = le32_to_cpu(nid_in_journal(sum, i));
+
+ raw_ne = nat_in_journal(sum, i);
+retry:
+ write_lock(&nm_i->nat_tree_lock);
+ ne = __lookup_nat_cache(nm_i, nid);
+ if (ne) {
+ __set_nat_cache_dirty(nm_i, ne);
+ write_unlock(&nm_i->nat_tree_lock);
+ continue;
+ }
+ ne = grab_nat_entry(nm_i, nid);
+ if (!ne) {
+ write_unlock(&nm_i->nat_tree_lock);
+ goto retry;
+ }
+ nat_set_blkaddr(ne, le32_to_cpu(raw_ne.block_addr));
+ nat_set_ino(ne, le32_to_cpu(raw_ne.ino));
+ nat_set_version(ne, raw_ne.version);
+ __set_nat_cache_dirty(nm_i, ne);
+ write_unlock(&nm_i->nat_tree_lock);
+ }
+ update_nats_in_cursum(sum, -i);
+ mutex_unlock(&curseg->curseg_mutex);
+ return true;
+}
+
+/**
+ * This function is called during the checkpointing process.
+ */
+void flush_nat_entries(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+ struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA);
+ struct f2fs_summary_block *sum = curseg->sum_blk;
+ struct list_head *cur, *n;
+ struct page *page = NULL;
+ struct f2fs_nat_block *nat_blk = NULL;
+ nid_t start_nid = 0, end_nid = 0;
+ bool flushed;
+
+ flushed = flush_nats_in_journal(sbi);
+
+ if (!flushed)
+ mutex_lock(&curseg->curseg_mutex);
+
+ /* 1) flush dirty nat caches */
+ list_for_each_safe(cur, n, &nm_i->dirty_nat_entries) {
+ struct nat_entry *ne;
+ nid_t nid;
+ struct f2fs_nat_entry raw_ne;
+ int offset = -1;
+ block_t old_blkaddr, new_blkaddr;
+
+ ne = list_entry(cur, struct nat_entry, list);
+ nid = nat_get_nid(ne);
+
+ if (nat_get_blkaddr(ne) == NEW_ADDR)
+ continue;
+ if (flushed)
+ goto to_nat_page;
+
+ /* if there is room for nat enries in curseg->sumpage */
+ offset = lookup_journal_in_cursum(sum, NAT_JOURNAL, nid, 1);
+ if (offset >= 0) {
+ raw_ne = nat_in_journal(sum, offset);
+ old_blkaddr = le32_to_cpu(raw_ne.block_addr);
+ goto flush_now;
+ }
+to_nat_page:
+ if (!page || (start_nid > nid || nid > end_nid)) {
+ if (page) {
+ f2fs_put_page(page, 1);
+ page = NULL;
+ }
+ start_nid = START_NID(nid);
+ end_nid = start_nid + NAT_ENTRY_PER_BLOCK - 1;
+
+ /*
+ * get nat block with dirty flag, increased reference
+ * count, mapped and lock
+ */
+ page = get_next_nat_page(sbi, start_nid);
+ nat_blk = page_address(page);
+ }
+
+ BUG_ON(!nat_blk);
+ raw_ne = nat_blk->entries[nid - start_nid];
+ old_blkaddr = le32_to_cpu(raw_ne.block_addr);
+flush_now:
+ new_blkaddr = nat_get_blkaddr(ne);
+
+ raw_ne.ino = cpu_to_le32(nat_get_ino(ne));
+ raw_ne.block_addr = cpu_to_le32(new_blkaddr);
+ raw_ne.version = nat_get_version(ne);
+
+ if (offset < 0) {
+ nat_blk->entries[nid - start_nid] = raw_ne;
+ } else {
+ nat_in_journal(sum, offset) = raw_ne;
+ nid_in_journal(sum, offset) = cpu_to_le32(nid);
+ }
+
+ if (nat_get_blkaddr(ne) == NULL_ADDR) {
+ write_lock(&nm_i->nat_tree_lock);
+ __del_from_nat_cache(nm_i, ne);
+ write_unlock(&nm_i->nat_tree_lock);
+
+ /* We can reuse this freed nid at this point */
+ add_free_nid(NM_I(sbi), nid);
+ } else {
+ write_lock(&nm_i->nat_tree_lock);
+ __clear_nat_cache_dirty(nm_i, ne);
+ ne->checkpointed = true;
+ write_unlock(&nm_i->nat_tree_lock);
+ }
+ }
+ if (!flushed)
+ mutex_unlock(&curseg->curseg_mutex);
+ f2fs_put_page(page, 1);
+
+ /* 2) shrink nat caches if necessary */
+ try_to_free_nats(sbi, nm_i->nat_cnt - NM_WOUT_THRESHOLD);
+}
+
+static int init_node_manager(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_super_block *sb_raw = F2FS_RAW_SUPER(sbi);
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+ unsigned char *version_bitmap;
+ unsigned int nat_segs, nat_blocks;
+
+ nm_i->nat_blkaddr = le32_to_cpu(sb_raw->nat_blkaddr);
+
+ /* segment_count_nat includes pair segment so divide to 2. */
+ nat_segs = le32_to_cpu(sb_raw->segment_count_nat) >> 1;
+ nat_blocks = nat_segs << le32_to_cpu(sb_raw->log_blocks_per_seg);
+ nm_i->max_nid = NAT_ENTRY_PER_BLOCK * nat_blocks;
+ nm_i->fcnt = 0;
+ nm_i->nat_cnt = 0;
+
+ INIT_LIST_HEAD(&nm_i->free_nid_list);
+ INIT_RADIX_TREE(&nm_i->nat_root, GFP_ATOMIC);
+ INIT_LIST_HEAD(&nm_i->nat_entries);
+ INIT_LIST_HEAD(&nm_i->dirty_nat_entries);
+
+ mutex_init(&nm_i->build_lock);
+ spin_lock_init(&nm_i->free_nid_list_lock);
+ rwlock_init(&nm_i->nat_tree_lock);
+
+ nm_i->bitmap_size = __bitmap_size(sbi, NAT_BITMAP);
+ nm_i->init_scan_nid = le32_to_cpu(sbi->ckpt->next_free_nid);
+ nm_i->next_scan_nid = le32_to_cpu(sbi->ckpt->next_free_nid);
+
+ nm_i->nat_bitmap = kzalloc(nm_i->bitmap_size, GFP_KERNEL);
+ if (!nm_i->nat_bitmap)
+ return -ENOMEM;
+ version_bitmap = __bitmap_ptr(sbi, NAT_BITMAP);
+ if (!version_bitmap)
+ return -EFAULT;
+
+ /* copy version bitmap */
+ memcpy(nm_i->nat_bitmap, version_bitmap, nm_i->bitmap_size);
+ return 0;
+}
+
+int build_node_manager(struct f2fs_sb_info *sbi)
+{
+ int err;
+
+ sbi->nm_info = kzalloc(sizeof(struct f2fs_nm_info), GFP_KERNEL);
+ if (!sbi->nm_info)
+ return -ENOMEM;
+
+ err = init_node_manager(sbi);
+ if (err)
+ return err;
+
+ build_free_nids(sbi);
+ return 0;
+}
+
+void destroy_node_manager(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+ struct free_nid *i, *next_i;
+ struct nat_entry *natvec[NATVEC_SIZE];
+ nid_t nid = 0;
+ unsigned int found;
+
+ if (!nm_i)
+ return;
+
+ /* destroy free nid list */
+ spin_lock(&nm_i->free_nid_list_lock);
+ list_for_each_entry_safe(i, next_i, &nm_i->free_nid_list, list) {
+ BUG_ON(i->state == NID_ALLOC);
+ __del_from_free_nid_list(i);
+ nm_i->fcnt--;
+ }
+ BUG_ON(nm_i->fcnt);
+ spin_unlock(&nm_i->free_nid_list_lock);
+
+ /* destroy nat cache */
+ write_lock(&nm_i->nat_tree_lock);
+ while ((found = __gang_lookup_nat_cache(nm_i,
+ nid, NATVEC_SIZE, natvec))) {
+ unsigned idx;
+ for (idx = 0; idx < found; idx++) {
+ struct nat_entry *e = natvec[idx];
+ nid = nat_get_nid(e) + 1;
+ __del_from_nat_cache(nm_i, e);
+ }
+ }
+ BUG_ON(nm_i->nat_cnt);
+ write_unlock(&nm_i->nat_tree_lock);
+
+ kfree(nm_i->nat_bitmap);
+ sbi->nm_info = NULL;
+ kfree(nm_i);
+}
+
+int create_node_manager_caches(void)
+{
+ nat_entry_slab = f2fs_kmem_cache_create("nat_entry",
+ sizeof(struct nat_entry), NULL);
+ if (!nat_entry_slab)
+ return -ENOMEM;
+
+ free_nid_slab = f2fs_kmem_cache_create("free_nid",
+ sizeof(struct free_nid), NULL);
+ if (!free_nid_slab) {
+ kmem_cache_destroy(nat_entry_slab);
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+void destroy_node_manager_caches(void)
+{
+ kmem_cache_destroy(free_nid_slab);
+ kmem_cache_destroy(nat_entry_slab);
+}
--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-10-31 09:44:57

[permalink] [raw]

Subject: [PATCH 08/17] f2fs: add file operations

This adds memory operations and file/file_inode operations.

- F2FS supports fallocate(), mmap(), fsync(), and basic ioctl().

Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/file.c | 637 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 637 insertions(+)
create mode 100644 fs/f2fs/file.c

diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
new file mode 100644
index 0000000..f5ae36d
--- /dev/null
+++ b/fs/f2fs/file.c
@@ -0,0 +1,637 @@
+/**
+ * fs/f2fs/file.c
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/fs.h>
+#include <linux/f2fs_fs.h>
+#include <linux/stat.h>
+#include <linux/buffer_head.h>
+#include <linux/writeback.h>
+#include <linux/falloc.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/mount.h>
+
+#include "f2fs.h"
+#include "node.h"
+#include "segment.h"
+#include "xattr.h"
+#include "acl.h"
+
+static int f2fs_vm_page_mkwrite(struct vm_area_struct *vma,
+ struct vm_fault *vmf)
+{
+ struct page *page = vmf->page;
+ struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ struct page *node_page;
+ block_t old_blk_addr;
+ struct dnode_of_data dn;
+ int err;
+
+ f2fs_balance_fs(sbi);
+
+ sb_start_pagefault(inode->i_sb);
+
+ mutex_lock_op(sbi, DATA_NEW);
+
+ /* block allocation */
+ set_new_dnode(&dn, inode, NULL, NULL, 0);
+ err = get_dnode_of_data(&dn, page->index, 0);
+ if (err) {
+ mutex_unlock_op(sbi, DATA_NEW);
+ goto out;
+ }
+
+ old_blk_addr = dn.data_blkaddr;
+ node_page = dn.node_page;
+
+ if (old_blk_addr == NULL_ADDR) {
+ err = reserve_new_block(&dn);
+ if (err) {
+ f2fs_put_dnode(&dn);
+ mutex_unlock_op(sbi, DATA_NEW);
+ goto out;
+ }
+ }
+ f2fs_put_dnode(&dn);
+
+ mutex_unlock_op(sbi, DATA_NEW);
+
+ lock_page(page);
+ if (page->mapping != inode->i_mapping ||
+ page_offset(page) >= i_size_read(inode) ||
+ !PageUptodate(page)) {
+ unlock_page(page);
+ err = -EFAULT;
+ goto out;
+ }
+
+ /*
+ * check to see if the page is mapped already (no holes)
+ */
+ if (PageMappedToDisk(page))
+ goto out;
+
+ /* fill the page */
+ wait_on_page_writeback(page);
+
+ /* page is wholly or partially inside EOF */
+ if (((page->index + 1) << PAGE_CACHE_SHIFT) > i_size_read(inode)) {
+ unsigned offset;
+ offset = i_size_read(inode) & ~PAGE_CACHE_MASK;
+ zero_user_segment(page, offset, PAGE_CACHE_SIZE);
+ }
+ set_page_dirty(page);
+ SetPageUptodate(page);
+
+ file_update_time(vma->vm_file);
+out:
+ sb_end_pagefault(inode->i_sb);
+ return block_page_mkwrite_return(err);
+}
+
+static const struct vm_operations_struct f2fs_file_vm_ops = {
+ .fault = filemap_fault,
+ .page_mkwrite = f2fs_vm_page_mkwrite,
+};
+
+static int need_to_sync_dir(struct f2fs_sb_info *sbi, struct inode *inode)
+{
+ struct dentry *dentry;
+ nid_t pino;
+
+ inode = igrab(inode);
+ dentry = d_find_any_alias(inode);
+ if (!dentry) {
+ iput(inode);
+ return 0;
+ }
+ pino = dentry->d_parent->d_inode->i_ino;
+ dput(dentry);
+ iput(inode);
+ return !is_checkpointed_node(sbi, pino);
+}
+
+int f2fs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
+{
+ struct inode *inode = file->f_mapping->host;
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ unsigned long long cur_version;
+ int ret = 0;
+ bool need_cp = false;
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_ALL,
+ .nr_to_write = LONG_MAX,
+ .for_reclaim = 0,
+ };
+
+ ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
+ if (ret)
+ return ret;
+
+ mutex_lock(&inode->i_mutex);
+
+ if (inode->i_sb->s_flags & MS_RDONLY)
+ goto out;
+ if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
+ goto out;
+
+ mutex_lock(&sbi->cp_mutex);
+ cur_version = le64_to_cpu(F2FS_CKPT(sbi)->checkpoint_ver);
+ mutex_unlock(&sbi->cp_mutex);
+
+ if (F2FS_I(inode)->data_version != cur_version &&
+ !(inode->i_state & I_DIRTY))
+ goto out;
+ F2FS_I(inode)->data_version--;
+
+ if (!S_ISREG(inode->i_mode) || inode->i_nlink != 1)
+ need_cp = true;
+ if (is_inode_flag_set(F2FS_I(inode), FI_NEED_CP))
+ need_cp = true;
+ if (!space_for_roll_forward(sbi))
+ need_cp = true;
+ if (need_to_sync_dir(sbi, inode))
+ need_cp = true;
+
+ f2fs_write_inode(inode, NULL);
+
+ if (need_cp) {
+ /* all the dirty node pages should be flushed for POR */
+ ret = f2fs_sync_fs(inode->i_sb, 1);
+ clear_inode_flag(F2FS_I(inode), FI_NEED_CP);
+ } else {
+ while (sync_node_pages(sbi, inode->i_ino, &wbc) == 0)
+ f2fs_write_inode(inode, NULL);
+ filemap_fdatawait_range(sbi->node_inode->i_mapping,
+ 0, LONG_MAX);
+ }
+out:
+ mutex_unlock(&inode->i_mutex);
+ return ret;
+}
+
+static int f2fs_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ file_accessed(file);
+ vma->vm_ops = &f2fs_file_vm_ops;
+ return 0;
+}
+
+static int truncate_data_blocks_range(struct dnode_of_data *dn, int count)
+{
+ int nr_free = 0, ofs = dn->ofs_in_node;
+ struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb);
+ struct f2fs_node *raw_node;
+ __le32 *addr;
+
+ raw_node = page_address(dn->node_page);
+ addr = blkaddr_in_node(raw_node) + ofs;
+
+ for ( ; count > 0; count--, addr++, dn->ofs_in_node++) {
+ block_t blkaddr = le32_to_cpu(*addr);
+ if (blkaddr == NULL_ADDR)
+ continue;
+
+ update_extent_cache(NULL_ADDR, dn);
+ invalidate_blocks(sbi, blkaddr);
+ dec_valid_block_count(sbi, dn->inode, 1);
+ nr_free++;
+ }
+ if (nr_free) {
+ set_page_dirty(dn->node_page);
+ sync_inode_page(dn);
+ }
+ dn->ofs_in_node = ofs;
+ return nr_free;
+}
+
+void truncate_data_blocks(struct dnode_of_data *dn)
+{
+ truncate_data_blocks_range(dn, ADDRS_PER_BLOCK);
+}
+
+static void truncate_partial_data_page(struct inode *inode, u64 from)
+{
+ unsigned offset = from & (PAGE_CACHE_SIZE - 1);
+ struct page *page;
+
+ if (!offset)
+ return;
+
+ page = find_data_page(inode, from >> PAGE_CACHE_SHIFT);
+ if (IS_ERR(page))
+ return;
+
+ lock_page(page);
+ wait_on_page_writeback(page);
+ zero_user(page, offset, PAGE_CACHE_SIZE - offset);
+ set_page_dirty(page);
+ f2fs_put_page(page, 1);
+}
+
+static int truncate_blocks(struct inode *inode, u64 from)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ unsigned int blocksize = inode->i_sb->s_blocksize;
+ struct dnode_of_data dn;
+ pgoff_t free_from;
+ int count = 0;
+ int err;
+
+ free_from = (pgoff_t)
+ ((from + blocksize - 1) >> (sbi->log_blocksize));
+
+ mutex_lock_op(sbi, DATA_TRUNC);
+
+ set_new_dnode(&dn, inode, NULL, NULL, 0);
+ err = get_dnode_of_data(&dn, free_from, RDONLY_NODE);
+ if (err) {
+ if (err == -ENOENT)
+ goto free_next;
+ mutex_unlock_op(sbi, DATA_TRUNC);
+ return err;
+ }
+
+ if (IS_INODE(dn.node_page))
+ count = ADDRS_PER_INODE;
+ else
+ count = ADDRS_PER_BLOCK;
+
+ count -= dn.ofs_in_node;
+ BUG_ON(count < 0);
+ if (dn.ofs_in_node || IS_INODE(dn.node_page)) {
+ truncate_data_blocks_range(&dn, count);
+ free_from += count;
+ }
+
+ f2fs_put_dnode(&dn);
+free_next:
+ err = truncate_inode_blocks(inode, free_from);
+ mutex_unlock_op(sbi, DATA_TRUNC);
+
+ /* lastly zero out the first data page */
+ truncate_partial_data_page(inode, from);
+
+ return err;
+}
+
+void f2fs_truncate(struct inode *inode)
+{
+ if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
+ S_ISLNK(inode->i_mode)))
+ return;
+
+ if (!truncate_blocks(inode, i_size_read(inode))) {
+ inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+ mark_inode_dirty(inode);
+ }
+
+ f2fs_balance_fs(F2FS_SB(inode->i_sb));
+}
+
+static int f2fs_getattr(struct vfsmount *mnt,
+ struct dentry *dentry, struct kstat *stat)
+{
+ struct inode *inode = dentry->d_inode;
+ generic_fillattr(inode, stat);
+ stat->blocks <<= 3;
+ return 0;
+}
+
+#ifdef CONFIG_F2FS_FS_POSIX_ACL
+static void __setattr_copy(struct inode *inode, const struct iattr *attr)
+{
+ struct f2fs_inode_info *fi = F2FS_I(inode);
+ unsigned int ia_valid = attr->ia_valid;
+
+ if (ia_valid & ATTR_UID)
+ inode->i_uid = attr->ia_uid;
+ if (ia_valid & ATTR_GID)
+ inode->i_gid = attr->ia_gid;
+ if (ia_valid & ATTR_ATIME)
+ inode->i_atime = timespec_trunc(attr->ia_atime,
+ inode->i_sb->s_time_gran);
+ if (ia_valid & ATTR_MTIME)
+ inode->i_mtime = timespec_trunc(attr->ia_mtime,
+ inode->i_sb->s_time_gran);
+ if (ia_valid & ATTR_CTIME)
+ inode->i_ctime = timespec_trunc(attr->ia_ctime,
+ inode->i_sb->s_time_gran);
+ if (ia_valid & ATTR_MODE) {
+ umode_t mode = attr->ia_mode;
+
+ if (!in_group_p(inode->i_gid) && !capable(CAP_FSETID))
+ mode &= ~S_ISGID;
+ set_acl_inode(fi, mode);
+ }
+}
+#else
+#define __setattr_copy setattr_copy
+#endif
+
+int f2fs_setattr(struct dentry *dentry, struct iattr *attr)
+{
+ struct inode *inode = dentry->d_inode;
+ struct f2fs_inode_info *fi = F2FS_I(inode);
+ int err;
+
+ err = inode_change_ok(inode, attr);
+ if (err)
+ return err;
+
+ if ((attr->ia_valid & ATTR_SIZE) &&
+ attr->ia_size != i_size_read(inode)) {
+ truncate_setsize(inode, attr->ia_size);
+ f2fs_truncate(inode);
+ }
+
+ __setattr_copy(inode, attr);
+
+ if (attr->ia_valid & ATTR_MODE) {
+ err = f2fs_acl_chmod(inode);
+ if (err || is_inode_flag_set(fi, FI_ACL_MODE)) {
+ inode->i_mode = fi->i_acl_mode;
+ clear_inode_flag(fi, FI_ACL_MODE);
+ }
+ }
+
+ mark_inode_dirty(inode);
+ return err;
+}
+
+const struct inode_operations f2fs_file_inode_operations = {
+ .getattr = f2fs_getattr,
+ .setattr = f2fs_setattr,
+ .get_acl = f2fs_get_acl,
+#ifdef CONFIG_F2FS_FS_XATTR
+ .setxattr = generic_setxattr,
+ .getxattr = generic_getxattr,
+ .listxattr = f2fs_listxattr,
+ .removexattr = generic_removexattr,
+#endif
+};
+
+static void fill_zero(struct inode *inode, pgoff_t index,
+ loff_t start, loff_t len)
+{
+ struct page *page;
+
+ if (!len)
+ return;
+
+ page = get_new_data_page(inode, index, false);
+
+ if (!IS_ERR(page)) {
+ wait_on_page_writeback(page);
+ zero_user(page, start, len);
+ set_page_dirty(page);
+ f2fs_put_page(page, 1);
+ }
+}
+
+int truncate_hole(struct inode *inode, pgoff_t pg_start, pgoff_t pg_end)
+{
+ pgoff_t index;
+ int err;
+
+ for (index = pg_start; index < pg_end; index++) {
+ struct dnode_of_data dn;
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+
+ mutex_lock_op(sbi, DATA_TRUNC);
+ set_new_dnode(&dn, inode, NULL, NULL, 0);
+ err = get_dnode_of_data(&dn, index, RDONLY_NODE);
+ if (err) {
+ mutex_unlock_op(sbi, DATA_TRUNC);
+ if (err == -ENOENT)
+ continue;
+ return err;
+ }
+
+ if (dn.data_blkaddr != NULL_ADDR)
+ truncate_data_blocks_range(&dn, 1);
+ f2fs_put_dnode(&dn);
+ mutex_unlock_op(sbi, DATA_TRUNC);
+ }
+ return 0;
+}
+
+static int punch_hole(struct inode *inode, loff_t offset, loff_t len, int mode)
+{
+ pgoff_t pg_start, pg_end;
+ loff_t off_start, off_end;
+ int ret = 0;
+
+ pg_start = ((unsigned long long) offset) >> PAGE_CACHE_SHIFT;
+ pg_end = ((unsigned long long) offset + len) >> PAGE_CACHE_SHIFT;
+
+ off_start = offset & (PAGE_CACHE_SIZE - 1);
+ off_end = (offset + len) & (PAGE_CACHE_SIZE - 1);
+
+ if (pg_start == pg_end) {
+ fill_zero(inode, pg_start, off_start,
+ off_end - off_start);
+ } else {
+ if (off_start)
+ fill_zero(inode, pg_start++, off_start,
+ PAGE_CACHE_SIZE - off_start);
+ if (off_end)
+ fill_zero(inode, pg_end, 0, off_end);
+
+ if (pg_start < pg_end) {
+ struct address_space *mapping = inode->i_mapping;
+ loff_t blk_start, blk_end;
+
+ blk_start = pg_start << PAGE_CACHE_SHIFT;
+ blk_end = pg_end << PAGE_CACHE_SHIFT;
+ truncate_inode_pages_range(mapping, blk_start,
+ blk_end - 1);
+ ret = truncate_hole(inode, pg_start, pg_end);
+ }
+ }
+
+ if (!(mode & FALLOC_FL_KEEP_SIZE) &&
+ i_size_read(inode) <= (offset + len)) {
+ i_size_write(inode, offset);
+ mark_inode_dirty(inode);
+ }
+
+ return ret;
+}
+
+static int expand_inode_data(struct inode *inode, loff_t offset,
+ loff_t len, int mode)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ pgoff_t index, pg_start, pg_end;
+ loff_t new_size = i_size_read(inode);
+ loff_t off_start, off_end;
+ int ret = 0;
+
+ ret = inode_newsize_ok(inode, (len + offset));
+ if (ret)
+ return ret;
+
+ pg_start = ((unsigned long long) offset) >> PAGE_CACHE_SHIFT;
+ pg_end = ((unsigned long long) offset + len) >> PAGE_CACHE_SHIFT;
+
+ off_start = offset & (PAGE_CACHE_SIZE - 1);
+ off_end = (offset + len) & (PAGE_CACHE_SIZE - 1);
+
+ for (index = pg_start; index <= pg_end; index++) {
+ struct dnode_of_data dn;
+
+ mutex_lock_op(sbi, DATA_NEW);
+
+ set_new_dnode(&dn, inode, NULL, NULL, 0);
+ ret = get_dnode_of_data(&dn, index, 0);
+ if (ret) {
+ mutex_unlock_op(sbi, DATA_NEW);
+ break;
+ }
+
+ if (dn.data_blkaddr == NULL_ADDR) {
+ ret = reserve_new_block(&dn);
+ if (ret) {
+ f2fs_put_dnode(&dn);
+ mutex_unlock_op(sbi, DATA_NEW);
+ break;
+ }
+ }
+ f2fs_put_dnode(&dn);
+
+ mutex_unlock_op(sbi, DATA_NEW);
+
+ if (pg_start == pg_end)
+ new_size = offset + len;
+ else if (index == pg_start && off_start)
+ new_size = (index + 1) << PAGE_CACHE_SHIFT;
+ else if (index == pg_end)
+ new_size = (index << PAGE_CACHE_SHIFT) + off_end;
+ else
+ new_size += PAGE_CACHE_SIZE;
+ }
+
+ if (!(mode & FALLOC_FL_KEEP_SIZE) &&
+ i_size_read(inode) < new_size) {
+ i_size_write(inode, new_size);
+ mark_inode_dirty(inode);
+ }
+
+ return ret;
+}
+
+static long f2fs_fallocate(struct file *file, int mode,
+ loff_t offset, loff_t len)
+{
+ struct inode *inode = file->f_path.dentry->d_inode;
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ long ret;
+
+ if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+ return -EOPNOTSUPP;
+
+ if (mode & FALLOC_FL_PUNCH_HOLE)
+ ret = punch_hole(inode, offset, len, mode);
+ else
+ ret = expand_inode_data(inode, offset, len, mode);
+
+ f2fs_balance_fs(sbi);
+ return ret;
+}
+
+#define F2FS_REG_FLMASK (~(FS_DIRSYNC_FL | FS_TOPDIR_FL))
+#define F2FS_OTHER_FLMASK (FS_NODUMP_FL | FS_NOATIME_FL)
+
+static inline __u32 f2fs_mask_flags(umode_t mode, __u32 flags)
+{
+ if (S_ISDIR(mode))
+ return flags;
+ else if (S_ISREG(mode))
+ return flags & F2FS_REG_FLMASK;
+ else
+ return flags & F2FS_OTHER_FLMASK;
+}
+
+long f2fs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+{
+ struct inode *inode = filp->f_dentry->d_inode;
+ struct f2fs_inode_info *fi = F2FS_I(inode);
+ unsigned int flags;
+ int ret;
+
+ switch (cmd) {
+ case FS_IOC_GETFLAGS:
+ flags = fi->i_flags & FS_FL_USER_VISIBLE;
+ return put_user(flags, (int __user *) arg);
+ case FS_IOC_SETFLAGS:
+ {
+ unsigned int oldflags;
+
+ ret = mnt_want_write(filp->f_path.mnt);
+ if (ret)
+ return ret;
+
+ if (!inode_owner_or_capable(inode)) {
+ ret = -EACCES;
+ goto out;
+ }
+
+ if (get_user(flags, (int __user *) arg)) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ flags = f2fs_mask_flags(inode->i_mode, flags);
+
+ mutex_lock(&inode->i_mutex);
+
+ oldflags = fi->i_flags;
+
+ if ((flags ^ oldflags) & (FS_APPEND_FL | FS_IMMUTABLE_FL)) {
+ if (!capable(CAP_LINUX_IMMUTABLE)) {
+ mutex_unlock(&inode->i_mutex);
+ ret = -EPERM;
+ goto out;
+ }
+ }
+
+ flags = flags & FS_FL_USER_MODIFIABLE;
+ flags |= oldflags & ~FS_FL_USER_MODIFIABLE;
+ fi->i_flags = flags;
+ mutex_unlock(&inode->i_mutex);
+
+ f2fs_set_inode_flags(inode);
+ inode->i_ctime = CURRENT_TIME;
+ mark_inode_dirty(inode);
+out:
+ mnt_drop_write(filp->f_path.mnt);
+ return ret;
+ }
+ default:
+ return -ENOTTY;
+ }
+}
+
+const struct file_operations f2fs_file_operations = {
+ .llseek = generic_file_llseek,
+ .read = do_sync_read,
+ .write = do_sync_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = generic_file_aio_write,
+ .open = generic_file_open,
+ .mmap = f2fs_file_mmap,
+ .fsync = f2fs_sync_file,
+ .fallocate = f2fs_fallocate,
+ .unlocked_ioctl = f2fs_ioctl,
+ .splice_read = generic_file_splice_read,
+ .splice_write = generic_file_splice_write,
+};
--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-10-31 09:45:37

[permalink] [raw]

Subject: [PATCH 09/17] f2fs: add address space operations for data

This adds address space operations for data.

- F2FS supports readpages(), writepages(), and direct_IO().

- Because of out-of-place writes, f2fs_direct_IO() does not write data in place.

Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/data.c | 701 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 701 insertions(+)
create mode 100644 fs/f2fs/data.c

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
new file mode 100644
index 0000000..c2fd0a8
--- /dev/null
+++ b/fs/f2fs/data.c
@@ -0,0 +1,701 @@
+/**
+ * fs/f2fs/data.c
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/fs.h>
+#include <linux/f2fs_fs.h>
+#include <linux/buffer_head.h>
+#include <linux/mpage.h>
+#include <linux/writeback.h>
+#include <linux/backing-dev.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+
+#include "f2fs.h"
+#include "node.h"
+#include "segment.h"
+
+/**
+ * Lock ordering for the change of data block address:
+ * ->data_page
+ * ->node_page
+ * update block addresses in the node page
+ */
+static void __set_data_blkaddr(struct dnode_of_data *dn, block_t new_addr)
+{
+ struct f2fs_node *rn;
+ __le32 *addr_array;
+ struct page *node_page = dn->node_page;
+ unsigned int ofs_in_node = dn->ofs_in_node;
+
+ wait_on_page_writeback(node_page);
+
+ rn = (struct f2fs_node *)page_address(node_page);
+
+ /* Get physical address of data block */
+ addr_array = blkaddr_in_node(rn);
+ addr_array[ofs_in_node] = cpu_to_le32(new_addr);
+ set_page_dirty(node_page);
+}
+
+int reserve_new_block(struct dnode_of_data *dn)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(dn->inode->i_sb);
+
+ if (is_inode_flag_set(F2FS_I(dn->inode), FI_NO_ALLOC))
+ return -EPERM;
+ if (!inc_valid_block_count(sbi, dn->inode, 1))
+ return -ENOSPC;
+
+ __set_data_blkaddr(dn, NEW_ADDR);
+ dn->data_blkaddr = NEW_ADDR;
+ sync_inode_page(dn);
+ return 0;
+}
+
+static int check_extent_cache(struct inode *inode, pgoff_t pgofs,
+ struct buffer_head *bh_result)
+{
+ struct f2fs_inode_info *fi = F2FS_I(inode);
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ pgoff_t start_fofs, end_fofs;
+ block_t start_blkaddr;
+
+ read_lock(&fi->ext.ext_lock);
+ if (fi->ext.len == 0) {
+ read_unlock(&fi->ext.ext_lock);
+ return 0;
+ }
+
+ sbi->total_hit_ext++;
+ start_fofs = fi->ext.fofs;
+ end_fofs = fi->ext.fofs + fi->ext.len - 1;
+ start_blkaddr = fi->ext.blk_addr;
+
+ if (pgofs >= start_fofs && pgofs <= end_fofs) {
+ unsigned int blkbits = inode->i_sb->s_blocksize_bits;
+ size_t count;
+
+ clear_buffer_new(bh_result);
+ map_bh(bh_result, inode->i_sb,
+ start_blkaddr + pgofs - start_fofs);
+ count = end_fofs - pgofs + 1;
+ if (count < (UINT_MAX >> blkbits))
+ bh_result->b_size = (count << blkbits);
+ else
+ bh_result->b_size = UINT_MAX;
+
+ sbi->read_hit_ext++;
+ read_unlock(&fi->ext.ext_lock);
+ return 1;
+ }
+ read_unlock(&fi->ext.ext_lock);
+ return 0;
+}
+
+void update_extent_cache(block_t blk_addr, struct dnode_of_data *dn)
+{
+ struct f2fs_inode_info *fi = F2FS_I(dn->inode);
+ pgoff_t fofs, start_fofs, end_fofs;
+ block_t start_blkaddr, end_blkaddr;
+
+ BUG_ON(blk_addr == NEW_ADDR);
+ fofs = start_bidx_of_node(ofs_of_node(dn->node_page)) + dn->ofs_in_node;
+
+ /* Update the page address in the parent node */
+ __set_data_blkaddr(dn, blk_addr);
+
+ write_lock(&fi->ext.ext_lock);
+
+ start_fofs = fi->ext.fofs;
+ end_fofs = fi->ext.fofs + fi->ext.len - 1;
+ start_blkaddr = fi->ext.blk_addr;
+ end_blkaddr = fi->ext.blk_addr + fi->ext.len - 1;
+
+ /* Drop and initialize the matched extent */
+ if (fi->ext.len == 1 && fofs == start_fofs)
+ fi->ext.len = 0;
+
+ /* Initial extent */
+ if (fi->ext.len == 0) {
+ if (blk_addr != NULL_ADDR) {
+ fi->ext.fofs = fofs;
+ fi->ext.blk_addr = blk_addr;
+ fi->ext.len = 1;
+ }
+ goto end_update;
+ }
+
+ /* Frone merge */
+ if (fofs == start_fofs - 1 && blk_addr == start_blkaddr - 1) {
+ fi->ext.fofs--;
+ fi->ext.blk_addr--;
+ fi->ext.len++;
+ goto end_update;
+ }
+
+ /* Back merge */
+ if (fofs == end_fofs + 1 && blk_addr == end_blkaddr + 1) {
+ fi->ext.len++;
+ goto end_update;
+ }
+
+ /* Split the existing extent */
+ if (fi->ext.len > 1 &&
+ fofs >= start_fofs && fofs <= end_fofs) {
+ if ((end_fofs - fofs) < (fi->ext.len >> 1)) {
+ fi->ext.len = fofs - start_fofs;
+ } else {
+ fi->ext.fofs = fofs + 1;
+ fi->ext.blk_addr = start_blkaddr +
+ fofs - start_fofs + 1;
+ fi->ext.len -= fofs - start_fofs + 1;
+ }
+ goto end_update;
+ }
+ write_unlock(&fi->ext.ext_lock);
+ return;
+
+end_update:
+ write_unlock(&fi->ext.ext_lock);
+ sync_inode_page(dn);
+ return;
+}
+
+struct page *find_data_page(struct inode *inode, pgoff_t index)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ struct address_space *mapping = inode->i_mapping;
+ struct dnode_of_data dn;
+ struct page *page;
+ int err;
+
+ page = find_get_page(mapping, index);
+ if (page && PageUptodate(page))
+ return page;
+ f2fs_put_page(page, 0);
+
+ set_new_dnode(&dn, inode, NULL, NULL, 0);
+ err = get_dnode_of_data(&dn, index, RDONLY_NODE);
+ if (err)
+ return ERR_PTR(err);
+ f2fs_put_dnode(&dn);
+
+ if (dn.data_blkaddr == NULL_ADDR)
+ return ERR_PTR(-ENOENT);
+
+ /* By fallocate(), there is no cached page, but with NEW_ADDR */
+ if (dn.data_blkaddr == NEW_ADDR)
+ return ERR_PTR(-EINVAL);
+
+ page = grab_cache_page(mapping, index);
+ if (!page)
+ return ERR_PTR(-ENOMEM);
+
+ err = f2fs_readpage(sbi, page, dn.data_blkaddr, READ_SYNC);
+ if (err) {
+ f2fs_put_page(page, 1);
+ return ERR_PTR(err);
+ }
+ unlock_page(page);
+ return page;
+}
+
+/**
+ * If it tries to access a hole, return an error.
+ * Because, the callers, functions in dir.c and GC, should be able to know
+ * whether this page exists or not.
+ */
+struct page *get_lock_data_page(struct inode *inode, pgoff_t index)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ struct address_space *mapping = inode->i_mapping;
+ struct dnode_of_data dn;
+ struct page *page;
+ int err;
+
+ set_new_dnode(&dn, inode, NULL, NULL, 0);
+ err = get_dnode_of_data(&dn, index, RDONLY_NODE);
+ if (err)
+ return ERR_PTR(err);
+ f2fs_put_dnode(&dn);
+
+ if (dn.data_blkaddr == NULL_ADDR)
+ return ERR_PTR(-ENOENT);
+
+ page = grab_cache_page(mapping, index);
+ if (!page)
+ return ERR_PTR(-ENOMEM);
+
+ if (PageUptodate(page))
+ return page;
+
+ BUG_ON(dn.data_blkaddr == NEW_ADDR);
+ BUG_ON(dn.data_blkaddr == NULL_ADDR);
+
+ err = f2fs_readpage(sbi, page, dn.data_blkaddr, READ_SYNC);
+ if (err) {
+ f2fs_put_page(page, 1);
+ return ERR_PTR(err);
+ }
+ return page;
+}
+
+/**
+ * Caller ensures that this data page is never allocated.
+ * A new zero-filled data page is allocated in the page cache.
+ */
+struct page *get_new_data_page(struct inode *inode, pgoff_t index,
+ bool new_i_size)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ struct address_space *mapping = inode->i_mapping;
+ struct page *page;
+ struct dnode_of_data dn;
+ int err;
+
+ set_new_dnode(&dn, inode, NULL, NULL, 0);
+ err = get_dnode_of_data(&dn, index, 0);
+ if (err)
+ return ERR_PTR(err);
+
+ if (dn.data_blkaddr == NULL_ADDR) {
+ if (reserve_new_block(&dn)) {
+ f2fs_put_dnode(&dn);
+ return ERR_PTR(-ENOSPC);
+ }
+ }
+ f2fs_put_dnode(&dn);
+
+ page = grab_cache_page(mapping, index);
+ if (!page)
+ return ERR_PTR(-ENOMEM);
+
+ if (PageUptodate(page))
+ return page;
+
+ if (dn.data_blkaddr == NEW_ADDR) {
+ zero_user_segment(page, 0, PAGE_CACHE_SIZE);
+ } else {
+ err = f2fs_readpage(sbi, page, dn.data_blkaddr, READ_SYNC);
+ if (err) {
+ f2fs_put_page(page, 1);
+ return ERR_PTR(err);
+ }
+ }
+ SetPageUptodate(page);
+
+ if (new_i_size &&
+ i_size_read(inode) < ((index + 1) << PAGE_CACHE_SHIFT)) {
+ i_size_write(inode, ((index + 1) << PAGE_CACHE_SHIFT));
+ mark_inode_dirty_sync(inode);
+ }
+ return page;
+}
+
+static void read_end_io(struct bio *bio, int err)
+{
+ const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+ struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
+
+ do {
+ struct page *page = bvec->bv_page;
+
+ if (--bvec >= bio->bi_io_vec)
+ prefetchw(&bvec->bv_page->flags);
+
+ if (uptodate) {
+ SetPageUptodate(page);
+ } else {
+ ClearPageUptodate(page);
+ SetPageError(page);
+ }
+ unlock_page(page);
+ } while (bvec >= bio->bi_io_vec);
+ kfree(bio->bi_private);
+ bio_put(bio);
+}
+
+/**
+ * Fill the locked page with data located in the block address.
+ * Read operation is synchronous, and caller must unlock the page.
+ */
+int f2fs_readpage(struct f2fs_sb_info *sbi, struct page *page,
+ block_t blk_addr, int type)
+{
+ struct block_device *bdev = sbi->sb->s_bdev;
+ bool sync = (type == READ_SYNC);
+ struct bio *bio;
+
+ /* This page can be already read by other threads */
+ if (PageUptodate(page)) {
+ if (!sync)
+ unlock_page(page);
+ return 0;
+ }
+
+ down_read(&sbi->bio_sem);
+
+ /* Allocate a new bio */
+ bio = f2fs_bio_alloc(bdev, blk_addr << (sbi->log_blocksize - 9),
+ 1, GFP_NOFS | __GFP_HIGH);
+
+ /* Initialize the bio */
+ bio->bi_end_io = read_end_io;
+ if (bio_add_page(bio, page, PAGE_CACHE_SIZE, 0) < PAGE_CACHE_SIZE) {
+ kfree(bio->bi_private);
+ bio_put(bio);
+ up_read(&sbi->bio_sem);
+ return -EFAULT;
+ }
+
+ submit_bio(type, bio);
+ up_read(&sbi->bio_sem);
+
+ /* wait for read completion if sync */
+ if (sync) {
+ lock_page(page);
+ if (PageError(page))
+ return -EIO;
+ }
+ return 0;
+}
+
+/**
+ * This function should be used by the data read flow only where it
+ * does not check the "create" flag that indicates block allocation.
+ * The reason for this special functionality is to exploit VFS readahead
+ * mechanism.
+ */
+static int get_data_block_ro(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create)
+{
+ unsigned int blkbits = inode->i_sb->s_blocksize_bits;
+ unsigned maxblocks = bh_result->b_size >> blkbits;
+ struct dnode_of_data dn;
+ pgoff_t pgofs;
+ int err;
+
+ /* Get the page offset from the block offset(iblock) */
+ pgofs = (pgoff_t)(iblock >> (PAGE_CACHE_SHIFT - blkbits));
+
+ if (check_extent_cache(inode, pgofs, bh_result))
+ return 0;
+
+ /* When reading holes, we need its node page */
+ set_new_dnode(&dn, inode, NULL, NULL, 0);
+ err = get_dnode_of_data(&dn, pgofs, RDONLY_NODE);
+ if (err)
+ return (err == -ENOENT) ? 0 : err;
+
+ /* It does not support data allocation */
+ BUG_ON(create);
+
+ if (dn.data_blkaddr != NEW_ADDR && dn.data_blkaddr != NULL_ADDR) {
+ int i;
+ unsigned int end_offset;
+
+ end_offset = IS_INODE(dn.node_page) ?
+ ADDRS_PER_INODE :
+ ADDRS_PER_BLOCK;
+
+ clear_buffer_new(bh_result);
+
+ /* Give more consecutive addresses for the read ahead */
+ for (i = 0; i < end_offset - dn.ofs_in_node; i++)
+ if (((datablock_addr(dn.node_page,
+ dn.ofs_in_node + i))
+ != (dn.data_blkaddr + i)) || maxblocks == i)
+ break;
+ map_bh(bh_result, inode->i_sb, dn.data_blkaddr);
+ bh_result->b_size = (i << blkbits);
+ }
+ f2fs_put_dnode(&dn);
+ return 0;
+}
+
+static int f2fs_read_data_page(struct file *file, struct page *page)
+{
+ return mpage_readpage(page, get_data_block_ro);
+}
+
+static int f2fs_read_data_pages(struct file *file,
+ struct address_space *mapping,
+ struct list_head *pages, unsigned nr_pages)
+{
+ return mpage_readpages(mapping, pages, nr_pages, get_data_block_ro);
+}
+
+int do_write_data_page(struct page *page)
+{
+ struct inode *inode = page->mapping->host;
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ block_t old_blk_addr, new_blk_addr;
+ struct dnode_of_data dn;
+ int err = 0;
+
+ set_new_dnode(&dn, inode, NULL, NULL, 0);
+ err = get_dnode_of_data(&dn, page->index, RDONLY_NODE);
+ if (err)
+ return err;
+
+ old_blk_addr = dn.data_blkaddr;
+
+ /* This page is already truncated */
+ if (old_blk_addr == NULL_ADDR)
+ goto out_writepage;
+
+ set_page_writeback(page);
+
+ /*
+ * If current allocation needs SSR,
+ * it had better in-place writes for updated data.
+ */
+ if (old_blk_addr != NEW_ADDR && !is_cold_data(page) &&
+ need_inplace_update(inode)) {
+ rewrite_data_page(F2FS_SB(inode->i_sb), page,
+ old_blk_addr);
+ } else {
+ write_data_page(inode, page, &dn,
+ old_blk_addr, &new_blk_addr);
+ update_extent_cache(new_blk_addr, &dn);
+ F2FS_I(inode)->data_version =
+ le64_to_cpu(F2FS_CKPT(sbi)->checkpoint_ver);
+ }
+out_writepage:
+ f2fs_put_dnode(&dn);
+ return err;
+}
+
+static int f2fs_write_data_page(struct page *page,
+ struct writeback_control *wbc)
+{
+ struct inode *inode = page->mapping->host;
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ loff_t i_size = i_size_read(inode);
+ const pgoff_t end_index = ((unsigned long long) i_size)
+ >> PAGE_CACHE_SHIFT;
+ unsigned offset;
+ int err = 0;
+
+ if (page->index < end_index)
+ goto out;
+
+ /*
+ * If the offset is out-of-range of file size,
+ * this page does not have to be written to disk.
+ */
+ offset = i_size & (PAGE_CACHE_SIZE - 1);
+ if ((page->index >= end_index + 1) || !offset) {
+ if (S_ISDIR(inode->i_mode)) {
+ dec_page_count(sbi, F2FS_DIRTY_DENTS);
+ inode_dec_dirty_dents(inode);
+ }
+ goto unlock_out;
+ }
+
+ zero_user_segment(page, offset, PAGE_CACHE_SIZE);
+out:
+ if (sbi->por_doing)
+ goto redirty_out;
+
+ if (wbc->for_reclaim && !S_ISDIR(inode->i_mode) && !is_cold_data(page))
+ goto redirty_out;
+
+ mutex_lock_op(sbi, DATA_WRITE);
+ if (S_ISDIR(inode->i_mode)) {
+ dec_page_count(sbi, F2FS_DIRTY_DENTS);
+ inode_dec_dirty_dents(inode);
+ }
+ err = do_write_data_page(page);
+ if (err && err != -ENOENT) {
+ wbc->pages_skipped++;
+ set_page_dirty(page);
+ }
+ mutex_unlock_op(sbi, DATA_WRITE);
+
+ if (wbc->for_reclaim)
+ f2fs_submit_bio(sbi, DATA, true);
+
+ if (err == -ENOENT)
+ goto unlock_out;
+
+ clear_cold_data(page);
+ unlock_page(page);
+
+ if (!wbc->for_reclaim && !S_ISDIR(inode->i_mode))
+ f2fs_balance_fs(sbi);
+ return 0;
+
+unlock_out:
+ unlock_page(page);
+ return (err == -ENOENT) ? 0 : err;
+
+redirty_out:
+ wbc->pages_skipped++;
+ set_page_dirty(page);
+ return AOP_WRITEPAGE_ACTIVATE;
+}
+
+#define MAX_DESIRED_PAGES_WP 4096
+
+int f2fs_write_data_pages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ struct inode *inode = mapping->host;
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ int ret;
+ long excess_nrtw = 0, desired_nrtw;
+
+ if (wbc->nr_to_write < MAX_DESIRED_PAGES_WP) {
+ desired_nrtw = MAX_DESIRED_PAGES_WP;
+ excess_nrtw = desired_nrtw - wbc->nr_to_write;
+ wbc->nr_to_write = desired_nrtw;
+ }
+
+ if (!S_ISDIR(inode->i_mode))
+ mutex_lock(&sbi->writepages);
+ ret = generic_writepages(mapping, wbc);
+ if (!S_ISDIR(inode->i_mode))
+ mutex_unlock(&sbi->writepages);
+ f2fs_submit_bio(sbi, DATA, (wbc->sync_mode == WB_SYNC_ALL));
+
+ remove_dirty_dir_inode(inode);
+
+ wbc->nr_to_write -= excess_nrtw;
+ return ret;
+}
+
+static int f2fs_write_begin(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata)
+{
+ struct inode *inode = mapping->host;
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ struct page *page;
+ pgoff_t index = ((unsigned long long) pos) >> PAGE_CACHE_SHIFT;
+ struct dnode_of_data dn;
+ int err = 0;
+
+ /* for nobh_write_end */
+ *fsdata = NULL;
+
+ f2fs_balance_fs(sbi);
+
+ page = grab_cache_page_write_begin(mapping, index, flags);
+ if (!page)
+ return -ENOMEM;
+ *pagep = page;
+
+ mutex_lock_op(sbi, DATA_NEW);
+
+ set_new_dnode(&dn, inode, NULL, NULL, 0);
+ err = get_dnode_of_data(&dn, index, 0);
+ if (err) {
+ mutex_unlock_op(sbi, DATA_NEW);
+ f2fs_put_page(page, 1);
+ return err;
+ }
+
+ if (dn.data_blkaddr == NULL_ADDR) {
+ err = reserve_new_block(&dn);
+ if (err) {
+ f2fs_put_dnode(&dn);
+ mutex_unlock_op(sbi, DATA_NEW);
+ f2fs_put_page(page, 1);
+ return err;
+ }
+ }
+ f2fs_put_dnode(&dn);
+
+ mutex_unlock_op(sbi, DATA_NEW);
+
+ if ((len == PAGE_CACHE_SIZE) || PageUptodate(page))
+ return 0;
+
+ if ((pos & PAGE_CACHE_MASK) >= i_size_read(inode)) {
+ unsigned start = pos & (PAGE_CACHE_SIZE - 1);
+ unsigned end = start + len;
+
+ /* Reading beyond i_size is simple: memset to zero */
+ zero_user_segments(page, 0, start, end, PAGE_CACHE_SIZE);
+ return 0;
+ }
+
+ if (dn.data_blkaddr == NEW_ADDR) {
+ zero_user_segment(page, 0, PAGE_CACHE_SIZE);
+ } else {
+ err = f2fs_readpage(sbi, page, dn.data_blkaddr, READ_SYNC);
+ if (err) {
+ f2fs_put_page(page, 1);
+ return err;
+ }
+ }
+ SetPageUptodate(page);
+ clear_cold_data(page);
+ return 0;
+}
+
+static ssize_t f2fs_direct_IO(int rw, struct kiocb *iocb,
+ const struct iovec *iov, loff_t offset, unsigned long nr_segs)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file->f_mapping->host;
+
+ if (rw == WRITE)
+ return 0;
+
+ /* Needs synchronization with the cleaner */
+ return blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+ get_data_block_ro);
+}
+
+static void f2fs_invalidate_data_page(struct page *page, unsigned long offset)
+{
+ struct inode *inode = page->mapping->host;
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ if (S_ISDIR(inode->i_mode) && PageDirty(page)) {
+ dec_page_count(sbi, F2FS_DIRTY_DENTS);
+ inode_dec_dirty_dents(inode);
+ }
+ ClearPagePrivate(page);
+}
+
+static int f2fs_release_data_page(struct page *page, gfp_t wait)
+{
+ ClearPagePrivate(page);
+ return 0;
+}
+
+static int f2fs_set_data_page_dirty(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct inode *inode = mapping->host;
+
+ SetPageUptodate(page);
+ if (!PageDirty(page)) {
+ __set_page_dirty_nobuffers(page);
+ set_dirty_dir_page(inode, page);
+ return 1;
+ }
+ return 0;
+}
+
+const struct address_space_operations f2fs_dblock_aops = {
+ .readpage = f2fs_read_data_page,
+ .readpages = f2fs_read_data_pages,
+ .writepage = f2fs_write_data_page,
+ .writepages = f2fs_write_data_pages,
+ .write_begin = f2fs_write_begin,
+ .write_end = nobh_write_end,
+ .set_page_dirty = f2fs_set_data_page_dirty,
+ .invalidatepage = f2fs_invalidate_data_page,
+ .releasepage = f2fs_release_data_page,
+ .direct_IO = f2fs_direct_IO,
+};
--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-10-31 09:46:14

[permalink] [raw]

Subject: [PATCH 10/17] f2fs: add core inode operations

This adds core functions to get, read, write, and evict an inode.

Signed-off-by: Changman Lee <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/inode.c | 266 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 266 insertions(+)
create mode 100644 fs/f2fs/inode.c

diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
new file mode 100644
index 0000000..94f13d2
--- /dev/null
+++ b/fs/f2fs/inode.c
@@ -0,0 +1,266 @@
+/**
+ * fs/f2fs/inode.c
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/fs.h>
+#include <linux/f2fs_fs.h>
+#include <linux/buffer_head.h>
+#include <linux/writeback.h>
+
+#include "f2fs.h"
+#include "node.h"
+
+struct f2fs_iget_args {
+ u64 ino;
+ int on_free;
+};
+
+void f2fs_set_inode_flags(struct inode *inode)
+{
+ unsigned int flags = F2FS_I(inode)->i_flags;
+
+ inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE |
+ S_NOATIME | S_DIRSYNC);
+
+ if (flags & FS_SYNC_FL)
+ inode->i_flags |= S_SYNC;
+ if (flags & FS_APPEND_FL)
+ inode->i_flags |= S_APPEND;
+ if (flags & FS_IMMUTABLE_FL)
+ inode->i_flags |= S_IMMUTABLE;
+ if (flags & FS_NOATIME_FL)
+ inode->i_flags |= S_NOATIME;
+ if (flags & FS_DIRSYNC_FL)
+ inode->i_flags |= S_DIRSYNC;
+}
+
+static int f2fs_iget_test(struct inode *inode, void *data)
+{
+ struct f2fs_iget_args *args = data;
+
+ if (inode->i_ino != args->ino)
+ return 0;
+ if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
+ args->on_free = 1;
+ return 0;
+ }
+ return 1;
+}
+
+struct inode *f2fs_iget_nowait(struct super_block *sb, unsigned long ino)
+{
+ struct f2fs_iget_args args = {
+ .ino = ino,
+ .on_free = 0
+ };
+ struct inode *inode = ilookup5(sb, ino, f2fs_iget_test, &args);
+
+ if (inode)
+ return inode;
+ if (!args.on_free)
+ return f2fs_iget(sb, ino);
+ return ERR_PTR(-ENOENT);
+}
+
+static int do_read_inode(struct inode *inode)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ struct f2fs_inode_info *fi = F2FS_I(inode);
+ struct page *node_page;
+ struct f2fs_node *rn;
+ struct f2fs_inode *ri;
+
+ /* Check if ino is within scope */
+ check_nid_range(sbi, inode->i_ino);
+
+ node_page = get_node_page(sbi, inode->i_ino);
+ if (IS_ERR(node_page))
+ return PTR_ERR(node_page);
+
+ rn = page_address(node_page);
+ ri = &(rn->i);
+
+ inode->i_mode = le16_to_cpu(ri->i_mode);
+ i_uid_write(inode, le32_to_cpu(ri->i_uid));
+ i_gid_write(inode, le32_to_cpu(ri->i_gid));
+ set_nlink(inode, le32_to_cpu(ri->i_links));
+ inode->i_size = le64_to_cpu(ri->i_size);
+ inode->i_blocks = le64_to_cpu(ri->i_blocks);
+
+ inode->i_atime.tv_sec = le64_to_cpu(ri->i_atime);
+ inode->i_ctime.tv_sec = le64_to_cpu(ri->i_ctime);
+ inode->i_mtime.tv_sec = le64_to_cpu(ri->i_mtime);
+ inode->i_atime.tv_nsec = le32_to_cpu(ri->i_atime_nsec);
+ inode->i_ctime.tv_nsec = le32_to_cpu(ri->i_ctime_nsec);
+ inode->i_mtime.tv_nsec = le32_to_cpu(ri->i_mtime_nsec);
+ inode->i_generation = le32_to_cpu(ri->i_generation);
+
+ fi->i_current_depth = le32_to_cpu(ri->i_current_depth);
+ fi->i_xattr_nid = le32_to_cpu(ri->i_xattr_nid);
+ fi->i_flags = le32_to_cpu(ri->i_flags);
+ fi->flags = 0;
+ fi->data_version = le64_to_cpu(F2FS_CKPT(sbi)->checkpoint_ver) - 1;
+ fi->i_advise = ri->i_advise;
+ get_extent_info(&fi->ext, ri->i_ext);
+ f2fs_put_page(node_page, 1);
+ return 0;
+}
+
+struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(sb);
+ struct inode *inode;
+ int ret;
+
+ inode = iget_locked(sb, ino);
+ if (!inode)
+ return ERR_PTR(-ENOMEM);
+ if (!(inode->i_state & I_NEW))
+ return inode;
+ if (ino == F2FS_NODE_INO(sbi) || ino == F2FS_META_INO(sbi))
+ goto make_now;
+
+ ret = do_read_inode(inode);
+ if (ret)
+ goto bad_inode;
+
+ if (!sbi->por_doing && inode->i_nlink == 0) {
+ ret = -ENOENT;
+ goto bad_inode;
+ }
+
+make_now:
+ if (ino == F2FS_NODE_INO(sbi)) {
+ inode->i_mapping->a_ops = &f2fs_node_aops;
+ mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
+ } else if (ino == F2FS_META_INO(sbi)) {
+ inode->i_mapping->a_ops = &f2fs_meta_aops;
+ mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
+ } else if (S_ISREG(inode->i_mode)) {
+ inode->i_op = &f2fs_file_inode_operations;
+ inode->i_fop = &f2fs_file_operations;
+ inode->i_mapping->a_ops = &f2fs_dblock_aops;
+ } else if (S_ISDIR(inode->i_mode)) {
+ inode->i_op = &f2fs_dir_inode_operations;
+ inode->i_fop = &f2fs_dir_operations;
+ inode->i_mapping->a_ops = &f2fs_dblock_aops;
+ mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER_MOVABLE |
+ __GFP_ZERO);
+ } else if (S_ISLNK(inode->i_mode)) {
+ inode->i_op = &f2fs_symlink_inode_operations;
+ inode->i_mapping->a_ops = &f2fs_dblock_aops;
+ } else if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode) ||
+ S_ISFIFO(inode->i_mode) || S_ISSOCK(inode->i_mode)) {
+ inode->i_op = &f2fs_special_inode_operations;
+ init_special_inode(inode, inode->i_mode, inode->i_rdev);
+ } else {
+ ret = -EIO;
+ goto bad_inode;
+ }
+ unlock_new_inode(inode);
+
+ return inode;
+
+bad_inode:
+ iget_failed(inode);
+ return ERR_PTR(ret);
+}
+
+void update_inode(struct inode *inode, struct page *node_page)
+{
+ struct f2fs_node *rn;
+ struct f2fs_inode *ri;
+
+ wait_on_page_writeback(node_page);
+
+ rn = page_address(node_page);
+ ri = &(rn->i);
+
+ ri->i_mode = cpu_to_le16(inode->i_mode);
+ ri->i_advise = F2FS_I(inode)->i_advise;
+ ri->i_uid = cpu_to_le32(i_uid_read(inode));
+ ri->i_gid = cpu_to_le32(i_gid_read(inode));
+ ri->i_links = cpu_to_le32(inode->i_nlink);
+ ri->i_size = cpu_to_le64(i_size_read(inode));
+ ri->i_blocks = cpu_to_le64(inode->i_blocks);
+ set_raw_extent(&F2FS_I(inode)->ext, &ri->i_ext);
+
+ ri->i_atime = cpu_to_le64(inode->i_atime.tv_sec);
+ ri->i_ctime = cpu_to_le64(inode->i_ctime.tv_sec);
+ ri->i_mtime = cpu_to_le64(inode->i_mtime.tv_sec);
+ ri->i_atime_nsec = cpu_to_le32(inode->i_atime.tv_nsec);
+ ri->i_ctime_nsec = cpu_to_le32(inode->i_ctime.tv_nsec);
+ ri->i_mtime_nsec = cpu_to_le32(inode->i_mtime.tv_nsec);
+ ri->i_current_depth = cpu_to_le32(F2FS_I(inode)->i_current_depth);
+ ri->i_xattr_nid = cpu_to_le32(F2FS_I(inode)->i_xattr_nid);
+ ri->i_flags = cpu_to_le32(F2FS_I(inode)->i_flags);
+ ri->i_generation = cpu_to_le32(inode->i_generation);
+ set_page_dirty(node_page);
+}
+
+int f2fs_write_inode(struct inode *inode, struct writeback_control *wbc)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ struct page *node_page;
+ bool need_lock = false;
+
+ if (inode->i_ino == F2FS_NODE_INO(sbi) ||
+ inode->i_ino == F2FS_META_INO(sbi))
+ return 0;
+
+ node_page = get_node_page(sbi, inode->i_ino);
+ if (IS_ERR(node_page))
+ return PTR_ERR(node_page);
+
+ if (!PageDirty(node_page)) {
+ need_lock = true;
+ f2fs_put_page(node_page, 1);
+ mutex_lock(&sbi->write_inode);
+ node_page = get_node_page(sbi, inode->i_ino);
+ if (IS_ERR(node_page)) {
+ mutex_unlock(&sbi->write_inode);
+ return PTR_ERR(node_page);
+ }
+ }
+ update_inode(inode, node_page);
+ f2fs_put_page(node_page, 1);
+ if (need_lock)
+ mutex_unlock(&sbi->write_inode);
+ return 0;
+}
+
+/**
+ * Called at the last iput() if i_nlink is zero
+ */
+void f2fs_evict_inode(struct inode *inode)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+
+ truncate_inode_pages(&inode->i_data, 0);
+
+ if (inode->i_ino == F2FS_NODE_INO(sbi) ||
+ inode->i_ino == F2FS_META_INO(sbi))
+ goto no_delete;
+
+ BUG_ON(atomic_read(&F2FS_I(inode)->dirty_dents));
+ remove_dirty_dir_inode(inode);
+
+ if (inode->i_nlink || is_bad_inode(inode))
+ goto no_delete;
+
+ set_inode_flag(F2FS_I(inode), FI_NO_ALLOC);
+ i_size_write(inode, 0);
+
+ if (F2FS_HAS_BLOCKS(inode))
+ f2fs_truncate(inode);
+
+ remove_inode_page(inode);
+no_delete:
+ clear_inode(inode);
+}
--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-10-31 09:47:07

[permalink] [raw]

Subject: [PATCH 11/17] f2fs: add inode operations for special inodes

This adds inode operations for directory, symlink, and special inodes.

Signed-off-by: Changman Lee <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/namei.c | 504 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 504 insertions(+)
create mode 100644 fs/f2fs/namei.c

diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
new file mode 100644
index 0000000..aec362f
--- /dev/null
+++ b/fs/f2fs/namei.c
@@ -0,0 +1,504 @@
+/**
+ * fs/f2fs/namei.c
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/fs.h>
+#include <linux/f2fs_fs.h>
+#include <linux/pagemap.h>
+#include <linux/sched.h>
+#include <linux/ctype.h>
+
+#include "f2fs.h"
+#include "xattr.h"
+#include "acl.h"
+
+static struct inode *f2fs_new_inode(struct inode *dir, umode_t mode)
+{
+ struct super_block *sb = dir->i_sb;
+ struct f2fs_sb_info *sbi = F2FS_SB(sb);
+ nid_t ino;
+ struct inode *inode;
+ bool nid_free = false;
+ int err;
+
+ inode = new_inode(sb);
+ if (!inode)
+ return ERR_PTR(-ENOMEM);
+
+ mutex_lock_op(sbi, NODE_NEW);
+ if (!alloc_nid(sbi, &ino)) {
+ mutex_unlock_op(sbi, NODE_NEW);
+ err = -ENOSPC;
+ goto fail;
+ }
+ mutex_unlock_op(sbi, NODE_NEW);
+
+ inode->i_uid = current_fsuid();
+
+ if (dir->i_mode & S_ISGID) {
+ inode->i_gid = dir->i_gid;
+ if (S_ISDIR(mode))
+ mode |= S_ISGID;
+ } else {
+ inode->i_gid = current_fsgid();
+ }
+
+ inode->i_ino = ino;
+ inode->i_mode = mode;
+ inode->i_blocks = 0;
+ inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
+ inode->i_generation = sbi->s_next_generation++;
+
+ err = insert_inode_locked(inode);
+ if (err) {
+ err = -EINVAL;
+ nid_free = true;
+ goto out;
+ }
+
+ mark_inode_dirty(inode);
+ return inode;
+
+out:
+ clear_nlink(inode);
+ unlock_new_inode(inode);
+fail:
+ iput(inode);
+ if (nid_free)
+ alloc_nid_failed(sbi, ino);
+ return ERR_PTR(err);
+}
+
+static int is_multimedia_file(const unsigned char *s, const char *sub)
+{
+ int slen = strlen(s);
+ int sublen = strlen(sub);
+ int ret;
+
+ if (sublen > slen)
+ return 1;
+
+ ret = memcmp(s + slen - sublen, sub, sublen);
+ if (ret) { /* compare upper case */
+ int i;
+ char upper_sub[8];
+ for (i = 0; i < sublen && i < sizeof(upper_sub); i++)
+ upper_sub[i] = toupper(sub[i]);
+ return memcmp(s + slen - sublen, upper_sub, sublen);
+ }
+
+ return ret;
+}
+
+/**
+ * Set multimedia files as cold files for hot/cold data separation
+ */
+static inline void set_cold_file(struct f2fs_sb_info *sbi, struct inode *inode,
+ const unsigned char *name)
+{
+ int i;
+ __u8 (*extlist)[8] = sbi->raw_super->extension_list;
+
+ int count = le32_to_cpu(sbi->raw_super->extension_count);
+ for (i = 0; i < count; i++) {
+ if (!is_multimedia_file(name, extlist[i])) {
+ F2FS_I(inode)->i_advise |= FADVISE_COLD_BIT;
+ break;
+ }
+ }
+}
+
+static int f2fs_create(struct inode *dir, struct dentry *dentry, umode_t mode,
+ bool excl)
+{
+ struct super_block *sb = dir->i_sb;
+ struct f2fs_sb_info *sbi = F2FS_SB(sb);
+ struct inode *inode;
+ nid_t ino = 0;
+ int err;
+
+ inode = f2fs_new_inode(dir, mode);
+ if (IS_ERR(inode))
+ return PTR_ERR(inode);
+
+ if (!test_opt(sbi, DISABLE_EXT_IDENTIFY))
+ set_cold_file(sbi, inode, dentry->d_name.name);
+
+ inode->i_op = &f2fs_file_inode_operations;
+ inode->i_fop = &f2fs_file_operations;
+ inode->i_mapping->a_ops = &f2fs_dblock_aops;
+ ino = inode->i_ino;
+
+ err = f2fs_add_link(dentry, inode);
+ if (err)
+ goto out;
+
+ alloc_nid_done(sbi, ino);
+
+ if (!sbi->por_doing)
+ d_instantiate(dentry, inode);
+ unlock_new_inode(inode);
+
+ f2fs_balance_fs(sbi);
+ return 0;
+out:
+ clear_nlink(inode);
+ unlock_new_inode(inode);
+ iput(inode);
+ alloc_nid_failed(sbi, ino);
+ return err;
+}
+
+static int f2fs_link(struct dentry *old_dentry, struct inode *dir,
+ struct dentry *dentry)
+{
+ struct inode *inode = old_dentry->d_inode;
+ struct super_block *sb = dir->i_sb;
+ struct f2fs_sb_info *sbi = F2FS_SB(sb);
+ int err;
+
+ inode->i_ctime = CURRENT_TIME;
+ atomic_inc(&inode->i_count);
+
+ set_inode_flag(F2FS_I(inode), FI_INC_LINK);
+ err = f2fs_add_link(dentry, inode);
+ if (err)
+ goto out;
+
+ d_instantiate(dentry, inode);
+
+ f2fs_balance_fs(sbi);
+ return 0;
+out:
+ clear_inode_flag(F2FS_I(inode), FI_INC_LINK);
+ iput(inode);
+ return err;
+}
+
+struct dentry *f2fs_get_parent(struct dentry *child)
+{
+ struct qstr dotdot = QSTR_INIT("..", 2);
+ unsigned long ino = f2fs_inode_by_name(child->d_inode, &dotdot);
+ if (!ino)
+ return ERR_PTR(-ENOENT);
+ return d_obtain_alias(f2fs_iget(child->d_inode->i_sb, ino));
+}
+
+static struct dentry *f2fs_lookup(struct inode *dir, struct dentry *dentry,
+ unsigned int flags)
+{
+ struct inode *inode = NULL;
+ struct f2fs_dir_entry *de;
+ struct page *page;
+
+ if (dentry->d_name.len > F2FS_MAX_NAME_LEN)
+ return ERR_PTR(-ENAMETOOLONG);
+
+ de = f2fs_find_entry(dir, &dentry->d_name, &page);
+ if (de) {
+ nid_t ino = le32_to_cpu(de->ino);
+ kunmap(page);
+ f2fs_put_page(page, 0);
+
+ inode = f2fs_iget(dir->i_sb, ino);
+ if (IS_ERR(inode))
+ return ERR_CAST(inode);
+ }
+
+ return d_splice_alias(inode, dentry);
+}
+
+static int f2fs_unlink(struct inode *dir, struct dentry *dentry)
+{
+ struct super_block *sb = dir->i_sb;
+ struct f2fs_sb_info *sbi = F2FS_SB(sb);
+ struct inode *inode = dentry->d_inode;
+ struct f2fs_dir_entry *de;
+ struct page *page;
+ int err = -ENOENT;
+
+ de = f2fs_find_entry(dir, &dentry->d_name, &page);
+ if (!de)
+ goto fail;
+
+ err = check_orphan_space(sbi);
+ if (err) {
+ kunmap(page);
+ f2fs_put_page(page, 0);
+ goto fail;
+ }
+
+ f2fs_delete_entry(de, page, inode);
+
+ /* In order to evict this inode, we set it dirty */
+ mark_inode_dirty(inode);
+ f2fs_balance_fs(sbi);
+fail:
+ return err;
+}
+
+static int f2fs_symlink(struct inode *dir, struct dentry *dentry,
+ const char *symname)
+{
+ struct super_block *sb = dir->i_sb;
+ struct f2fs_sb_info *sbi = F2FS_SB(sb);
+ struct inode *inode;
+ unsigned symlen = strlen(symname) + 1;
+ int err;
+
+ inode = f2fs_new_inode(dir, S_IFLNK | S_IRWXUGO);
+ if (IS_ERR(inode))
+ return PTR_ERR(inode);
+
+ inode->i_op = &f2fs_symlink_inode_operations;
+ inode->i_mapping->a_ops = &f2fs_dblock_aops;
+
+ err = f2fs_add_link(dentry, inode);
+ if (err)
+ goto out;
+
+ err = page_symlink(inode, symname, symlen);
+ alloc_nid_done(sbi, inode->i_ino);
+
+ d_instantiate(dentry, inode);
+ unlock_new_inode(inode);
+
+ f2fs_balance_fs(sbi);
+
+ return err;
+out:
+ clear_nlink(inode);
+ unlock_new_inode(inode);
+ iput(inode);
+ alloc_nid_failed(sbi, inode->i_ino);
+ return err;
+}
+
+static int f2fs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(dir->i_sb);
+ struct inode *inode;
+ int err;
+
+ inode = f2fs_new_inode(dir, S_IFDIR | mode);
+ err = PTR_ERR(inode);
+ if (IS_ERR(inode))
+ return err;
+
+ inode->i_op = &f2fs_dir_inode_operations;
+ inode->i_fop = &f2fs_dir_operations;
+ inode->i_mapping->a_ops = &f2fs_dblock_aops;
+ mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS | __GFP_ZERO);
+
+ set_inode_flag(F2FS_I(inode), FI_INC_LINK);
+ err = f2fs_add_link(dentry, inode);
+ if (err)
+ goto out_fail;
+
+ alloc_nid_done(sbi, inode->i_ino);
+
+ d_instantiate(dentry, inode);
+ unlock_new_inode(inode);
+
+ f2fs_balance_fs(sbi);
+ return 0;
+
+out_fail:
+ clear_inode_flag(F2FS_I(inode), FI_INC_LINK);
+ clear_nlink(inode);
+ unlock_new_inode(inode);
+ iput(inode);
+ alloc_nid_failed(sbi, inode->i_ino);
+ return err;
+}
+
+static int f2fs_rmdir(struct inode *dir, struct dentry *dentry)
+{
+ struct inode *inode = dentry->d_inode;
+ if (f2fs_empty_dir(inode))
+ return f2fs_unlink(dir, dentry);
+ return -ENOTEMPTY;
+}
+
+static int f2fs_mknod(struct inode *dir, struct dentry *dentry,
+ umode_t mode, dev_t rdev)
+{
+ struct super_block *sb = dir->i_sb;
+ struct f2fs_sb_info *sbi = F2FS_SB(sb);
+ struct inode *inode;
+ int err = 0;
+
+ if (!new_valid_dev(rdev))
+ return -EINVAL;
+
+ inode = f2fs_new_inode(dir, mode);
+ if (IS_ERR(inode))
+ return PTR_ERR(inode);
+
+ init_special_inode(inode, inode->i_mode, rdev);
+ inode->i_op = &f2fs_special_inode_operations;
+
+ err = f2fs_add_link(dentry, inode);
+ if (err)
+ goto out;
+
+ alloc_nid_done(sbi, inode->i_ino);
+ d_instantiate(dentry, inode);
+ unlock_new_inode(inode);
+
+ f2fs_balance_fs(sbi);
+
+ return 0;
+out:
+ clear_nlink(inode);
+ unlock_new_inode(inode);
+ iput(inode);
+ alloc_nid_failed(sbi, inode->i_ino);
+ return err;
+}
+
+static int f2fs_rename(struct inode *old_dir, struct dentry *old_dentry,
+ struct inode *new_dir, struct dentry *new_dentry)
+{
+ struct super_block *sb = old_dir->i_sb;
+ struct f2fs_sb_info *sbi = F2FS_SB(sb);
+ struct inode *old_inode = old_dentry->d_inode;
+ struct inode *new_inode = new_dentry->d_inode;
+ struct page *old_dir_page;
+ struct page *old_page;
+ struct f2fs_dir_entry *old_dir_entry = NULL;
+ struct f2fs_dir_entry *old_entry;
+ struct f2fs_dir_entry *new_entry;
+ int err = -ENOENT;
+
+ old_entry = f2fs_find_entry(old_dir, &old_dentry->d_name, &old_page);
+ if (!old_entry)
+ goto out;
+
+ if (S_ISDIR(old_inode->i_mode)) {
+ err = -EIO;
+ old_dir_entry = f2fs_parent_dir(old_inode, &old_dir_page);
+ if (!old_dir_entry)
+ goto out_old;
+ }
+
+ mutex_lock_op(sbi, RENAME);
+
+ if (new_inode) {
+ struct page *new_page;
+
+ err = -ENOTEMPTY;
+ if (old_dir_entry && !f2fs_empty_dir(new_inode))
+ goto out_dir;
+
+ err = -ENOENT;
+ new_entry = f2fs_find_entry(new_dir, &new_dentry->d_name,
+ &new_page);
+ if (!new_entry)
+ goto out_dir;
+
+ f2fs_set_link(new_dir, new_entry, new_page, old_inode);
+
+ new_inode->i_ctime = CURRENT_TIME;
+ if (old_dir_entry)
+ drop_nlink(new_inode);
+ drop_nlink(new_inode);
+ if (!new_inode->i_nlink)
+ add_orphan_inode(sbi, new_inode->i_ino);
+ f2fs_write_inode(new_inode, NULL);
+ } else {
+ err = f2fs_add_link(new_dentry, old_inode);
+ if (err)
+ goto out_dir;
+
+ if (old_dir_entry) {
+ inc_nlink(new_dir);
+ f2fs_write_inode(new_dir, NULL);
+ }
+ }
+
+ old_inode->i_ctime = CURRENT_TIME;
+ set_inode_flag(F2FS_I(old_inode), FI_NEED_CP);
+ mark_inode_dirty(old_inode);
+
+ f2fs_delete_entry(old_entry, old_page, NULL);
+
+ if (old_dir_entry) {
+ if (old_dir != new_dir) {
+ f2fs_set_link(old_inode, old_dir_entry,
+ old_dir_page, new_dir);
+ } else {
+ kunmap(old_dir_page);
+ f2fs_put_page(old_dir_page, 0);
+ }
+ drop_nlink(old_dir);
+ f2fs_write_inode(old_dir, NULL);
+ }
+
+ mutex_unlock_op(sbi, RENAME);
+
+ f2fs_balance_fs(sbi);
+ return 0;
+
+out_dir:
+ if (old_dir_entry) {
+ kunmap(old_dir_page);
+ f2fs_put_page(old_dir_page, 0);
+ }
+ mutex_unlock_op(sbi, RENAME);
+out_old:
+ kunmap(old_page);
+ f2fs_put_page(old_page, 0);
+out:
+ return err;
+}
+
+const struct inode_operations f2fs_dir_inode_operations = {
+ .create = f2fs_create,
+ .lookup = f2fs_lookup,
+ .link = f2fs_link,
+ .unlink = f2fs_unlink,
+ .symlink = f2fs_symlink,
+ .mkdir = f2fs_mkdir,
+ .rmdir = f2fs_rmdir,
+ .mknod = f2fs_mknod,
+ .rename = f2fs_rename,
+ .setattr = f2fs_setattr,
+ .get_acl = f2fs_get_acl,
+#ifdef CONFIG_F2FS_FS_XATTR
+ .setxattr = generic_setxattr,
+ .getxattr = generic_getxattr,
+ .listxattr = f2fs_listxattr,
+ .removexattr = generic_removexattr,
+#endif
+};
+
+const struct inode_operations f2fs_symlink_inode_operations = {
+ .readlink = generic_readlink,
+ .follow_link = page_follow_link_light,
+ .put_link = page_put_link,
+ .setattr = f2fs_setattr,
+#ifdef CONFIG_F2FS_FS_XATTR
+ .setxattr = generic_setxattr,
+ .getxattr = generic_getxattr,
+ .listxattr = f2fs_listxattr,
+ .removexattr = generic_removexattr,
+#endif
+};
+
+const struct inode_operations f2fs_special_inode_operations = {
+ .setattr = f2fs_setattr,
+ .get_acl = f2fs_get_acl,
+#ifdef CONFIG_F2FS_FS_XATTR
+ .setxattr = generic_setxattr,
+ .getxattr = generic_getxattr,
+ .listxattr = f2fs_listxattr,
+ .removexattr = generic_removexattr,
+#endif
+};
--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-10-31 09:47:33

[permalink] [raw]

Subject: [PATCH 12/17] f2fs: add core directory operations

This adds core functions to find, add, delete, and link dentries.

Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/dir.c | 674 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/f2fs/hash.c | 98 ++++++++
2 files changed, 772 insertions(+)
create mode 100644 fs/f2fs/dir.c
create mode 100644 fs/f2fs/hash.c

diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c
new file mode 100644
index 0000000..94eddf2
--- /dev/null
+++ b/fs/f2fs/dir.c
@@ -0,0 +1,674 @@
+/**
+ * fs/f2fs/dir.c
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/fs.h>
+#include <linux/f2fs_fs.h>
+#include "f2fs.h"
+#include "acl.h"
+
+static unsigned long dir_blocks(struct inode *inode)
+{
+ return ((unsigned long long) (i_size_read(inode) + PAGE_CACHE_SIZE - 1))
+ >> PAGE_CACHE_SHIFT;
+}
+
+static unsigned int dir_buckets(unsigned int level)
+{
+ if (level < MAX_DIR_HASH_DEPTH / 2)
+ return 1 << level;
+ else
+ return 1 << ((MAX_DIR_HASH_DEPTH / 2) - 1);
+}
+
+static unsigned int bucket_blocks(unsigned int level)
+{
+ if (level < MAX_DIR_HASH_DEPTH / 2)
+ return 2;
+ else
+ return 4;
+}
+
+static unsigned char f2fs_filetype_table[F2FS_FT_MAX] = {
+ [F2FS_FT_UNKNOWN] = DT_UNKNOWN,
+ [F2FS_FT_REG_FILE] = DT_REG,
+ [F2FS_FT_DIR] = DT_DIR,
+ [F2FS_FT_CHRDEV] = DT_CHR,
+ [F2FS_FT_BLKDEV] = DT_BLK,
+ [F2FS_FT_FIFO] = DT_FIFO,
+ [F2FS_FT_SOCK] = DT_SOCK,
+ [F2FS_FT_SYMLINK] = DT_LNK,
+};
+
+#define S_SHIFT 12
+static unsigned char f2fs_type_by_mode[S_IFMT >> S_SHIFT] = {
+ [S_IFREG >> S_SHIFT] = F2FS_FT_REG_FILE,
+ [S_IFDIR >> S_SHIFT] = F2FS_FT_DIR,
+ [S_IFCHR >> S_SHIFT] = F2FS_FT_CHRDEV,
+ [S_IFBLK >> S_SHIFT] = F2FS_FT_BLKDEV,
+ [S_IFIFO >> S_SHIFT] = F2FS_FT_FIFO,
+ [S_IFSOCK >> S_SHIFT] = F2FS_FT_SOCK,
+ [S_IFLNK >> S_SHIFT] = F2FS_FT_SYMLINK,
+};
+
+static void set_de_type(struct f2fs_dir_entry *de, struct inode *inode)
+{
+ mode_t mode = inode->i_mode;
+ de->file_type = f2fs_type_by_mode[(mode & S_IFMT) >> S_SHIFT];
+}
+
+static unsigned long dir_block_index(unsigned int level, unsigned int idx)
+{
+ unsigned long i;
+ unsigned long bidx = 0;
+
+ for (i = 0; i < level; i++)
+ bidx += dir_buckets(i) * bucket_blocks(i);
+ bidx += idx * bucket_blocks(level);
+ return bidx;
+}
+
+static bool early_match_name(const char *name, int namelen,
+ f2fs_hash_t namehash, struct f2fs_dir_entry *de)
+{
+ if (le16_to_cpu(de->name_len) != namelen)
+ return false;
+
+ if (le32_to_cpu(de->hash_code) != namehash)
+ return false;
+
+ return true;
+}
+
+static struct f2fs_dir_entry *find_in_block(struct page *dentry_page,
+ const char *name, int namelen, int *max_slots,
+ f2fs_hash_t namehash, struct page **res_page)
+{
+ struct f2fs_dir_entry *de;
+ unsigned long bit_pos, end_pos, next_pos;
+ struct f2fs_dentry_block *dentry_blk = kmap(dentry_page);
+ int slots;
+
+ bit_pos = find_next_bit_le(&dentry_blk->dentry_bitmap,
+ NR_DENTRY_IN_BLOCK, 0);
+ while (bit_pos < NR_DENTRY_IN_BLOCK) {
+ de = &dentry_blk->dentry[bit_pos];
+ slots = (le16_to_cpu(de->name_len) + F2FS_NAME_LEN - 1) /
+ F2FS_NAME_LEN;
+
+ if (early_match_name(name, namelen, namehash, de)) {
+ if (!memcmp(dentry_blk->filename[bit_pos],
+ name, namelen)) {
+ *res_page = dentry_page;
+ goto found;
+ }
+ }
+ next_pos = bit_pos + slots;
+ bit_pos = find_next_bit_le(&dentry_blk->dentry_bitmap,
+ NR_DENTRY_IN_BLOCK, next_pos);
+ if (bit_pos >= NR_DENTRY_IN_BLOCK)
+ end_pos = NR_DENTRY_IN_BLOCK;
+ else
+ end_pos = bit_pos;
+ if (*max_slots < end_pos - next_pos)
+ *max_slots = end_pos - next_pos;
+ }
+
+ de = NULL;
+ kunmap(dentry_page);
+found:
+ return de;
+}
+
+static struct f2fs_dir_entry *find_in_level(struct inode *dir,
+ unsigned int level, const char *name, int namelen,
+ f2fs_hash_t namehash, struct page **res_page)
+{
+ int s = (namelen + F2FS_NAME_LEN - 1) / F2FS_NAME_LEN;
+ unsigned int nbucket, nblock;
+ unsigned int bidx, end_block;
+ struct page *dentry_page;
+ struct f2fs_dir_entry *de = NULL;
+ bool room = false;
+ int max_slots = 0;
+
+ BUG_ON(level > MAX_DIR_HASH_DEPTH);
+
+ nbucket = dir_buckets(level);
+ nblock = bucket_blocks(level);
+
+ bidx = dir_block_index(level, namehash % nbucket);
+ end_block = bidx + nblock;
+
+ for (; bidx < end_block; bidx++) {
+ /* no need to allocate new dentry pages to all the indices */
+ dentry_page = find_data_page(dir, bidx);
+ if (IS_ERR(dentry_page)) {
+ room = true;
+ continue;
+ }
+
+ de = find_in_block(dentry_page, name, namelen,
+ &max_slots, namehash, res_page);
+ if (de)
+ break;
+
+ if (max_slots >= s)
+ room = true;
+ f2fs_put_page(dentry_page, 0);
+ }
+
+ if (!de && room && F2FS_I(dir)->chash != namehash) {
+ F2FS_I(dir)->chash = namehash;
+ F2FS_I(dir)->clevel = level;
+ }
+
+ return de;
+}
+
+/*
+ * Find an entry in the specified directory with the wanted name.
+ * It returns the page where the entry was found (as a parameter - res_page),
+ * and the entry itself. Page is returned mapped and unlocked.
+ * Entry is guaranteed to be valid.
+ */
+struct f2fs_dir_entry *f2fs_find_entry(struct inode *dir,
+ struct qstr *child, struct page **res_page)
+{
+ const char *name = child->name;
+ int namelen = child->len;
+ unsigned long npages = dir_blocks(dir);
+ struct f2fs_dir_entry *de = NULL;
+ f2fs_hash_t name_hash;
+ unsigned int max_depth;
+ unsigned int level;
+
+ if (npages == 0)
+ return NULL;
+
+ *res_page = NULL;
+
+ name_hash = f2fs_dentry_hash(name, namelen);
+ max_depth = F2FS_I(dir)->i_current_depth;
+
+ for (level = 0; level < max_depth; level++) {
+ de = find_in_level(dir, level, name,
+ namelen, name_hash, res_page);
+ if (de)
+ break;
+ }
+ if (!de && F2FS_I(dir)->chash != name_hash) {
+ F2FS_I(dir)->chash = name_hash;
+ F2FS_I(dir)->clevel = level - 1;
+ }
+ return de;
+}
+
+struct f2fs_dir_entry *f2fs_parent_dir(struct inode *dir, struct page **p)
+{
+ struct page *page = NULL;
+ struct f2fs_dir_entry *de = NULL;
+ struct f2fs_dentry_block *dentry_blk = NULL;
+
+ page = get_lock_data_page(dir, 0);
+ if (IS_ERR(page))
+ return NULL;
+
+ dentry_blk = kmap(page);
+ de = &dentry_blk->dentry[1];
+ *p = page;
+ unlock_page(page);
+ return de;
+}
+
+ino_t f2fs_inode_by_name(struct inode *dir, struct qstr *qstr)
+{
+ ino_t res = 0;
+ struct f2fs_dir_entry *de;
+ struct page *page;
+
+ de = f2fs_find_entry(dir, qstr, &page);
+ if (de) {
+ res = le32_to_cpu(de->ino);
+ kunmap(page);
+ f2fs_put_page(page, 0);
+ }
+
+ return res;
+}
+
+void f2fs_set_link(struct inode *dir, struct f2fs_dir_entry *de,
+ struct page *page, struct inode *inode)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(dir->i_sb);
+
+ mutex_lock_op(sbi, DENTRY_OPS);
+ lock_page(page);
+ wait_on_page_writeback(page);
+ de->ino = cpu_to_le32(inode->i_ino);
+ set_de_type(de, inode);
+ kunmap(page);
+ set_page_dirty(page);
+ dir->i_mtime = dir->i_ctime = CURRENT_TIME;
+ mark_inode_dirty(dir);
+ f2fs_put_page(page, 1);
+ mutex_unlock_op(sbi, DENTRY_OPS);
+}
+
+void init_dent_inode(struct dentry *dentry, struct page *ipage)
+{
+ struct inode *dir = dentry->d_parent->d_inode;
+ struct f2fs_node *rn;
+
+ if (IS_ERR(ipage))
+ return;
+
+ wait_on_page_writeback(ipage);
+
+ /* copy dentry info. to this inode page */
+ rn = (struct f2fs_node *)page_address(ipage);
+ rn->i.i_pino = cpu_to_le32(dir->i_ino);
+ rn->i.i_namelen = cpu_to_le32(dentry->d_name.len);
+ memcpy(rn->i.i_name, dentry->d_name.name, dentry->d_name.len);
+ set_page_dirty(ipage);
+}
+
+static int init_inode_metadata(struct inode *inode, struct dentry *dentry)
+{
+ struct inode *dir = dentry->d_parent->d_inode;
+
+ if (is_inode_flag_set(F2FS_I(inode), FI_NEW_INODE)) {
+ int err;
+ err = new_inode_page(inode, dentry);
+ if (err)
+ return err;
+
+ if (S_ISDIR(inode->i_mode)) {
+ err = f2fs_make_empty(inode, dir);
+ if (err) {
+ remove_inode_page(inode);
+ return err;
+ }
+ }
+
+ err = f2fs_init_acl(inode, dir);
+ if (err) {
+ remove_inode_page(inode);
+ return err;
+ }
+ } else {
+ struct page *ipage;
+ ipage = get_node_page(F2FS_SB(dir->i_sb), inode->i_ino);
+ if (IS_ERR(ipage))
+ return PTR_ERR(ipage);
+ init_dent_inode(dentry, ipage);
+ f2fs_put_page(ipage, 1);
+ }
+ if (is_inode_flag_set(F2FS_I(inode), FI_INC_LINK)) {
+ inc_nlink(inode);
+ f2fs_write_inode(inode, NULL);
+ }
+ return 0;
+}
+
+static void update_parent_metadata(struct inode *dir, struct inode *inode,
+ unsigned int current_depth)
+{
+ bool need_dir_update = false;
+
+ if (is_inode_flag_set(F2FS_I(inode), FI_NEW_INODE)) {
+ if (S_ISDIR(inode->i_mode)) {
+ inc_nlink(dir);
+ need_dir_update = true;
+ }
+ clear_inode_flag(F2FS_I(inode), FI_NEW_INODE);
+ }
+ dir->i_mtime = dir->i_ctime = CURRENT_TIME;
+ if (F2FS_I(dir)->i_current_depth != current_depth) {
+ F2FS_I(dir)->i_current_depth = current_depth;
+ need_dir_update = true;
+ }
+
+ if (need_dir_update)
+ f2fs_write_inode(dir, NULL);
+ else
+ mark_inode_dirty(dir);
+
+ if (is_inode_flag_set(F2FS_I(inode), FI_INC_LINK))
+ clear_inode_flag(F2FS_I(inode), FI_INC_LINK);
+}
+
+static int room_for_filename(struct f2fs_dentry_block *dentry_blk, int slots)
+{
+ int bit_start = 0;
+ int zero_start, zero_end;
+next:
+ zero_start = find_next_zero_bit_le(&dentry_blk->dentry_bitmap,
+ NR_DENTRY_IN_BLOCK,
+ bit_start);
+ if (zero_start >= NR_DENTRY_IN_BLOCK)
+ return NR_DENTRY_IN_BLOCK;
+
+ zero_end = find_next_bit_le(&dentry_blk->dentry_bitmap,
+ NR_DENTRY_IN_BLOCK,
+ zero_start);
+ if (zero_end - zero_start >= slots)
+ return zero_start;
+
+ bit_start = zero_end + 1;
+
+ if (zero_end + 1 >= NR_DENTRY_IN_BLOCK)
+ return NR_DENTRY_IN_BLOCK;
+ goto next;
+}
+
+int f2fs_add_link(struct dentry *dentry, struct inode *inode)
+{
+ unsigned int bit_pos;
+ unsigned int level;
+ unsigned int current_depth;
+ unsigned long bidx, block;
+ f2fs_hash_t dentry_hash;
+ struct f2fs_dir_entry *de;
+ unsigned int nbucket, nblock;
+ struct inode *dir = dentry->d_parent->d_inode;
+ struct f2fs_sb_info *sbi = F2FS_SB(dir->i_sb);
+ const char *name = dentry->d_name.name;
+ int namelen = dentry->d_name.len;
+ struct page *dentry_page = NULL;
+ struct f2fs_dentry_block *dentry_blk = NULL;
+ int slots = (namelen + F2FS_NAME_LEN - 1) / F2FS_NAME_LEN;
+ int err = 0;
+ int i;
+
+ dentry_hash = f2fs_dentry_hash(name, dentry->d_name.len);
+ level = 0;
+ current_depth = F2FS_I(dir)->i_current_depth;
+ if (F2FS_I(dir)->chash == dentry_hash) {
+ level = F2FS_I(dir)->clevel;
+ F2FS_I(dir)->chash = 0;
+ }
+
+start:
+ if (current_depth == MAX_DIR_HASH_DEPTH)
+ return -ENOSPC;
+
+ /* Increase the depth, if required */
+ if (level == current_depth)
+ ++current_depth;
+
+ nbucket = dir_buckets(level);
+ nblock = bucket_blocks(level);
+
+ bidx = dir_block_index(level, (dentry_hash % nbucket));
+
+ for (block = bidx; block <= (bidx + nblock - 1); block++) {
+ mutex_lock_op(sbi, DENTRY_OPS);
+ dentry_page = get_new_data_page(dir, block, true);
+ if (IS_ERR(dentry_page)) {
+ mutex_unlock_op(sbi, DENTRY_OPS);
+ return PTR_ERR(dentry_page);
+ }
+
+ dentry_blk = kmap(dentry_page);
+ bit_pos = room_for_filename(dentry_blk, slots);
+ if (bit_pos < NR_DENTRY_IN_BLOCK)
+ goto add_dentry;
+
+ kunmap(dentry_page);
+ f2fs_put_page(dentry_page, 1);
+ mutex_unlock_op(sbi, DENTRY_OPS);
+ }
+
+ /* Move to next level to find the empty slot for new dentry */
+ ++level;
+ goto start;
+add_dentry:
+ err = init_inode_metadata(inode, dentry);
+ if (err)
+ goto fail;
+
+ wait_on_page_writeback(dentry_page);
+
+ de = &dentry_blk->dentry[bit_pos];
+ de->hash_code = cpu_to_le32(dentry_hash);
+ de->name_len = cpu_to_le16(namelen);
+ memcpy(dentry_blk->filename[bit_pos], name, namelen);
+ de->ino = cpu_to_le32(inode->i_ino);
+ set_de_type(de, inode);
+ for (i = 0; i < slots; i++)
+ test_and_set_bit_le(bit_pos + i, &dentry_blk->dentry_bitmap);
+ set_page_dirty(dentry_page);
+ update_parent_metadata(dir, inode, current_depth);
+fail:
+ kunmap(dentry_page);
+ f2fs_put_page(dentry_page, 1);
+ mutex_unlock_op(sbi, DENTRY_OPS);
+ return err;
+}
+
+/**
+ * It only removes the dentry from the dentry page,corresponding name
+ * entry in name page does not need to be touched during deletion.
+ */
+void f2fs_delete_entry(struct f2fs_dir_entry *dentry, struct page *page,
+ struct inode *inode)
+{
+ struct f2fs_dentry_block *dentry_blk;
+ unsigned int bit_pos;
+ struct address_space *mapping = page->mapping;
+ struct inode *dir = mapping->host;
+ struct f2fs_sb_info *sbi = F2FS_SB(dir->i_sb);
+ int slots = (le16_to_cpu(dentry->name_len) + F2FS_NAME_LEN - 1) /
+ F2FS_NAME_LEN;
+ void *kaddr = page_address(page);
+ int i;
+
+ mutex_lock_op(sbi, DENTRY_OPS);
+
+ lock_page(page);
+ wait_on_page_writeback(page);
+
+ dentry_blk = (struct f2fs_dentry_block *)kaddr;
+ bit_pos = dentry - (struct f2fs_dir_entry *)dentry_blk->dentry;
+ for (i = 0; i < slots; i++)
+ test_and_clear_bit_le(bit_pos + i, &dentry_blk->dentry_bitmap);
+
+ /* Let's check and deallocate this dentry page */
+ bit_pos = find_next_bit_le(&dentry_blk->dentry_bitmap,
+ NR_DENTRY_IN_BLOCK,
+ 0);
+ kunmap(page); /* kunmap - pair of f2fs_find_entry */
+ set_page_dirty(page);
+
+ dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+
+ if (inode && S_ISDIR(inode->i_mode)) {
+ drop_nlink(dir);
+ f2fs_write_inode(dir, NULL);
+ } else {
+ mark_inode_dirty(dir);
+ }
+
+ if (inode) {
+ inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+ drop_nlink(inode);
+ if (S_ISDIR(inode->i_mode)) {
+ drop_nlink(inode);
+ i_size_write(inode, 0);
+ }
+ f2fs_write_inode(inode, NULL);
+ if (inode->i_nlink == 0)
+ add_orphan_inode(sbi, inode->i_ino);
+ }
+
+ if (bit_pos == NR_DENTRY_IN_BLOCK) {
+ loff_t page_offset;
+ truncate_hole(dir, page->index, page->index + 1);
+ clear_page_dirty_for_io(page);
+ ClearPageUptodate(page);
+ dec_page_count(sbi, F2FS_DIRTY_DENTS);
+ inode_dec_dirty_dents(dir);
+ page_offset = page->index << PAGE_CACHE_SHIFT;
+ f2fs_put_page(page, 1);
+ } else {
+ f2fs_put_page(page, 1);
+ }
+ mutex_unlock_op(sbi, DENTRY_OPS);
+}
+
+int f2fs_make_empty(struct inode *inode, struct inode *parent)
+{
+ struct page *dentry_page;
+ struct f2fs_dentry_block *dentry_blk;
+ struct f2fs_dir_entry *de;
+ void *kaddr;
+
+ dentry_page = get_new_data_page(inode, 0, true);
+ if (IS_ERR(dentry_page))
+ return PTR_ERR(dentry_page);
+
+ kaddr = kmap_atomic(dentry_page);
+ dentry_blk = (struct f2fs_dentry_block *)kaddr;
+
+ de = &dentry_blk->dentry[0];
+ de->name_len = cpu_to_le16(1);
+ de->hash_code = 0;
+ de->ino = cpu_to_le32(inode->i_ino);
+ memcpy(dentry_blk->filename[0], ".", 1);
+ set_de_type(de, inode);
+
+ de = &dentry_blk->dentry[1];
+ de->hash_code = 0;
+ de->name_len = cpu_to_le16(2);
+ de->ino = cpu_to_le32(parent->i_ino);
+ memcpy(dentry_blk->filename[1], "..", 2);
+ set_de_type(de, inode);
+
+ test_and_set_bit_le(0, &dentry_blk->dentry_bitmap);
+ test_and_set_bit_le(1, &dentry_blk->dentry_bitmap);
+ kunmap_atomic(kaddr);
+
+ set_page_dirty(dentry_page);
+ f2fs_put_page(dentry_page, 1);
+ return 0;
+}
+
+bool f2fs_empty_dir(struct inode *dir)
+{
+ unsigned long bidx;
+ struct page *dentry_page;
+ unsigned int bit_pos;
+ struct f2fs_dentry_block *dentry_blk;
+ unsigned long nblock = dir_blocks(dir);
+
+ for (bidx = 0; bidx < nblock; bidx++) {
+ void *kaddr;
+ dentry_page = get_lock_data_page(dir, bidx);
+ if (IS_ERR(dentry_page)) {
+ if (PTR_ERR(dentry_page) == -ENOENT)
+ continue;
+ else
+ return false;
+ }
+
+ kaddr = kmap_atomic(dentry_page);
+ dentry_blk = (struct f2fs_dentry_block *)kaddr;
+ if (bidx == 0)
+ bit_pos = 2;
+ else
+ bit_pos = 0;
+ bit_pos = find_next_bit_le(&dentry_blk->dentry_bitmap,
+ NR_DENTRY_IN_BLOCK,
+ bit_pos);
+ kunmap_atomic(kaddr);
+
+ f2fs_put_page(dentry_page, 1);
+
+ if (bit_pos < NR_DENTRY_IN_BLOCK)
+ return false;
+ }
+ return true;
+}
+
+static int f2fs_readdir(struct file *file, void *dirent, filldir_t filldir)
+{
+ unsigned long pos = file->f_pos;
+ struct inode *inode = file->f_dentry->d_inode;
+ unsigned long npages = dir_blocks(inode);
+ unsigned char *types = NULL;
+ unsigned int bit_pos = 0, start_bit_pos = 0;
+ int over = 0;
+ struct f2fs_dentry_block *dentry_blk = NULL;
+ struct f2fs_dir_entry *de = NULL;
+ struct page *dentry_page = NULL;
+ unsigned int n = 0;
+ unsigned char *ptr = NULL;
+ unsigned char d_type = DT_UNKNOWN;
+ int slots;
+
+ types = f2fs_filetype_table;
+ bit_pos = (pos % NR_DENTRY_IN_BLOCK);
+ n = (pos / NR_DENTRY_IN_BLOCK);
+
+ for ( ; n < npages; n++) {
+ dentry_page = get_lock_data_page(inode, n);
+ if (IS_ERR(dentry_page))
+ continue;
+
+ start_bit_pos = bit_pos;
+ dentry_blk = kmap(dentry_page);
+ while (bit_pos < NR_DENTRY_IN_BLOCK) {
+ d_type = DT_UNKNOWN;
+ bit_pos = find_next_bit_le(&dentry_blk->dentry_bitmap,
+ NR_DENTRY_IN_BLOCK,
+ bit_pos);
+ if (bit_pos >= NR_DENTRY_IN_BLOCK)
+ break;
+
+ de = &dentry_blk->dentry[bit_pos];
+ if (types && de->file_type < F2FS_FT_MAX)
+ d_type = types[de->file_type];
+
+ ptr = dentry_blk->filename[bit_pos];
+ over = filldir(dirent,
+ dentry_blk->filename[bit_pos],
+ le16_to_cpu(de->name_len),
+ (n * NR_DENTRY_IN_BLOCK) + bit_pos,
+ le32_to_cpu(de->ino), d_type);
+ if (over) {
+ file->f_pos += bit_pos - start_bit_pos;
+ goto success;
+ }
+ slots = (le16_to_cpu(de->name_len) + F2FS_NAME_LEN - 1)
+ / F2FS_NAME_LEN;
+ bit_pos += slots;
+ }
+ bit_pos = 0;
+ file->f_pos = (n + 1) * NR_DENTRY_IN_BLOCK;
+ kunmap(dentry_page);
+ f2fs_put_page(dentry_page, 1);
+ dentry_page = NULL;
+ }
+success:
+ if (dentry_page && !IS_ERR(dentry_page)) {
+ kunmap(dentry_page);
+ f2fs_put_page(dentry_page, 1);
+ }
+
+ return 0;
+}
+
+const struct file_operations f2fs_dir_operations = {
+ .llseek = generic_file_llseek,
+ .read = generic_read_dir,
+ .readdir = f2fs_readdir,
+ .fsync = f2fs_sync_file,
+ .unlocked_ioctl = f2fs_ioctl,
+};
diff --git a/fs/f2fs/hash.c b/fs/f2fs/hash.c
new file mode 100644
index 0000000..098a196
--- /dev/null
+++ b/fs/f2fs/hash.c
@@ -0,0 +1,98 @@
+/**
+ * fs/f2fs/hash.c
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * Portions of this code from linux/fs/ext3/hash.c
+ *
+ * Copyright (C) 2002 by Theodore Ts'o
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/types.h>
+#include <linux/fs.h>
+#include <linux/f2fs_fs.h>
+#include <linux/cryptohash.h>
+#include <linux/pagemap.h>
+
+#include "f2fs.h"
+
+/*
+ * Hashing code copied from ext3
+ */
+#define DELTA 0x9E3779B9
+
+static void TEA_transform(unsigned int buf[4], unsigned int const in[])
+{
+ __u32 sum = 0;
+ __u32 b0 = buf[0], b1 = buf[1];
+ __u32 a = in[0], b = in[1], c = in[2], d = in[3];
+ int n = 16;
+
+ do {
+ sum += DELTA;
+ b0 += ((b1 << 4)+a) ^ (b1+sum) ^ ((b1 >> 5)+b);
+ b1 += ((b0 << 4)+c) ^ (b0+sum) ^ ((b0 >> 5)+d);
+ } while (--n);
+
+ buf[0] += b0;
+ buf[1] += b1;
+}
+
+static void str2hashbuf(const char *msg, int len, unsigned int *buf, int num)
+{
+ unsigned pad, val;
+ int i;
+
+ pad = (__u32)len | ((__u32)len << 8);
+ pad |= pad << 16;
+
+ val = pad;
+ if (len > num * 4)
+ len = num * 4;
+ for (i = 0; i < len; i++) {
+ if ((i % 4) == 0)
+ val = pad;
+ val = msg[i] + (val << 8);
+ if ((i % 4) == 3) {
+ *buf++ = val;
+ val = pad;
+ num--;
+ }
+ }
+ if (--num >= 0)
+ *buf++ = val;
+ while (--num >= 0)
+ *buf++ = pad;
+}
+
+f2fs_hash_t f2fs_dentry_hash(const char *name, int len)
+{
+ __u32 hash, minor_hash;
+ f2fs_hash_t f2fs_hash;
+ const char *p;
+ __u32 in[8], buf[4];
+
+ /* Initialize the default seed for the hash checksum functions */
+ buf[0] = 0x67452301;
+ buf[1] = 0xefcdab89;
+ buf[2] = 0x98badcfe;
+ buf[3] = 0x10325476;
+
+ p = name;
+ while (len > 0) {
+ str2hashbuf(p, len, in, 4);
+ TEA_transform(buf, in);
+ len -= 16;
+ p += 16;
+ }
+ hash = buf[0];
+ minor_hash = buf[1];
+
+ f2fs_hash = hash;
+ f2fs_hash &= ~F2FS_HASH_COL_BIT;
+ return f2fs_hash;
+}
--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-10-31 09:48:13

[permalink] [raw]

Subject: [PATCH 13/17] f2fs: add xattr and acl functionalities

This implements xattr and acl functionalities.

- F2FS uses a node page to contain use extended attributes.

Signed-off-by: Changman Lee <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/acl.c | 465 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/f2fs/acl.h | 57 +++++++
fs/f2fs/xattr.c | 389 ++++++++++++++++++++++++++++++++++++++++++++++
fs/f2fs/xattr.h | 145 +++++++++++++++++
4 files changed, 1056 insertions(+)
create mode 100644 fs/f2fs/acl.c
create mode 100644 fs/f2fs/acl.h
create mode 100644 fs/f2fs/xattr.c
create mode 100644 fs/f2fs/xattr.h

diff --git a/fs/f2fs/acl.c b/fs/f2fs/acl.c
new file mode 100644
index 0000000..dff2a2b
--- /dev/null
+++ b/fs/f2fs/acl.c
@@ -0,0 +1,465 @@
+/**
+ * fs/f2fs/acl.c
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * Portions of this code from linux/fs/ext2/acl.c
+ *
+ * Copyright (C) 2001-2003 Andreas Gruenbacher, <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/f2fs_fs.h>
+#include "f2fs.h"
+#include "xattr.h"
+#include "acl.h"
+
+#define get_inode_mode(i) ((is_inode_flag_set(F2FS_I(i), FI_ACL_MODE)) ? \
+ (F2FS_I(i)->i_acl_mode) : ((i)->i_mode))
+
+static inline size_t f2fs_acl_size(int count)
+{
+ if (count <= 4) {
+ return sizeof(struct f2fs_acl_header) +
+ count * sizeof(struct f2fs_acl_entry_short);
+ } else {
+ return sizeof(struct f2fs_acl_header) +
+ 4 * sizeof(struct f2fs_acl_entry_short) +
+ (count - 4) * sizeof(struct f2fs_acl_entry);
+ }
+}
+
+static inline int f2fs_acl_count(size_t size)
+{
+ ssize_t s;
+ size -= sizeof(struct f2fs_acl_header);
+ s = size - 4 * sizeof(struct f2fs_acl_entry_short);
+ if (s < 0) {
+ if (size % sizeof(struct f2fs_acl_entry_short))
+ return -1;
+ return size / sizeof(struct f2fs_acl_entry_short);
+ } else {
+ if (s % sizeof(struct f2fs_acl_entry))
+ return -1;
+ return s / sizeof(struct f2fs_acl_entry) + 4;
+ }
+}
+
+static struct posix_acl *f2fs_acl_from_disk(const char *value, size_t size)
+{
+ int i, count;
+ struct posix_acl *acl;
+ struct f2fs_acl_header *hdr = (struct f2fs_acl_header *)value;
+ struct f2fs_acl_entry *entry = (struct f2fs_acl_entry *)(hdr + 1);
+ const char *end = value + size;
+
+ if (hdr->a_version != cpu_to_le32(F2FS_ACL_VERSION))
+ return ERR_PTR(-EINVAL);
+
+ count = f2fs_acl_count(size);
+ if (count < 0)
+ return ERR_PTR(-EINVAL);
+ if (count == 0)
+ return NULL;
+
+ acl = posix_acl_alloc(count, GFP_KERNEL);
+ if (!acl)
+ return ERR_PTR(-ENOMEM);
+
+ for (i = 0; i < count; i++) {
+
+ if ((char *)entry > end)
+ goto fail;
+
+ acl->a_entries[i].e_tag = le16_to_cpu(entry->e_tag);
+ acl->a_entries[i].e_perm = le16_to_cpu(entry->e_perm);
+
+ switch (acl->a_entries[i].e_tag) {
+ case ACL_USER_OBJ:
+ case ACL_GROUP_OBJ:
+ case ACL_MASK:
+ case ACL_OTHER:
+ acl->a_entries[i].e_id = ACL_UNDEFINED_ID;
+ entry = (struct f2fs_acl_entry *)((char *)entry +
+ sizeof(struct f2fs_acl_entry_short));
+ break;
+
+ case ACL_USER:
+ acl->a_entries[i].e_uid =
+ make_kuid(&init_user_ns,
+ le32_to_cpu(entry->e_id));
+ entry = (struct f2fs_acl_entry *)((char *)entry +
+ sizeof(struct f2fs_acl_entry));
+ break;
+ case ACL_GROUP:
+ acl->a_entries[i].e_gid =
+ make_kgid(&init_user_ns,
+ le32_to_cpu(entry->e_id));
+ entry = (struct f2fs_acl_entry *)((char *)entry +
+ sizeof(struct f2fs_acl_entry));
+ break;
+ default:
+ goto fail;
+ }
+ }
+ if ((char *)entry != end)
+ goto fail;
+ return acl;
+fail:
+ posix_acl_release(acl);
+ return ERR_PTR(-EINVAL);
+}
+
+static void *f2fs_acl_to_disk(const struct posix_acl *acl, size_t *size)
+{
+ struct f2fs_acl_header *f2fs_acl;
+ struct f2fs_acl_entry *entry;
+ int i;
+
+ f2fs_acl = kmalloc(sizeof(struct f2fs_acl_header) + acl->a_count *
+ sizeof(struct f2fs_acl_entry), GFP_KERNEL);
+ if (!f2fs_acl)
+ return ERR_PTR(-ENOMEM);
+
+ f2fs_acl->a_version = cpu_to_le32(F2FS_ACL_VERSION);
+ entry = (struct f2fs_acl_entry *)(f2fs_acl + 1);
+
+ for (i = 0; i < acl->a_count; i++) {
+
+ entry->e_tag = cpu_to_le16(acl->a_entries[i].e_tag);
+ entry->e_perm = cpu_to_le16(acl->a_entries[i].e_perm);
+
+ switch (acl->a_entries[i].e_tag) {
+ case ACL_USER:
+ entry->e_id = cpu_to_le32(
+ from_kuid(&init_user_ns,
+ acl->a_entries[i].e_uid));
+ entry = (struct f2fs_acl_entry *)((char *)entry +
+ sizeof(struct f2fs_acl_entry));
+ break;
+ case ACL_GROUP:
+ entry->e_id = cpu_to_le32(
+ from_kgid(&init_user_ns,
+ acl->a_entries[i].e_gid));
+ entry = (struct f2fs_acl_entry *)((char *)entry +
+ sizeof(struct f2fs_acl_entry));
+ break;
+ case ACL_USER_OBJ:
+ case ACL_GROUP_OBJ:
+ case ACL_MASK:
+ case ACL_OTHER:
+ entry = (struct f2fs_acl_entry *)((char *)entry +
+ sizeof(struct f2fs_acl_entry_short));
+ break;
+ default:
+ goto fail;
+ }
+ }
+ *size = f2fs_acl_size(acl->a_count);
+ return (void *)f2fs_acl;
+
+fail:
+ kfree(f2fs_acl);
+ return ERR_PTR(-EINVAL);
+}
+
+struct posix_acl *f2fs_get_acl(struct inode *inode, int type)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ int name_index = F2FS_XATTR_INDEX_POSIX_ACL_DEFAULT;
+ void *value = NULL;
+ struct posix_acl *acl;
+ int retval;
+
+ if (!test_opt(sbi, POSIX_ACL))
+ return NULL;
+
+ acl = get_cached_acl(inode, type);
+ if (acl != ACL_NOT_CACHED)
+ return acl;
+
+ if (type == ACL_TYPE_ACCESS)
+ name_index = F2FS_XATTR_INDEX_POSIX_ACL_ACCESS;
+
+ retval = f2fs_getxattr(inode, name_index, "", NULL, 0);
+ if (retval > 0) {
+ value = kmalloc(retval, GFP_KERNEL);
+ if (!value)
+ return ERR_PTR(-ENOMEM);
+ retval = f2fs_getxattr(inode, name_index, "", value, retval);
+ }
+
+ if (retval < 0) {
+ if (retval == -ENODATA)
+ acl = NULL;
+ else
+ acl = ERR_PTR(retval);
+ } else {
+ acl = f2fs_acl_from_disk(value, retval);
+ }
+ kfree(value);
+ if (!IS_ERR(acl))
+ set_cached_acl(inode, type, acl);
+
+ return acl;
+}
+
+static int f2fs_set_acl(struct inode *inode, int type, struct posix_acl *acl)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ struct f2fs_inode_info *fi = F2FS_I(inode);
+ int name_index;
+ void *value = NULL;
+ size_t size = 0;
+ int error;
+
+ if (!test_opt(sbi, POSIX_ACL))
+ return 0;
+ if (S_ISLNK(inode->i_mode))
+ return -EOPNOTSUPP;
+
+ switch (type) {
+ case ACL_TYPE_ACCESS:
+ name_index = F2FS_XATTR_INDEX_POSIX_ACL_ACCESS;
+ if (acl) {
+ error = posix_acl_equiv_mode(acl, &inode->i_mode);
+ if (error < 0)
+ return error;
+ set_acl_inode(fi, inode->i_mode);
+ if (error == 0)
+ acl = NULL;
+ }
+ break;
+
+ case ACL_TYPE_DEFAULT:
+ name_index = F2FS_XATTR_INDEX_POSIX_ACL_DEFAULT;
+ if (!S_ISDIR(inode->i_mode))
+ return acl ? -EACCES : 0;
+ break;
+
+ default:
+ return -EINVAL;
+ }
+
+ if (acl) {
+ value = f2fs_acl_to_disk(acl, &size);
+ if (IS_ERR(value)) {
+ cond_clear_inode_flag(fi, FI_ACL_MODE);
+ return (int)PTR_ERR(value);
+ }
+ }
+
+ error = f2fs_setxattr(inode, name_index, "", value, size);
+
+ kfree(value);
+ if (!error)
+ set_cached_acl(inode, type, acl);
+
+ cond_clear_inode_flag(fi, FI_ACL_MODE);
+ return error;
+}
+
+int f2fs_init_acl(struct inode *inode, struct inode *dir)
+{
+ struct posix_acl *acl = NULL;
+ struct f2fs_sb_info *sbi = F2FS_SB(dir->i_sb);
+ int error = 0;
+
+ if (!S_ISLNK(inode->i_mode)) {
+ if (test_opt(sbi, POSIX_ACL)) {
+ acl = f2fs_get_acl(dir, ACL_TYPE_DEFAULT);
+ if (IS_ERR(acl))
+ return PTR_ERR(acl);
+ }
+ if (!acl)
+ inode->i_mode &= ~current_umask();
+ }
+
+ if (test_opt(sbi, POSIX_ACL) && acl) {
+
+ if (S_ISDIR(inode->i_mode)) {
+ error = f2fs_set_acl(inode, ACL_TYPE_DEFAULT, acl);
+ if (error)
+ goto cleanup;
+ }
+ error = posix_acl_create(&acl, GFP_KERNEL, &inode->i_mode);
+ if (error < 0)
+ return error;
+ if (error > 0)
+ error = f2fs_set_acl(inode, ACL_TYPE_ACCESS, acl);
+ }
+cleanup:
+ posix_acl_release(acl);
+ return error;
+}
+
+int f2fs_acl_chmod(struct inode *inode)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ struct posix_acl *acl;
+ int error;
+ mode_t mode = get_inode_mode(inode);
+
+ if (!test_opt(sbi, POSIX_ACL))
+ return 0;
+ if (S_ISLNK(mode))
+ return -EOPNOTSUPP;
+
+ acl = f2fs_get_acl(inode, ACL_TYPE_ACCESS);
+ if (IS_ERR(acl) || !acl)
+ return PTR_ERR(acl);
+
+ error = posix_acl_chmod(&acl, GFP_KERNEL, mode);
+ if (error)
+ return error;
+ error = f2fs_set_acl(inode, ACL_TYPE_ACCESS, acl);
+ posix_acl_release(acl);
+ return error;
+}
+
+static size_t f2fs_xattr_list_acl(struct dentry *dentry, char *list,
+ size_t list_size, const char *name, size_t name_len, int type)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(dentry->d_sb);
+ const char *xname = POSIX_ACL_XATTR_DEFAULT;
+ size_t size;
+
+ if (!test_opt(sbi, POSIX_ACL))
+ return 0;
+
+ if (type == ACL_TYPE_ACCESS)
+ xname = POSIX_ACL_XATTR_ACCESS;
+
+ size = strlen(xname) + 1;
+ if (list && size <= list_size)
+ memcpy(list, xname, size);
+ return size;
+}
+
+static int f2fs_xattr_get_acl(struct dentry *dentry, const char *name,
+ void *buffer, size_t size, int type)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(dentry->d_sb);
+ struct posix_acl *acl;
+ int error;
+
+ if (strcmp(name, "") != 0)
+ return -EINVAL;
+ if (!test_opt(sbi, POSIX_ACL))
+ return -EOPNOTSUPP;
+
+ acl = f2fs_get_acl(dentry->d_inode, type);
+ if (IS_ERR(acl))
+ return PTR_ERR(acl);
+ if (!acl)
+ return -ENODATA;
+ error = posix_acl_to_xattr(&init_user_ns, acl, buffer, size);
+ posix_acl_release(acl);
+
+ return error;
+}
+
+static int f2fs_xattr_set_acl(struct dentry *dentry, const char *name,
+ const void *value, size_t size, int flags, int type)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(dentry->d_sb);
+ struct inode *inode = dentry->d_inode;
+ struct posix_acl *acl = NULL;
+ int error;
+
+ if (strcmp(name, "") != 0)
+ return -EINVAL;
+ if (!test_opt(sbi, POSIX_ACL))
+ return -EOPNOTSUPP;
+ if (!inode_owner_or_capable(inode))
+ return -EPERM;
+
+ if (value) {
+ acl = posix_acl_from_xattr(&init_user_ns, value, size);
+ if (IS_ERR(acl))
+ return PTR_ERR(acl);
+ if (acl) {
+ error = posix_acl_valid(acl);
+ if (error)
+ goto release_and_out;
+ }
+ } else {
+ acl = NULL;
+ }
+
+ error = f2fs_set_acl(inode, type, acl);
+
+release_and_out:
+ posix_acl_release(acl);
+ return error;
+}
+
+const struct xattr_handler f2fs_xattr_acl_default_handler = {
+ .prefix = POSIX_ACL_XATTR_DEFAULT,
+ .flags = ACL_TYPE_DEFAULT,
+ .list = f2fs_xattr_list_acl,
+ .get = f2fs_xattr_get_acl,
+ .set = f2fs_xattr_set_acl,
+};
+
+const struct xattr_handler f2fs_xattr_acl_access_handler = {
+ .prefix = POSIX_ACL_XATTR_ACCESS,
+ .flags = ACL_TYPE_ACCESS,
+ .list = f2fs_xattr_list_acl,
+ .get = f2fs_xattr_get_acl,
+ .set = f2fs_xattr_set_acl,
+};
+
+static size_t f2fs_xattr_advise_list(struct dentry *dentry, char *list,
+ size_t list_size, const char *name, size_t name_len, int type)
+{
+ const char *xname = F2FS_SYSTEM_ADVISE_PREFIX;
+ size_t size;
+
+ if (type != F2FS_XATTR_INDEX_ADVISE)
+ return 0;
+
+ size = strlen(xname) + 1;
+ if (list && size <= list_size)
+ memcpy(list, xname, size);
+ return size;
+}
+
+static int f2fs_xattr_advise_get(struct dentry *dentry, const char *name,
+ void *buffer, size_t size, int type)
+{
+ struct inode *inode = dentry->d_inode;
+
+ if (strcmp(name, "") != 0)
+ return -EINVAL;
+
+ *((char *)buffer) = F2FS_I(inode)->i_advise;
+ return sizeof(char);
+}
+
+static int f2fs_xattr_advise_set(struct dentry *dentry, const char *name,
+ const void *value, size_t size, int flags, int type)
+{
+ struct inode *inode = dentry->d_inode;
+
+ if (strcmp(name, "") != 0)
+ return -EINVAL;
+ if (!inode_owner_or_capable(inode))
+ return -EPERM;
+ if (value == NULL)
+ return -EINVAL;
+
+ F2FS_I(inode)->i_advise |= *(char *)value;
+ return 0;
+}
+
+const struct xattr_handler f2fs_xattr_advise_handler = {
+ .prefix = F2FS_SYSTEM_ADVISE_PREFIX,
+ .flags = F2FS_XATTR_INDEX_ADVISE,
+ .list = f2fs_xattr_advise_list,
+ .get = f2fs_xattr_advise_get,
+ .set = f2fs_xattr_advise_set,
+};
diff --git a/fs/f2fs/acl.h b/fs/f2fs/acl.h
new file mode 100644
index 0000000..c97675e
--- /dev/null
+++ b/fs/f2fs/acl.h
@@ -0,0 +1,57 @@
+/**
+ * fs/f2fs/acl.h
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * Portions of this code from linux/fs/ext2/acl.h
+ *
+ * Copyright (C) 2001-2003 Andreas Gruenbacher, <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef __F2FS_ACL_H__
+#define __F2FS_ACL_H__
+
+#include <linux/posix_acl_xattr.h>
+
+#define F2FS_ACL_VERSION 0x0001
+
+struct f2fs_acl_entry {
+ __le16 e_tag;
+ __le16 e_perm;
+ __le32 e_id;
+};
+
+struct f2fs_acl_entry_short {
+ __le16 e_tag;
+ __le16 e_perm;
+};
+
+struct f2fs_acl_header {
+ __le32 a_version;
+};
+
+#ifdef CONFIG_F2FS_FS_POSIX_ACL
+
+extern struct posix_acl *f2fs_get_acl(struct inode *inode, int type);
+extern int f2fs_acl_chmod(struct inode *inode);
+extern int f2fs_init_acl(struct inode *inode, struct inode *dir);
+#else
+#define f2fs_check_acl NULL
+#define f2fs_get_acl NULL
+#define f2fs_set_acl NULL
+
+static inline int f2fs_acl_chmod(struct inode *inode)
+{
+ return 0;
+}
+
+static inline int f2fs_init_acl(struct inode *inode, struct inode *dir)
+{
+ return 0;
+}
+#endif
+#endif /* __F2FS_ACL_H__ */
diff --git a/fs/f2fs/xattr.c b/fs/f2fs/xattr.c
new file mode 100644
index 0000000..aca50fe
--- /dev/null
+++ b/fs/f2fs/xattr.c
@@ -0,0 +1,389 @@
+/**
+ * fs/f2fs/xattr.c
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * Portions of this code from linux/fs/ext2/xattr.c
+ *
+ * Copyright (C) 2001-2003 Andreas Gruenbacher <[email protected]>
+ *
+ * Fix by Harrison Xing <[email protected]>.
+ * Extended attributes for symlinks and special files added per
+ * suggestion of Luka Renko <[email protected]>.
+ * xattr consolidation Copyright (c) 2004 James Morris <[email protected]>,
+ * Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/rwsem.h>
+#include <linux/f2fs_fs.h>
+#include "f2fs.h"
+#include "xattr.h"
+
+static size_t f2fs_xattr_generic_list(struct dentry *dentry, char *list,
+ size_t list_size, const char *name, size_t name_len, int type)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(dentry->d_sb);
+ int total_len, prefix_len = 0;
+ const char *prefix = NULL;
+
+ switch (type) {
+ case F2FS_XATTR_INDEX_USER:
+ if (!test_opt(sbi, XATTR_USER))
+ return -EOPNOTSUPP;
+ prefix = XATTR_USER_PREFIX;
+ prefix_len = XATTR_USER_PREFIX_LEN;
+ break;
+ case F2FS_XATTR_INDEX_TRUSTED:
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ prefix = XATTR_TRUSTED_PREFIX;
+ prefix_len = XATTR_TRUSTED_PREFIX_LEN;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ total_len = prefix_len + name_len + 1;
+ if (list && total_len <= list_size) {
+ memcpy(list, prefix, prefix_len);
+ memcpy(list+prefix_len, name, name_len);
+ list[prefix_len + name_len] = '\0';
+ }
+ return total_len;
+}
+
+static int f2fs_xattr_generic_get(struct dentry *dentry, const char *name,
+ void *buffer, size_t size, int type)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(dentry->d_sb);
+
+ switch (type) {
+ case F2FS_XATTR_INDEX_USER:
+ if (!test_opt(sbi, XATTR_USER))
+ return -EOPNOTSUPP;
+ break;
+ case F2FS_XATTR_INDEX_TRUSTED:
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ break;
+ default:
+ return -EINVAL;
+ }
+ if (strcmp(name, "") == 0)
+ return -EINVAL;
+ return f2fs_getxattr(dentry->d_inode, type, name,
+ buffer, size);
+}
+
+static int f2fs_xattr_generic_set(struct dentry *dentry, const char *name,
+ const void *value, size_t size, int flags, int type)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(dentry->d_sb);
+
+ switch (type) {
+ case F2FS_XATTR_INDEX_USER:
+ if (!test_opt(sbi, XATTR_USER))
+ return -EOPNOTSUPP;
+ break;
+ case F2FS_XATTR_INDEX_TRUSTED:
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ break;
+ default:
+ return -EINVAL;
+ }
+ if (strcmp(name, "") == 0)
+ return -EINVAL;
+
+ return f2fs_setxattr(dentry->d_inode, type, name, value, size);
+}
+
+const struct xattr_handler f2fs_xattr_user_handler = {
+ .prefix = XATTR_USER_PREFIX,
+ .flags = F2FS_XATTR_INDEX_USER,
+ .list = f2fs_xattr_generic_list,
+ .get = f2fs_xattr_generic_get,
+ .set = f2fs_xattr_generic_set,
+};
+
+const struct xattr_handler f2fs_xattr_trusted_handler = {
+ .prefix = XATTR_TRUSTED_PREFIX,
+ .flags = F2FS_XATTR_INDEX_TRUSTED,
+ .list = f2fs_xattr_generic_list,
+ .get = f2fs_xattr_generic_get,
+ .set = f2fs_xattr_generic_set,
+};
+
+static const struct xattr_handler *f2fs_xattr_handler_map[] = {
+ [F2FS_XATTR_INDEX_USER] = &f2fs_xattr_user_handler,
+#ifdef CONFIG_F2FS_FS_POSIX_ACL
+ [F2FS_XATTR_INDEX_POSIX_ACL_ACCESS] = &f2fs_xattr_acl_access_handler,
+ [F2FS_XATTR_INDEX_POSIX_ACL_DEFAULT] = &f2fs_xattr_acl_default_handler,
+#endif
+ [F2FS_XATTR_INDEX_TRUSTED] = &f2fs_xattr_trusted_handler,
+ [F2FS_XATTR_INDEX_ADVISE] = &f2fs_xattr_advise_handler,
+};
+
+const struct xattr_handler *f2fs_xattr_handlers[] = {
+ &f2fs_xattr_user_handler,
+#ifdef CONFIG_F2FS_FS_POSIX_ACL
+ &f2fs_xattr_acl_access_handler,
+ &f2fs_xattr_acl_default_handler,
+#endif
+ &f2fs_xattr_trusted_handler,
+ &f2fs_xattr_advise_handler,
+ NULL,
+};
+
+static inline const struct xattr_handler *f2fs_xattr_handler(int name_index)
+{
+ const struct xattr_handler *handler = NULL;
+
+ if (name_index > 0 && name_index < ARRAY_SIZE(f2fs_xattr_handler_map))
+ handler = f2fs_xattr_handler_map[name_index];
+ return handler;
+}
+
+int f2fs_getxattr(struct inode *inode, int name_index, const char *name,
+ void *buffer, size_t buffer_size)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ struct f2fs_inode_info *fi = F2FS_I(inode);
+ struct f2fs_xattr_entry *entry;
+ struct page *page;
+ void *base_addr;
+ int error = 0, found = 0;
+ int value_len, name_len;
+
+ if (name == NULL)
+ return -EINVAL;
+ name_len = strlen(name);
+
+ if (!fi->i_xattr_nid)
+ return -ENODATA;
+
+ page = get_node_page(sbi, fi->i_xattr_nid);
+ base_addr = page_address(page);
+
+ list_for_each_xattr(entry, base_addr) {
+ if (entry->e_name_index != name_index)
+ continue;
+ if (entry->e_name_len != name_len)
+ continue;
+ if (!memcmp(entry->e_name, name, name_len)) {
+ found = 1;
+ break;
+ }
+ }
+ if (!found) {
+ error = -ENODATA;
+ goto cleanup;
+ }
+
+ value_len = le16_to_cpu(entry->e_value_size);
+
+ if (buffer && value_len > buffer_size) {
+ error = -ERANGE;
+ goto cleanup;
+ }
+
+ if (buffer) {
+ char *pval = entry->e_name + entry->e_name_len;
+ memcpy(buffer, pval, value_len);
+ }
+ error = value_len;
+
+cleanup:
+ f2fs_put_page(page, 1);
+ return error;
+}
+
+ssize_t f2fs_listxattr(struct dentry *dentry, char *buffer, size_t buffer_size)
+{
+ struct inode *inode = dentry->d_inode;
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ struct f2fs_inode_info *fi = F2FS_I(inode);
+ struct f2fs_xattr_entry *entry;
+ struct page *page;
+ void *base_addr;
+ int error = 0;
+ size_t rest = buffer_size;
+
+ if (!fi->i_xattr_nid)
+ return 0;
+
+ page = get_node_page(sbi, fi->i_xattr_nid);
+ base_addr = page_address(page);
+
+ list_for_each_xattr(entry, base_addr) {
+ const struct xattr_handler *handler =
+ f2fs_xattr_handler(entry->e_name_index);
+ size_t size;
+
+ if (!handler)
+ continue;
+
+ size = handler->list(dentry, buffer, rest, entry->e_name,
+ entry->e_name_len, handler->flags);
+ if (buffer && size > rest) {
+ error = -ERANGE;
+ goto cleanup;
+ }
+
+ if (buffer)
+ buffer += size;
+ rest -= size;
+ }
+ error = buffer_size - rest;
+cleanup:
+ f2fs_put_page(page, 1);
+ return error;
+}
+
+int f2fs_setxattr(struct inode *inode, int name_index, const char *name,
+ const void *value, size_t value_len)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ struct f2fs_inode_info *fi = F2FS_I(inode);
+ struct f2fs_xattr_header *header = NULL;
+ struct f2fs_xattr_entry *here, *last;
+ struct page *page;
+ void *base_addr;
+ int error, found, free, name_len, newsize;
+ char *pval;
+
+ if (name == NULL)
+ return -EINVAL;
+ name_len = strlen(name);
+
+ if (value == NULL)
+ value_len = 0;
+
+ if (name_len > 255 || value_len > MAX_VALUE_LEN)
+ return -ERANGE;
+
+ mutex_lock_op(sbi, NODE_NEW);
+ if (!fi->i_xattr_nid) {
+ /* Allocate new attribute block */
+ struct dnode_of_data dn;
+
+ if (!alloc_nid(sbi, &fi->i_xattr_nid)) {
+ mutex_unlock_op(sbi, NODE_NEW);
+ return -ENOSPC;
+ }
+ set_new_dnode(&dn, inode, NULL, NULL, fi->i_xattr_nid);
+ mark_inode_dirty(inode);
+
+ page = new_node_page(&dn, XATTR_NODE_OFFSET);
+ if (IS_ERR(page)) {
+ alloc_nid_failed(sbi, fi->i_xattr_nid);
+ fi->i_xattr_nid = 0;
+ mutex_unlock_op(sbi, NODE_NEW);
+ return PTR_ERR(page);
+ }
+
+ alloc_nid_done(sbi, fi->i_xattr_nid);
+ base_addr = page_address(page);
+ header = XATTR_HDR(base_addr);
+ header->h_magic = cpu_to_le32(F2FS_XATTR_MAGIC);
+ header->h_refcount = cpu_to_le32(1);
+ } else {
+ /* The inode already has an extended attribute block. */
+ page = get_node_page(sbi, fi->i_xattr_nid);
+ if (IS_ERR(page)) {
+ mutex_unlock_op(sbi, NODE_NEW);
+ return PTR_ERR(page);
+ }
+
+ base_addr = page_address(page);
+ header = XATTR_HDR(base_addr);
+ }
+
+ if (le32_to_cpu(header->h_magic) != F2FS_XATTR_MAGIC) {
+ error = -EIO;
+ goto cleanup;
+ }
+
+ /* find entry with wanted name. */
+ found = 0;
+ list_for_each_xattr(here, base_addr) {
+ if (here->e_name_index != name_index)
+ continue;
+ if (here->e_name_len != name_len)
+ continue;
+ if (!memcmp(here->e_name, name, name_len)) {
+ found = 1;
+ break;
+ }
+ }
+
+ last = here;
+
+ while (!IS_XATTR_LAST_ENTRY(last))
+ last = XATTR_NEXT_ENTRY(last);
+
+ newsize = XATTR_ALIGN(sizeof(struct f2fs_xattr_entry) +
+ name_len + value_len);
+
+ /* 1. Check space */
+ if (value) {
+ /* If value is NULL, it is remove operation.
+ * In case of update operation, we caculate free.
+ */
+ free = MIN_OFFSET - ((char *)last - (char *)header);
+ if (found)
+ free = free - ENTRY_SIZE(here);
+
+ if (free < newsize) {
+ error = -ENOSPC;
+ goto cleanup;
+ }
+ }
+
+ /* 2. Remove old entry */
+ if (found) {
+ /* If entry is found, remove old entry.
+ * If not found, remove operation is not needed.
+ */
+ struct f2fs_xattr_entry *next = XATTR_NEXT_ENTRY(here);
+ int oldsize = ENTRY_SIZE(here);
+
+ memmove(here, next, (char *)last - (char *)next);
+ last = (struct f2fs_xattr_entry *)((char *)last - oldsize);
+ memset(last, 0, oldsize);
+ }
+
+ /* 3. Write new entry */
+ if (value) {
+ /* Before we come here, old entry is removed.
+ * We just write new entry. */
+ memset(last, 0, newsize);
+ last->e_name_index = name_index;
+ last->e_name_len = name_len;
+ memcpy(last->e_name, name, name_len);
+ pval = last->e_name + name_len;
+ memcpy(pval, value, value_len);
+ last->e_value_size = cpu_to_le16(value_len);
+ }
+
+ set_page_dirty(page);
+ f2fs_put_page(page, 1);
+
+ if (is_inode_flag_set(fi, FI_ACL_MODE)) {
+ inode->i_mode = fi->i_acl_mode;
+ inode->i_ctime = CURRENT_TIME;
+ clear_inode_flag(fi, FI_ACL_MODE);
+ }
+ f2fs_write_inode(inode, NULL);
+ mutex_unlock_op(sbi, NODE_NEW);
+
+ return 0;
+cleanup:
+ f2fs_put_page(page, 1);
+ mutex_unlock_op(sbi, NODE_NEW);
+ return error;
+}
diff --git a/fs/f2fs/xattr.h b/fs/f2fs/xattr.h
new file mode 100644
index 0000000..29b0a08
--- /dev/null
+++ b/fs/f2fs/xattr.h
@@ -0,0 +1,145 @@
+/**
+ * fs/f2fs/xattr.h
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * Portions of this code from linux/fs/ext2/xattr.h
+ *
+ * On-disk format of extended attributes for the ext2 filesystem.
+ *
+ * (C) 2001 Andreas Gruenbacher, <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef __F2FS_XATTR_H__
+#define __F2FS_XATTR_H__
+
+#include <linux/init.h>
+#include <linux/xattr.h>
+
+/* Magic value in attribute blocks */
+#define F2FS_XATTR_MAGIC 0xF2F52011
+
+/* Maximum number of references to one attribute block */
+#define F2FS_XATTR_REFCOUNT_MAX 1024
+
+/* Name indexes */
+#define F2FS_SYSTEM_ADVISE_PREFIX "system.advise"
+#define F2FS_XATTR_INDEX_USER 1
+#define F2FS_XATTR_INDEX_POSIX_ACL_ACCESS 2
+#define F2FS_XATTR_INDEX_POSIX_ACL_DEFAULT 3
+#define F2FS_XATTR_INDEX_TRUSTED 4
+#define F2FS_XATTR_INDEX_LUSTRE 5
+#define F2FS_XATTR_INDEX_SECURITY 6
+#define F2FS_XATTR_INDEX_ADVISE 7
+
+struct f2fs_xattr_header {
+ __le32 h_magic; /* magic number for identification */
+ __le32 h_refcount; /* reference count */
+ __u32 h_reserved[4]; /* zero right now */
+};
+
+struct f2fs_xattr_entry {
+ __u8 e_name_index;
+ __u8 e_name_len;
+ __le16 e_value_size; /* size of attribute value */
+ char e_name[0]; /* attribute name */
+};
+
+#define XATTR_HDR(ptr) ((struct f2fs_xattr_header *)(ptr))
+#define XATTR_ENTRY(ptr) ((struct f2fs_xattr_entry *)(ptr))
+#define XATTR_FIRST_ENTRY(ptr) (XATTR_ENTRY(XATTR_HDR(ptr)+1))
+#define XATTR_ROUND (3)
+
+#define XATTR_ALIGN(size) ((size + XATTR_ROUND) & ~XATTR_ROUND)
+
+#define ENTRY_SIZE(entry) (XATTR_ALIGN(sizeof(struct f2fs_xattr_entry) + \
+ entry->e_name_len + le16_to_cpu(entry->e_value_size)))
+
+#define XATTR_NEXT_ENTRY(entry) ((struct f2fs_xattr_entry *)((char *)(entry) +\
+ ENTRY_SIZE(entry)))
+
+#define IS_XATTR_LAST_ENTRY(entry) (*(__u32 *)(entry) == 0)
+
+#define list_for_each_xattr(entry, addr) \
+ for (entry = XATTR_FIRST_ENTRY(addr);\
+ !IS_XATTR_LAST_ENTRY(entry);\
+ entry = XATTR_NEXT_ENTRY(entry))
+
+
+#define MIN_OFFSET XATTR_ALIGN(PAGE_SIZE - \
+ sizeof(struct node_footer) - \
+ sizeof(__u32))
+
+#define MAX_VALUE_LEN (MIN_OFFSET - sizeof(struct f2fs_xattr_header) - \
+ sizeof(struct f2fs_xattr_entry))
+
+/**
+ * On-disk structure of f2fs_xattr
+ * We use only 1 block for xattr.
+ *
+ * +--------------------+
+ * | f2fs_xattr_header |
+ * | |
+ * +--------------------+
+ * | f2fs_xattr_entry |
+ * | .e_name_index = 1 |
+ * | .e_name_len = 3 |
+ * | .e_value_size = 14 |
+ * | .e_name = "foo" |
+ * | "value_of_xattr" |<- value_offs = e_name + e_name_len
+ * +--------------------+
+ * | f2fs_xattr_entry |
+ * | .e_name_index = 4 |
+ * | .e_name = "bar" |
+ * +--------------------+
+ * | |
+ * | Free |
+ * | |
+ * +--------------------+<- MIN_OFFSET
+ * | node_footer |
+ * | (nid, ino, offset) |
+ * +--------------------+
+ *
+ **/
+
+#ifdef CONFIG_F2FS_FS_XATTR
+extern const struct xattr_handler f2fs_xattr_user_handler;
+extern const struct xattr_handler f2fs_xattr_trusted_handler;
+extern const struct xattr_handler f2fs_xattr_acl_access_handler;
+extern const struct xattr_handler f2fs_xattr_acl_default_handler;
+extern const struct xattr_handler f2fs_xattr_advise_handler;
+
+extern const struct xattr_handler *f2fs_xattr_handlers[];
+
+extern int f2fs_setxattr(struct inode *inode, int name_index, const char *name,
+ const void *value, size_t value_len);
+extern int f2fs_getxattr(struct inode *inode, int name_index, const char *name,
+ void *buffer, size_t buffer_size);
+extern ssize_t f2fs_listxattr(struct dentry *dentry, char *buffer,
+ size_t buffer_size);
+
+#else
+
+#define f2fs_xattr_handlers NULL
+static inline int f2fs_setxattr(struct inode *inode, int name_index,
+ const char *name, const void *value, size_t value_len)
+{
+ return -EOPNOTSUPP;
+}
+static inline int f2fs_getxattr(struct inode *inode, int name_index,
+ const char *name, void *buffer, size_t buffer_size)
+{
+ return -EOPNOTSUPP;
+}
+static inline ssize_t f2fs_listxattr(struct dentry *dentry, char *buffer,
+ size_t buffer_size)
+{
+ return -EOPNOTSUPP;
+}
+#endif
+
+#endif /* __F2FS_XATTR_H__ */
--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-10-31 09:48:47

[permalink] [raw]

Subject: [PATCH 14/17] f2fs: add garbage collection functions

This adds on-demand and background cleaning functions.

- The basic background cleaning policy is trying to do cleaning jobs as much as
possible whenever the system is idle. Once the background cleaning is done,
the cleaner sleeps an amount of time not to interfere with VFS calls. The time
is dynamically adjusted according to the status of whole segments, which is
decreased when the following conditions are satisfied.

. GC is not conducted currently, and
. IO subsystem is idle by checking the number of requets in bdev's request
list, and
. There are enough dirty segments.

Otherwise, the time is increased incrementally until to the maximum time.
Note that, min and max times are 10 secs and 30 secs by default.

- F2FS adopts a default victim selection policy where background cleaning uses
a cost-benefit algorithm, while on-demand cleaning uses a greedy algorithm.

- The method of moving data during the cleaning is slightly different between
background and on-demand cleaning schemes. In the case of background cleaning,
F2FS loads the data, and marks them as dirty. Then, F2FS expects that the data
will be moved by flusher or VM. In the case of on-demand cleaning, F2FS should
move the data right away.

- In order to identify valid blocks in a victim segment, F2FS scans the bitmap
of the segment managed as an SIT entry.

Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/gc.c | 742 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/f2fs/gc.h | 117 +++++++++
2 files changed, 859 insertions(+)
create mode 100644 fs/f2fs/gc.c
create mode 100644 fs/f2fs/gc.h

diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
new file mode 100644
index 0000000..46774ce
--- /dev/null
+++ b/fs/f2fs/gc.c
@@ -0,0 +1,742 @@
+/**
+ * fs/f2fs/gc.c
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/backing-dev.h>
+#include <linux/proc_fs.h>
+#include <linux/init.h>
+#include <linux/f2fs_fs.h>
+#include <linux/kthread.h>
+#include <linux/delay.h>
+#include <linux/freezer.h>
+#include <linux/blkdev.h>
+
+#include "f2fs.h"
+#include "node.h"
+#include "segment.h"
+#include "gc.h"
+
+static struct kmem_cache *winode_slab;
+
+static int gc_thread_func(void *data)
+{
+ struct f2fs_sb_info *sbi = data;
+ wait_queue_head_t *wq = &sbi->gc_thread->gc_wait_queue_head;
+ long wait_ms;
+
+ wait_ms = GC_THREAD_MIN_SLEEP_TIME;
+
+ do {
+ if (try_to_freeze())
+ continue;
+ else
+ wait_event_interruptible_timeout(*wq,
+ kthread_should_stop(),
+ msecs_to_jiffies(wait_ms));
+ if (kthread_should_stop())
+ break;
+
+ f2fs_balance_fs(sbi);
+
+ if (!test_opt(sbi, BG_GC))
+ continue;
+
+ /*
+ * [GC triggering condition]
+ * 0. GC is not conducted currently.
+ * 1. There are enough dirty segments.
+ * 2. IO subsystem is idle by checking the # of writeback pages.
+ * 3. IO subsystem is idle by checking the # of requests in
+ * bdev's request list.
+ *
+ * Note) We have to avoid triggering GCs too much frequently.
+ * Because it is possible that some segments can be
+ * invalidated soon after by user update or deletion.
+ * So, I'd like to wait some time to collect dirty segments.
+ */
+ if (!mutex_trylock(&sbi->gc_mutex))
+ continue;
+
+ if (!is_idle(sbi)) {
+ wait_ms = increase_sleep_time(wait_ms);
+ mutex_unlock(&sbi->gc_mutex);
+ continue;
+ }
+
+ if (has_enough_invalid_blocks(sbi))
+ wait_ms = decrease_sleep_time(wait_ms);
+ else
+ wait_ms = increase_sleep_time(wait_ms);
+
+ sbi->bg_gc++;
+
+ if (f2fs_gc(sbi, 1) == GC_NONE)
+ wait_ms = GC_THREAD_NOGC_SLEEP_TIME;
+ else if (wait_ms == GC_THREAD_NOGC_SLEEP_TIME)
+ wait_ms = GC_THREAD_MAX_SLEEP_TIME;
+
+ } while (!kthread_should_stop());
+ return 0;
+}
+
+int start_gc_thread(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_gc_kthread *gc_th = NULL;
+
+ gc_th = kmalloc(sizeof(struct f2fs_gc_kthread), GFP_KERNEL);
+ if (!gc_th)
+ return -ENOMEM;
+
+ sbi->gc_thread = gc_th;
+ init_waitqueue_head(&sbi->gc_thread->gc_wait_queue_head);
+ sbi->gc_thread->f2fs_gc_task = kthread_run(gc_thread_func, sbi,
+ GC_THREAD_NAME);
+ if (IS_ERR(gc_th->f2fs_gc_task)) {
+ kfree(gc_th);
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+void stop_gc_thread(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_gc_kthread *gc_th = sbi->gc_thread;
+ if (!gc_th)
+ return;
+ kthread_stop(gc_th->f2fs_gc_task);
+ kfree(gc_th);
+ sbi->gc_thread = NULL;
+}
+
+static int select_gc_type(int gc_type)
+{
+ return (gc_type == BG_GC) ? GC_CB : GC_GREEDY;
+}
+
+static void select_policy(struct f2fs_sb_info *sbi, int gc_type,
+ int type, struct victim_sel_policy *p)
+{
+ struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
+
+ if (p->alloc_mode) {
+ p->gc_mode = GC_GREEDY;
+ p->dirty_segmap = dirty_i->dirty_segmap[type];
+ p->ofs_unit = 1;
+ } else {
+ p->gc_mode = select_gc_type(gc_type);
+ p->dirty_segmap = dirty_i->dirty_segmap[DIRTY];
+ p->ofs_unit = sbi->segs_per_sec;
+ }
+ p->offset = sbi->last_victim[p->gc_mode];
+}
+
+static unsigned int get_max_cost(struct f2fs_sb_info *sbi,
+ struct victim_sel_policy *p)
+{
+ if (p->gc_mode == GC_GREEDY)
+ return (1 << sbi->log_blocks_per_seg) * p->ofs_unit;
+ else if (p->gc_mode == GC_CB)
+ return UINT_MAX;
+ else /* No other gc_mode */
+ return 0;
+}
+
+static unsigned int check_bg_victims(struct f2fs_sb_info *sbi)
+{
+ struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
+ unsigned int segno;
+
+ /*
+ * If the gc_type is FG_GC, we can select victim segments
+ * selected by background GC before.
+ * Those segments guarantee they have small valid blocks.
+ */
+ segno = find_next_bit(dirty_i->victim_segmap[BG_GC],
+ TOTAL_SEGS(sbi), 0);
+ if (segno < TOTAL_SEGS(sbi)) {
+ clear_bit(segno, dirty_i->victim_segmap[BG_GC]);
+ return segno;
+ }
+ return NULL_SEGNO;
+}
+
+static unsigned int get_cb_cost(struct f2fs_sb_info *sbi, unsigned int segno)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ unsigned int secno = GET_SECNO(sbi, segno);
+ unsigned int start = secno * sbi->segs_per_sec;
+ unsigned long long mtime = 0;
+ unsigned int vblocks;
+ unsigned char age = 0;
+ unsigned char u;
+ unsigned int i;
+
+ for (i = 0; i < sbi->segs_per_sec; i++)
+ mtime += get_seg_entry(sbi, start + i)->mtime;
+ vblocks = get_valid_blocks(sbi, segno, sbi->segs_per_sec);
+
+ mtime = div_u64(mtime, sbi->segs_per_sec);
+ vblocks = div_u64(vblocks, sbi->segs_per_sec);
+
+ u = (vblocks * 100) >> sbi->log_blocks_per_seg;
+
+ /* Handle if the system time is changed by user */
+ if (mtime < sit_i->min_mtime)
+ sit_i->min_mtime = mtime;
+ if (mtime > sit_i->max_mtime)
+ sit_i->max_mtime = mtime;
+ if (sit_i->max_mtime != sit_i->min_mtime)
+ age = 100 - div64_u64(100 * (mtime - sit_i->min_mtime),
+ sit_i->max_mtime - sit_i->min_mtime);
+
+ return UINT_MAX - ((100 * (100 - u) * age) / (100 + u));
+}
+
+static unsigned int get_gc_cost(struct f2fs_sb_info *sbi, unsigned int segno,
+ struct victim_sel_policy *p)
+{
+ if (p->alloc_mode == SSR)
+ return get_seg_entry(sbi, segno)->ckpt_valid_blocks;
+
+ /* alloc_mode == LFS */
+ if (p->gc_mode == GC_GREEDY)
+ return get_valid_blocks(sbi, segno, sbi->segs_per_sec);
+ else
+ return get_cb_cost(sbi, segno);
+}
+
+/**
+ * This function is called from two pathes.
+ * One is garbage collection and the other is SSR segment selection.
+ * When it is called during GC, it just gets a victim segment
+ * and it does not remove it from dirty seglist.
+ * When it is called from SSR segment selection, it finds a segment
+ * which has minimum valid blocks and removes it from dirty seglist.
+ */
+static int get_victim_by_default(struct f2fs_sb_info *sbi,
+ unsigned int *result, int gc_type, int type, char alloc_mode)
+{
+ struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
+ struct victim_sel_policy p;
+ unsigned int segno;
+ int nsearched = 0;
+
+ p.alloc_mode = alloc_mode;
+ select_policy(sbi, gc_type, type, &p);
+
+ p.min_segno = NULL_SEGNO;
+ p.min_cost = get_max_cost(sbi, &p);
+
+ mutex_lock(&dirty_i->seglist_lock);
+
+ if (p.alloc_mode == LFS && gc_type == FG_GC) {
+ p.min_segno = check_bg_victims(sbi);
+ if (p.min_segno != NULL_SEGNO)
+ goto got_it;
+ }
+
+ while (1) {
+ unsigned long cost;
+
+ segno = find_next_bit(p.dirty_segmap,
+ TOTAL_SEGS(sbi), p.offset);
+ if (segno >= TOTAL_SEGS(sbi)) {
+ if (sbi->last_victim[p.gc_mode]) {
+ sbi->last_victim[p.gc_mode] = 0;
+ p.offset = 0;
+ continue;
+ }
+ break;
+ }
+ p.offset = ((segno / p.ofs_unit) * p.ofs_unit) + p.ofs_unit;
+
+ if (test_bit(segno, dirty_i->victim_segmap[FG_GC]))
+ continue;
+ if (gc_type == BG_GC &&
+ test_bit(segno, dirty_i->victim_segmap[BG_GC]))
+ continue;
+ if (IS_CURSEC(sbi, GET_SECNO(sbi, segno)))
+ continue;
+
+ cost = get_gc_cost(sbi, segno, &p);
+
+ if (p.min_cost > cost) {
+ p.min_segno = segno;
+ p.min_cost = cost;
+ }
+
+ if (cost == get_max_cost(sbi, &p))
+ continue;
+
+ if (nsearched++ >= MAX_VICTIM_SEARCH) {
+ sbi->last_victim[p.gc_mode] = segno;
+ break;
+ }
+ }
+got_it:
+ if (p.min_segno != NULL_SEGNO) {
+ *result = (p.min_segno / p.ofs_unit) * p.ofs_unit;
+ if (p.alloc_mode == LFS) {
+ int i;
+ for (i = 0; i < p.ofs_unit; i++)
+ set_bit(*result + i,
+ dirty_i->victim_segmap[gc_type]);
+ }
+ }
+ mutex_unlock(&dirty_i->seglist_lock);
+
+ return (p.min_segno == NULL_SEGNO) ? 0 : 1;
+}
+
+static const struct victim_selection default_v_ops = {
+ .get_victim = get_victim_by_default,
+};
+
+static struct inode *find_gc_inode(nid_t ino, struct list_head *ilist)
+{
+ struct list_head *this;
+ struct inode_entry *ie;
+
+ list_for_each(this, ilist) {
+ ie = list_entry(this, struct inode_entry, list);
+ if (ie->inode->i_ino == ino)
+ return ie->inode;
+ }
+ return NULL;
+}
+
+static void add_gc_inode(struct inode *inode, struct list_head *ilist)
+{
+ struct list_head *this;
+ struct inode_entry *new_ie, *ie;
+
+ list_for_each(this, ilist) {
+ ie = list_entry(this, struct inode_entry, list);
+ if (ie->inode == inode) {
+ iput(inode);
+ return;
+ }
+ }
+repeat:
+ new_ie = kmem_cache_alloc(winode_slab, GFP_NOFS);
+ if (!new_ie) {
+ cond_resched();
+ goto repeat;
+ }
+ new_ie->inode = inode;
+ list_add_tail(&new_ie->list, ilist);
+}
+
+static void put_gc_inode(struct list_head *ilist)
+{
+ struct inode_entry *ie, *next_ie;
+ list_for_each_entry_safe(ie, next_ie, ilist, list) {
+ iput(ie->inode);
+ list_del(&ie->list);
+ kmem_cache_free(winode_slab, ie);
+ }
+}
+
+static int check_valid_map(struct f2fs_sb_info *sbi,
+ unsigned int segno, int offset)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ struct seg_entry *sentry;
+ int ret;
+
+ mutex_lock(&sit_i->sentry_lock);
+ sentry = get_seg_entry(sbi, segno);
+ ret = f2fs_test_bit(offset, sentry->cur_valid_map);
+ mutex_unlock(&sit_i->sentry_lock);
+ return ret ? GC_OK : GC_NEXT;
+}
+
+/**
+ * This function compares node address got in summary with that in NAT.
+ * On validity, copy that node with cold status, otherwise (invalid node)
+ * ignore that.
+ */
+static int gc_node_segment(struct f2fs_sb_info *sbi,
+ struct f2fs_summary *sum, unsigned int segno, int gc_type)
+{
+ bool initial = true;
+ struct f2fs_summary *entry;
+ int off;
+
+next_step:
+ entry = sum;
+ for (off = 0; off < sbi->blocks_per_seg; off++, entry++) {
+ nid_t nid = le32_to_cpu(entry->nid);
+ struct page *node_page;
+ int err;
+
+ /*
+ * It makes sure that free segments are able to write
+ * all the dirty node pages before CP after this CP.
+ * So let's check the space of dirty node pages.
+ */
+ if (should_do_checkpoint(sbi)) {
+ mutex_lock(&sbi->cp_mutex);
+ block_operations(sbi);
+ return GC_BLOCKED;
+ }
+
+ err = check_valid_map(sbi, segno, off);
+ if (err == GC_ERROR)
+ return err;
+ else if (err == GC_NEXT)
+ continue;
+
+ if (initial) {
+ ra_node_page(sbi, nid);
+ continue;
+ }
+ node_page = get_node_page(sbi, nid);
+ if (IS_ERR(node_page))
+ continue;
+
+ /* set page dirty and write it */
+ if (!PageWriteback(node_page))
+ set_page_dirty(node_page);
+ f2fs_put_page(node_page, 1);
+ stat_inc_node_blk_count(sbi, 1);
+ }
+ if (initial) {
+ initial = false;
+ goto next_step;
+ }
+
+ if (gc_type == FG_GC) {
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_ALL,
+ .nr_to_write = LONG_MAX,
+ .for_reclaim = 0,
+ };
+ sync_node_pages(sbi, 0, &wbc);
+ }
+ return GC_DONE;
+}
+
+/**
+ * Calculate start block index that this node page contains
+ */
+block_t start_bidx_of_node(unsigned int node_ofs)
+{
+ block_t start_bidx;
+ unsigned int bidx, indirect_blks;
+ int dec;
+
+ indirect_blks = 2 * NIDS_PER_BLOCK + 4;
+
+ start_bidx = 1;
+ if (node_ofs == 0) {
+ start_bidx = 0;
+ } else if (node_ofs <= 2) {
+ bidx = node_ofs - 1;
+ } else if (node_ofs <= indirect_blks) {
+ dec = (node_ofs - 4) / (NIDS_PER_BLOCK + 1);
+ bidx = node_ofs - 2 - dec;
+ } else {
+ dec = (node_ofs - indirect_blks - 3) / (NIDS_PER_BLOCK + 1);
+ bidx = node_ofs - 5 - dec;
+ }
+
+ if (start_bidx)
+ start_bidx = bidx * ADDRS_PER_BLOCK + ADDRS_PER_INODE;
+ return start_bidx;
+}
+
+static int check_dnode(struct f2fs_sb_info *sbi, struct f2fs_summary *sum,
+ struct node_info *dni, block_t blkaddr, unsigned int *nofs)
+{
+ struct page *node_page;
+ nid_t nid;
+ unsigned int ofs_in_node;
+ block_t source_blkaddr;
+
+ nid = le32_to_cpu(sum->nid);
+ ofs_in_node = le16_to_cpu(sum->ofs_in_node);
+
+ node_page = get_node_page(sbi, nid);
+ if (IS_ERR(node_page))
+ return GC_NEXT;
+
+ get_node_info(sbi, nid, dni);
+
+ if (sum->version != dni->version) {
+ f2fs_put_page(node_page, 1);
+ return GC_NEXT;
+ }
+
+ *nofs = ofs_of_node(node_page);
+ source_blkaddr = datablock_addr(node_page, ofs_in_node);
+ f2fs_put_page(node_page, 1);
+
+ if (source_blkaddr != blkaddr)
+ return GC_NEXT;
+ return GC_OK;
+}
+
+static void move_data_page(struct inode *inode, struct page *page, int gc_type)
+{
+ if (page->mapping != inode->i_mapping)
+ goto out;
+
+ if (inode != page->mapping->host)
+ goto out;
+
+ if (PageWriteback(page))
+ goto out;
+
+ if (gc_type == BG_GC) {
+ set_page_dirty(page);
+ set_cold_data(page);
+ } else {
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ mutex_lock_op(sbi, DATA_WRITE);
+ if (clear_page_dirty_for_io(page) &&
+ S_ISDIR(inode->i_mode)) {
+ dec_page_count(sbi, F2FS_DIRTY_DENTS);
+ inode_dec_dirty_dents(inode);
+ }
+ set_cold_data(page);
+ do_write_data_page(page);
+ mutex_unlock_op(sbi, DATA_WRITE);
+ clear_cold_data(page);
+ }
+out:
+ f2fs_put_page(page, 1);
+}
+
+/**
+ * This function tries to get parent node of victim data block, and identifies
+ * data block validity. If the block is valid, copy that with cold status and
+ * modify parent node.
+ * If the parent node is not valid or the data block address is different,
+ * the victim data block is ignored.
+ */
+static int gc_data_segment(struct f2fs_sb_info *sbi, struct f2fs_summary *sum,
+ struct list_head *ilist, unsigned int segno, int gc_type)
+{
+ struct super_block *sb = sbi->sb;
+ struct f2fs_summary *entry;
+ block_t start_addr;
+ int err, off;
+ int phase = 0;
+
+ start_addr = START_BLOCK(sbi, segno);
+
+next_step:
+ entry = sum;
+ for (off = 0; off < sbi->blocks_per_seg; off++, entry++) {
+ struct page *data_page;
+ struct inode *inode;
+ struct node_info dni; /* dnode info for the data */
+ unsigned int ofs_in_node, nofs;
+ block_t start_bidx;
+
+ /*
+ * It makes sure that free segments are able to write
+ * all the dirty node pages before CP after this CP.
+ * So let's check the space of dirty node pages.
+ */
+ if (should_do_checkpoint(sbi)) {
+ mutex_lock(&sbi->cp_mutex);
+ block_operations(sbi);
+ err = GC_BLOCKED;
+ goto stop;
+ }
+
+ err = check_valid_map(sbi, segno, off);
+ if (err == GC_ERROR)
+ goto stop;
+ else if (err == GC_NEXT)
+ continue;
+
+ if (phase == 0) {
+ ra_node_page(sbi, le32_to_cpu(entry->nid));
+ continue;
+ }
+
+ /* Get an inode by ino with checking validity */
+ err = check_dnode(sbi, entry, &dni, start_addr + off, &nofs);
+ if (err == GC_ERROR)
+ goto stop;
+ else if (err == GC_NEXT)
+ continue;
+
+ if (phase == 1) {
+ ra_node_page(sbi, dni.ino);
+ continue;
+ }
+
+ start_bidx = start_bidx_of_node(nofs);
+ ofs_in_node = le16_to_cpu(entry->ofs_in_node);
+
+ if (phase == 2) {
+ inode = f2fs_iget_nowait(sb, dni.ino);
+ if (IS_ERR(inode))
+ continue;
+
+ data_page = find_data_page(inode,
+ start_bidx + ofs_in_node);
+ if (IS_ERR(data_page))
+ goto next_iput;
+
+ f2fs_put_page(data_page, 0);
+ add_gc_inode(inode, ilist);
+ } else {
+ inode = find_gc_inode(dni.ino, ilist);
+ if (inode) {
+ data_page = get_lock_data_page(inode,
+ start_bidx + ofs_in_node);
+ if (IS_ERR(data_page))
+ continue;
+ move_data_page(inode, data_page, gc_type);
+ stat_inc_data_blk_count(sbi, 1);
+ }
+ }
+ continue;
+next_iput:
+ iput(inode);
+ }
+ if (++phase < 4)
+ goto next_step;
+ err = GC_DONE;
+stop:
+ if (gc_type == FG_GC)
+ f2fs_submit_bio(sbi, DATA, true);
+ return err;
+}
+
+static int __get_victim(struct f2fs_sb_info *sbi, unsigned int *victim,
+ int gc_type, int type)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ int ret;
+ mutex_lock(&sit_i->sentry_lock);
+ ret = DIRTY_I(sbi)->v_ops->get_victim(sbi, victim, gc_type, type, LFS);
+ mutex_unlock(&sit_i->sentry_lock);
+ return ret;
+}
+
+static int do_garbage_collect(struct f2fs_sb_info *sbi, unsigned int segno,
+ struct list_head *ilist, int gc_type)
+{
+ struct page *sum_page;
+ struct f2fs_summary_block *sum;
+ int ret = GC_DONE;
+
+ /* read segment summary of victim */
+ sum_page = get_sum_page(sbi, segno);
+ if (IS_ERR(sum_page))
+ return GC_ERROR;
+
+ /*
+ * CP needs to lock sum_page. In this time, we don't need
+ * to lock this page, because this summary page is not gone anywhere.
+ * Also, this page is not gonna be updated before GC is done.
+ */
+ unlock_page(sum_page);
+ sum = page_address(sum_page);
+
+ switch (GET_SUM_TYPE((&sum->footer))) {
+ case SUM_TYPE_NODE:
+ ret = gc_node_segment(sbi, sum->entries, segno, gc_type);
+ break;
+ case SUM_TYPE_DATA:
+ ret = gc_data_segment(sbi, sum->entries, ilist, segno, gc_type);
+ break;
+ }
+ stat_inc_seg_count(sbi, GET_SUM_TYPE((&sum->footer)));
+ stat_inc_call_count(sbi->stat_info);
+
+ f2fs_put_page(sum_page, 0);
+ return ret;
+}
+
+int f2fs_gc(struct f2fs_sb_info *sbi, int nGC)
+{
+ unsigned int segno;
+ int old_free_secs, cur_free_secs;
+ int gc_status, nfree;
+ struct list_head ilist;
+ int gc_type = BG_GC;
+
+ INIT_LIST_HEAD(&ilist);
+gc_more:
+ nfree = 0;
+ gc_status = GC_NONE;
+
+ if (has_not_enough_free_secs(sbi))
+ old_free_secs = reserved_sections(sbi);
+ else
+ old_free_secs = free_sections(sbi);
+
+ while (sbi->sb->s_flags & MS_ACTIVE) {
+ int i;
+ if (has_not_enough_free_secs(sbi))
+ gc_type = FG_GC;
+
+ cur_free_secs = free_sections(sbi) + nfree;
+
+ /* We got free space successfully. */
+ if (nGC < cur_free_secs - old_free_secs)
+ break;
+
+ if (!__get_victim(sbi, &segno, gc_type, NO_CHECK_TYPE))
+ break;
+
+ for (i = 0; i < sbi->segs_per_sec; i++) {
+ /*
+ * do_garbage_collect will give us three gc_status:
+ * GC_ERROR, GC_DONE, and GC_BLOCKED.
+ * If GC is finished uncleanly, we have to return
+ * the victim to dirty segment list.
+ */
+ gc_status = do_garbage_collect(sbi, segno + i,
+ &ilist, gc_type);
+ if (gc_status != GC_DONE)
+ goto stop;
+ nfree++;
+ }
+ }
+stop:
+ if (has_not_enough_free_secs(sbi) || gc_status == GC_BLOCKED) {
+ write_checkpoint(sbi, (gc_status == GC_BLOCKED), false);
+ if (nfree)
+ goto gc_more;
+ }
+ mutex_unlock(&sbi->gc_mutex);
+
+ put_gc_inode(&ilist);
+ BUG_ON(!list_empty(&ilist));
+ return gc_status;
+}
+
+void build_gc_manager(struct f2fs_sb_info *sbi)
+{
+ DIRTY_I(sbi)->v_ops = &default_v_ops;
+}
+
+int create_gc_caches(void)
+{
+ winode_slab = f2fs_kmem_cache_create("f2fs_gc_inodes",
+ sizeof(struct inode_entry), NULL);
+ if (!winode_slab)
+ return -ENOMEM;
+ return 0;
+}
+
+void destroy_gc_caches(void)
+{
+ kmem_cache_destroy(winode_slab);
+}
diff --git a/fs/f2fs/gc.h b/fs/f2fs/gc.h
new file mode 100644
index 0000000..cf42a55
--- /dev/null
+++ b/fs/f2fs/gc.h
@@ -0,0 +1,117 @@
+/**
+ * fs/f2fs/gc.h
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#define GC_THREAD_NAME "f2fs_gc_task"
+#define GC_THREAD_MIN_WB_PAGES 1 /*
+ * a threshold to determine
+ * whether IO subsystem is idle
+ * or not
+ */
+#define GC_THREAD_MIN_SLEEP_TIME 10000 /* milliseconds */
+#define GC_THREAD_MAX_SLEEP_TIME 30000
+#define GC_THREAD_NOGC_SLEEP_TIME 10000
+#define LIMIT_INVALID_BLOCK 40 /* percentage over total user space */
+#define LIMIT_FREE_BLOCK 40 /* percentage over invalid + free space */
+
+/* Search max. number of dirty segments to select a victim segment */
+#define MAX_VICTIM_SEARCH 20
+
+enum {
+ GC_NONE = 0,
+ GC_ERROR,
+ GC_OK,
+ GC_NEXT,
+ GC_BLOCKED,
+ GC_DONE,
+};
+
+struct f2fs_gc_kthread {
+ struct task_struct *f2fs_gc_task;
+ wait_queue_head_t gc_wait_queue_head;
+};
+
+struct inode_entry {
+ struct list_head list;
+ struct inode *inode;
+};
+
+/**
+ * inline functions
+ */
+static inline block_t free_user_blocks(struct f2fs_sb_info *sbi)
+{
+ if (free_segments(sbi) < overprovision_segments(sbi))
+ return 0;
+ else
+ return (free_segments(sbi) - overprovision_segments(sbi))
+ << sbi->log_blocks_per_seg;
+}
+
+static inline block_t limit_invalid_user_blocks(struct f2fs_sb_info *sbi)
+{
+ return (long)(sbi->user_block_count * LIMIT_INVALID_BLOCK) / 100;
+}
+
+static inline block_t limit_free_user_blocks(struct f2fs_sb_info *sbi)
+{
+ block_t reclaimable_user_blocks = sbi->user_block_count -
+ written_block_count(sbi);
+ return (long)(reclaimable_user_blocks * LIMIT_FREE_BLOCK) / 100;
+}
+
+static inline long increase_sleep_time(long wait)
+{
+ wait += GC_THREAD_MIN_SLEEP_TIME;
+ if (wait > GC_THREAD_MAX_SLEEP_TIME)
+ wait = GC_THREAD_MAX_SLEEP_TIME;
+ return wait;
+}
+
+static inline long decrease_sleep_time(long wait)
+{
+ wait -= GC_THREAD_MIN_SLEEP_TIME;
+ if (wait <= GC_THREAD_MIN_SLEEP_TIME)
+ wait = GC_THREAD_MIN_SLEEP_TIME;
+ return wait;
+}
+
+static inline bool has_enough_invalid_blocks(struct f2fs_sb_info *sbi)
+{
+ block_t invalid_user_blocks = sbi->user_block_count -
+ written_block_count(sbi);
+ /*
+ * Background GC is triggered with the following condition.
+ * 1. There are a number of invalid blocks.
+ * 2. There is not enough free space.
+ */
+ if (invalid_user_blocks > limit_invalid_user_blocks(sbi) &&
+ free_user_blocks(sbi) < limit_free_user_blocks(sbi))
+ return true;
+ return false;
+}
+
+static inline int is_idle(struct f2fs_sb_info *sbi)
+{
+ struct block_device *bdev = sbi->sb->s_bdev;
+ struct request_queue *q = bdev_get_queue(bdev);
+ struct request_list *rl = &q->root_rl;
+ return !(rl->count[BLK_RW_SYNC]) && !(rl->count[BLK_RW_ASYNC]);
+}
+
+static inline bool should_do_checkpoint(struct f2fs_sb_info *sbi)
+{
+ unsigned int pages_per_sec = sbi->segs_per_sec *
+ (1 << sbi->log_blocks_per_seg);
+ int node_secs = ((get_pages(sbi, F2FS_DIRTY_NODES) + pages_per_sec - 1)
+ >> sbi->log_blocks_per_seg) / sbi->segs_per_sec;
+ int dent_secs = ((get_pages(sbi, F2FS_DIRTY_DENTS) + pages_per_sec - 1)
+ >> sbi->log_blocks_per_seg) / sbi->segs_per_sec;
+ return free_sections(sbi) <= (node_secs + 2 * dent_secs + 2);
+}
--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-10-31 09:49:15

[permalink] [raw]

Subject: [PATCH 15/17] f2fs: add recovery routines for roll-forward

This adds roll-forward routines to recover fsynced data.

- F2FS uses basically roll-back model with checkpointing.

- In order to implement fsync(), there are two approaches as follows.

1. A roll-back model with checkpointing at every fsync()
: This is a naive method, but suffers from very low performance.

2. A roll-forward model
: F2FS adopts this model where all the fsynced data should be recovered, which
were written after checkpointing was done. In order to figure out the data,
F2FS keeps a "fsync" mark in direct node blocks. In addition, F2FS remains
the location of next node block in each direct node block for reconstructing
the chain of node blocks during the recovery.

- In order to enhance the performance, F2FS keeps a "dentry" mark also in direct
node blocks. If this is set during the recovery, F2FS replays adding a dentry.

Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/recovery.c | 375 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 375 insertions(+)
create mode 100644 fs/f2fs/recovery.c

diff --git a/fs/f2fs/recovery.c b/fs/f2fs/recovery.c
new file mode 100644
index 0000000..7a43df0
--- /dev/null
+++ b/fs/f2fs/recovery.c
@@ -0,0 +1,375 @@
+/**
+ * fs/f2fs/recovery.c
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/fs.h>
+#include <linux/f2fs_fs.h>
+#include "f2fs.h"
+#include "node.h"
+#include "segment.h"
+
+static struct kmem_cache *fsync_entry_slab;
+
+bool space_for_roll_forward(struct f2fs_sb_info *sbi)
+{
+ if (sbi->last_valid_block_count + sbi->alloc_valid_block_count
+ > sbi->user_block_count)
+ return false;
+ return true;
+}
+
+static struct fsync_inode_entry *get_fsync_inode(struct list_head *head,
+ nid_t ino)
+{
+ struct list_head *this;
+ struct fsync_inode_entry *entry;
+
+ list_for_each(this, head) {
+ entry = list_entry(this, struct fsync_inode_entry, list);
+ if (entry->inode->i_ino == ino)
+ return entry;
+ }
+ return NULL;
+}
+
+static int recover_dentry(struct page *ipage, struct inode *inode)
+{
+ struct f2fs_node *raw_node = (struct f2fs_node *)kmap(ipage);
+ struct f2fs_inode *raw_inode = &(raw_node->i);
+ struct dentry dent, parent;
+ struct f2fs_dir_entry *de;
+ struct page *page;
+ struct inode *dir;
+ int err = 0;
+
+ if (!is_dent_dnode(ipage))
+ goto out;
+
+ dir = f2fs_iget(inode->i_sb, le32_to_cpu(raw_inode->i_pino));
+ if (IS_ERR(dir)) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ parent.d_inode = dir;
+ dent.d_parent = &parent;
+ dent.d_name.len = le32_to_cpu(raw_inode->i_namelen);
+ dent.d_name.name = raw_inode->i_name;
+
+ de = f2fs_find_entry(dir, &dent.d_name, &page);
+ if (de) {
+ kunmap(page);
+ f2fs_put_page(page, 0);
+ } else {
+ f2fs_add_link(&dent, inode);
+ }
+ iput(dir);
+out:
+ kunmap(ipage);
+ return err;
+}
+
+static int recover_inode(struct inode *inode, struct page *node_page)
+{
+ void *kaddr = page_address(node_page);
+ struct f2fs_node *raw_node = (struct f2fs_node *)kaddr;
+ struct f2fs_inode *raw_inode = &(raw_node->i);
+
+ inode->i_mode = le32_to_cpu(raw_inode->i_mode);
+ i_size_write(inode, le64_to_cpu(raw_inode->i_size));
+ inode->i_atime.tv_sec = le64_to_cpu(raw_inode->i_mtime);
+ inode->i_ctime.tv_sec = le64_to_cpu(raw_inode->i_ctime);
+ inode->i_mtime.tv_sec = le64_to_cpu(raw_inode->i_mtime);
+ inode->i_atime.tv_nsec = le32_to_cpu(raw_inode->i_mtime_nsec);
+ inode->i_ctime.tv_nsec = le32_to_cpu(raw_inode->i_ctime_nsec);
+ inode->i_mtime.tv_nsec = le32_to_cpu(raw_inode->i_mtime_nsec);
+
+ return recover_dentry(node_page, inode);
+}
+
+static int find_fsync_dnodes(struct f2fs_sb_info *sbi, struct list_head *head)
+{
+ unsigned long long cp_ver = le64_to_cpu(sbi->ckpt->checkpoint_ver);
+ struct curseg_info *curseg;
+ struct page *page;
+ block_t blkaddr;
+ int err = 0;
+
+ /* get node pages in the current segment */
+ curseg = CURSEG_I(sbi, CURSEG_WARM_NODE);
+ blkaddr = START_BLOCK(sbi, curseg->segno) + curseg->next_blkoff;
+
+ /* read node page */
+ page = alloc_page(GFP_F2FS_ZERO);
+ if (IS_ERR(page))
+ return PTR_ERR(page);
+ lock_page(page);
+
+ while (1) {
+ struct fsync_inode_entry *entry;
+
+ if (f2fs_readpage(sbi, page, blkaddr, READ_SYNC))
+ goto out;
+
+ if (cp_ver != cpver_of_node(page))
+ goto out;
+
+ if (!is_fsync_dnode(page))
+ goto next;
+
+ entry = get_fsync_inode(head, ino_of_node(page));
+ if (entry) {
+ entry->blkaddr = blkaddr;
+ if (IS_INODE(page) && is_dent_dnode(page))
+ set_inode_flag(F2FS_I(entry->inode),
+ FI_INC_LINK);
+ } else {
+ if (IS_INODE(page) && is_dent_dnode(page)) {
+ if (recover_inode_page(sbi, page)) {
+ err = -ENOMEM;
+ goto out;
+ }
+ }
+
+ /* add this fsync inode to the list */
+ entry = kmem_cache_alloc(fsync_entry_slab, GFP_NOFS);
+ if (!entry) {
+ err = -ENOMEM;
+ goto out;
+ }
+
+ INIT_LIST_HEAD(&entry->list);
+ list_add_tail(&entry->list, head);
+
+ entry->inode = f2fs_iget(sbi->sb, ino_of_node(page));
+ if (IS_ERR(entry->inode)) {
+ err = PTR_ERR(entry->inode);
+ goto out;
+ }
+ entry->blkaddr = blkaddr;
+ }
+ if (IS_INODE(page)) {
+ err = recover_inode(entry->inode, page);
+ if (err)
+ goto out;
+ }
+next:
+ /* check next segment */
+ blkaddr = next_blkaddr_of_node(page);
+ ClearPageUptodate(page);
+ }
+out:
+ unlock_page(page);
+ __free_pages(page, 0);
+ return err;
+}
+
+static void destroy_fsync_dnodes(struct f2fs_sb_info *sbi,
+ struct list_head *head)
+{
+ struct list_head *this;
+ struct fsync_inode_entry *entry;
+ list_for_each(this, head) {
+ entry = list_entry(this, struct fsync_inode_entry, list);
+ iput(entry->inode);
+ list_del(&entry->list);
+ kmem_cache_free(fsync_entry_slab, entry);
+ }
+}
+
+static void check_index_in_prev_nodes(struct f2fs_sb_info *sbi,
+ block_t blkaddr)
+{
+ struct seg_entry *sentry;
+ unsigned int segno = GET_SEGNO(sbi, blkaddr);
+ unsigned short blkoff = GET_SEGOFF_FROM_SEG0(sbi, blkaddr) &
+ (sbi->blocks_per_seg - 1);
+ struct f2fs_summary sum;
+ nid_t ino;
+ void *kaddr;
+ struct inode *inode;
+ struct page *node_page;
+ block_t bidx;
+ int i;
+
+ sentry = get_seg_entry(sbi, segno);
+ if (!f2fs_test_bit(blkoff, sentry->cur_valid_map))
+ return;
+
+ /* Get the previous summary */
+ for (i = CURSEG_WARM_DATA; i <= CURSEG_COLD_DATA; i++) {
+ struct curseg_info *curseg = CURSEG_I(sbi, i);
+ if (curseg->segno == segno) {
+ sum = curseg->sum_blk->entries[blkoff];
+ break;
+ }
+ }
+ if (i > CURSEG_COLD_DATA) {
+ struct page *sum_page = get_sum_page(sbi, segno);
+ struct f2fs_summary_block *sum_node;
+ kaddr = page_address(sum_page);
+ sum_node = (struct f2fs_summary_block *)kaddr;
+ sum = sum_node->entries[blkoff];
+ f2fs_put_page(sum_page, 1);
+ }
+
+ /* Get the node page */
+ node_page = get_node_page(sbi, le32_to_cpu(sum.nid));
+ bidx = start_bidx_of_node(ofs_of_node(node_page)) +
+ le16_to_cpu(sum.ofs_in_node);
+ ino = ino_of_node(node_page);
+ f2fs_put_page(node_page, 1);
+
+ /* Deallocate previous index in the node page */
+ inode = f2fs_iget_nowait(sbi->sb, ino);
+ truncate_hole(inode, bidx, bidx + 1);
+ iput(inode);
+}
+
+static void do_recover_data(struct f2fs_sb_info *sbi, struct inode *inode,
+ struct page *page, block_t blkaddr)
+{
+ unsigned int start, end;
+ struct dnode_of_data dn;
+ struct f2fs_summary sum;
+ struct node_info ni;
+
+ start = start_bidx_of_node(ofs_of_node(page));
+ if (IS_INODE(page))
+ end = start + ADDRS_PER_INODE;
+ else
+ end = start + ADDRS_PER_BLOCK;
+
+ set_new_dnode(&dn, inode, NULL, NULL, 0);
+ if (get_dnode_of_data(&dn, start, 0))
+ return;
+
+ wait_on_page_writeback(dn.node_page);
+
+ get_node_info(sbi, dn.nid, &ni);
+ BUG_ON(ni.ino != ino_of_node(page));
+ BUG_ON(ofs_of_node(dn.node_page) != ofs_of_node(page));
+
+ for (; start < end; start++) {
+ block_t src, dest;
+
+ src = datablock_addr(dn.node_page, dn.ofs_in_node);
+ dest = datablock_addr(page, dn.ofs_in_node);
+
+ if (src != dest && dest != NEW_ADDR && dest != NULL_ADDR) {
+ if (src == NULL_ADDR) {
+ int err = reserve_new_block(&dn);
+ /* We should not get -ENOSPC */
+ BUG_ON(err);
+ }
+
+ /* Check the previous node page having this index */
+ check_index_in_prev_nodes(sbi, dest);
+
+ set_summary(&sum, dn.nid, dn.ofs_in_node, ni.version);
+
+ /* write dummy data page */
+ recover_data_page(sbi, NULL, &sum, src, dest);
+ update_extent_cache(dest, &dn);
+ }
+ dn.ofs_in_node++;
+ }
+
+ /* write node page in place */
+ set_summary(&sum, dn.nid, 0, 0);
+ if (IS_INODE(dn.node_page))
+ sync_inode_page(&dn);
+
+ copy_node_footer(dn.node_page, page);
+ fill_node_footer(dn.node_page, dn.nid, ni.ino,
+ ofs_of_node(page), false);
+ set_page_dirty(dn.node_page);
+
+ recover_node_page(sbi, dn.node_page, &sum, &ni, blkaddr);
+ f2fs_put_dnode(&dn);
+}
+
+static void recover_data(struct f2fs_sb_info *sbi,
+ struct list_head *head, int type)
+{
+ unsigned long long cp_ver = le64_to_cpu(sbi->ckpt->checkpoint_ver);
+ struct curseg_info *curseg;
+ struct page *page;
+ block_t blkaddr;
+
+ /* get node pages in the current segment */
+ curseg = CURSEG_I(sbi, type);
+ blkaddr = NEXT_FREE_BLKADDR(sbi, curseg);
+
+ /* read node page */
+ page = alloc_page(GFP_NOFS | __GFP_ZERO);
+ if (IS_ERR(page))
+ return;
+ lock_page(page);
+
+ while (1) {
+ struct fsync_inode_entry *entry;
+
+ if (f2fs_readpage(sbi, page, blkaddr, READ_SYNC))
+ goto out;
+
+ if (cp_ver != cpver_of_node(page))
+ goto out;
+
+ entry = get_fsync_inode(head, ino_of_node(page));
+ if (!entry)
+ goto next;
+
+ do_recover_data(sbi, entry->inode, page, blkaddr);
+
+ if (entry->blkaddr == blkaddr) {
+ iput(entry->inode);
+ list_del(&entry->list);
+ kmem_cache_free(fsync_entry_slab, entry);
+ }
+next:
+ /* check next segment */
+ blkaddr = next_blkaddr_of_node(page);
+ ClearPageUptodate(page);
+ }
+out:
+ unlock_page(page);
+ __free_pages(page, 0);
+
+ allocate_new_segments(sbi);
+}
+
+void recover_fsync_data(struct f2fs_sb_info *sbi)
+{
+ struct list_head inode_list;
+
+ fsync_entry_slab = f2fs_kmem_cache_create("f2fs_fsync_inode_entry",
+ sizeof(struct fsync_inode_entry), NULL);
+ if (unlikely(!fsync_entry_slab))
+ return;
+
+ INIT_LIST_HEAD(&inode_list);
+
+ /* step #1: find fsynced inode numbers */
+ if (find_fsync_dnodes(sbi, &inode_list))
+ goto out;
+
+ if (list_empty(&inode_list))
+ goto out;
+
+ /* step #2: recover data */
+ sbi->por_doing = 1;
+ recover_data(sbi, &inode_list, CURSEG_WARM_NODE);
+ sbi->por_doing = 0;
+ BUG_ON(!list_empty(&inode_list));
+out:
+ destroy_fsync_dnodes(sbi, &inode_list);
+ kmem_cache_destroy(fsync_entry_slab);
+ write_checkpoint(sbi, false, false);
+}
--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-10-31 09:49:52

[permalink] [raw]

Subject: [PATCH 16/17] f2fs: move proc files to debugfs

This moves all of the f2fs debugging files into debugfs. The files are
located in /sys/kernel/debug/f2fs/

Note, I think we are generating all of the same information in each of
the files for every unique f2fs filesystem in the machine. This copies
the functionality that was present in the proc files, but this should be
fixed up in the future.

Signed-off-by: Greg Kroah-Hartman <[email protected]>
Acked-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/debug.c | 361 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 361 insertions(+)
create mode 100644 fs/f2fs/debug.c

diff --git a/fs/f2fs/debug.c b/fs/f2fs/debug.c
new file mode 100644
index 0000000..a56181c
--- /dev/null
+++ b/fs/f2fs/debug.c
@@ -0,0 +1,361 @@
+/**
+ * f2fs debugging statistics
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ * Copyright (c) 2012 Linux Foundation
+ * Copyright (c) 2012 Greg Kroah-Hartman <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/backing-dev.h>
+#include <linux/proc_fs.h>
+#include <linux/f2fs_fs.h>
+#include <linux/blkdev.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+
+#include "f2fs.h"
+#include "node.h"
+#include "segment.h"
+#include "gc.h"
+
+static LIST_HEAD(f2fs_stat_list);
+static struct dentry *debugfs_root;
+
+void update_general_status(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_stat_info *si = sbi->stat_info;
+ int i;
+
+ /* valid check of the segment numbers */
+ si->hit_ext = sbi->read_hit_ext;
+ si->total_ext = sbi->total_hit_ext;
+ si->ndirty_node = get_pages(sbi, F2FS_DIRTY_NODES);
+ si->ndirty_dent = get_pages(sbi, F2FS_DIRTY_DENTS);
+ si->ndirty_dirs = sbi->n_dirty_dirs;
+ si->ndirty_meta = get_pages(sbi, F2FS_DIRTY_META);
+ si->total_count = (int)sbi->user_block_count / sbi->blocks_per_seg;
+ si->rsvd_segs = reserved_segments(sbi);
+ si->overp_segs = overprovision_segments(sbi);
+ si->valid_count = valid_user_blocks(sbi);
+ si->valid_node_count = valid_node_count(sbi);
+ si->valid_inode_count = valid_inode_count(sbi);
+ si->utilization = utilization(sbi);
+
+ si->free_segs = free_segments(sbi);
+ si->free_secs = free_sections(sbi);
+ si->prefree_count = prefree_segments(sbi);
+ si->dirty_count = dirty_segments(sbi);
+ si->node_pages = sbi->node_inode->i_mapping->nrpages;
+ si->meta_pages = sbi->meta_inode->i_mapping->nrpages;
+ si->nats = NM_I(sbi)->nat_cnt;
+ si->sits = SIT_I(sbi)->dirty_sentries;
+ si->fnids = NM_I(sbi)->fcnt;
+ si->bg_gc = sbi->bg_gc;
+ si->util_free = (int)(free_user_blocks(sbi) >> sbi->log_blocks_per_seg)
+ * 100 / (int)(sbi->user_block_count >> sbi->log_blocks_per_seg)
+ / 2;
+ si->util_valid = (int)(written_block_count(sbi) >>
+ sbi->log_blocks_per_seg)
+ * 100 / (int)(sbi->user_block_count >> sbi->log_blocks_per_seg)
+ / 2;
+ si->util_invalid = 50 - si->util_free - si->util_valid;
+ for (i = CURSEG_HOT_DATA; i <= CURSEG_COLD_NODE; i++) {
+ struct curseg_info *curseg = CURSEG_I(sbi, i);
+ si->curseg[i] = curseg->segno;
+ si->cursec[i] = curseg->segno / sbi->segs_per_sec;
+ si->curzone[i] = si->cursec[i] / sbi->secs_per_zone;
+ }
+
+ for (i = 0; i < 2; i++) {
+ si->segment_count[i] = sbi->segment_count[i];
+ si->block_count[i] = sbi->block_count[i];
+ }
+}
+
+/**
+ * This function calculates BDF of every segments
+ */
+static void update_sit_info(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_stat_info *si = sbi->stat_info;
+ unsigned int blks_per_sec, hblks_per_sec, total_vblocks, bimodal, dist;
+ struct sit_info *sit_i = SIT_I(sbi);
+ unsigned int segno, vblocks;
+ int ndirty = 0;
+
+ bimodal = 0;
+ total_vblocks = 0;
+ blks_per_sec = sbi->segs_per_sec * (1 << sbi->log_blocks_per_seg);
+ hblks_per_sec = blks_per_sec / 2;
+ mutex_lock(&sit_i->sentry_lock);
+ for (segno = 0; segno < TOTAL_SEGS(sbi); segno += sbi->segs_per_sec) {
+ vblocks = get_valid_blocks(sbi, segno, sbi->segs_per_sec);
+ dist = abs(vblocks - hblks_per_sec);
+ bimodal += dist * dist;
+
+ if (vblocks > 0 && vblocks < blks_per_sec) {
+ total_vblocks += vblocks;
+ ndirty++;
+ }
+ }
+ mutex_unlock(&sit_i->sentry_lock);
+ dist = sbi->total_sections * hblks_per_sec * hblks_per_sec / 100;
+ si->bimodal = bimodal / dist;
+ if (si->dirty_count)
+ si->avg_vblocks = total_vblocks / ndirty;
+ else
+ si->avg_vblocks = 0;
+}
+
+/**
+ * This function calculates memory footprint.
+ */
+static void update_mem_info(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_stat_info *si = sbi->stat_info;
+ unsigned npages;
+
+ if (si->base_mem)
+ goto get_cache;
+
+ si->base_mem = sizeof(struct f2fs_sb_info) + sbi->sb->s_blocksize;
+ si->base_mem += 2 * sizeof(struct f2fs_inode_info);
+ si->base_mem += sizeof(*sbi->ckpt);
+
+ /* build sm */
+ si->base_mem += sizeof(struct f2fs_sm_info);
+
+ /* build sit */
+ si->base_mem += sizeof(struct sit_info);
+ si->base_mem += TOTAL_SEGS(sbi) * sizeof(struct seg_entry);
+ si->base_mem += f2fs_bitmap_size(TOTAL_SEGS(sbi));
+ si->base_mem += 2 * SIT_VBLOCK_MAP_SIZE * TOTAL_SEGS(sbi);
+ if (sbi->segs_per_sec > 1)
+ si->base_mem += sbi->total_sections *
+ sizeof(struct sec_entry);
+ si->base_mem += __bitmap_size(sbi, SIT_BITMAP);
+
+ /* build free segmap */
+ si->base_mem += sizeof(struct free_segmap_info);
+ si->base_mem += f2fs_bitmap_size(TOTAL_SEGS(sbi));
+ si->base_mem += f2fs_bitmap_size(sbi->total_sections);
+
+ /* build curseg */
+ si->base_mem += sizeof(struct curseg_info) * NR_CURSEG_TYPE;
+ si->base_mem += PAGE_CACHE_SIZE * NR_CURSEG_TYPE;
+
+ /* build dirty segmap */
+ si->base_mem += sizeof(struct dirty_seglist_info);
+ si->base_mem += NR_DIRTY_TYPE * f2fs_bitmap_size(TOTAL_SEGS(sbi));
+ si->base_mem += 2 * f2fs_bitmap_size(TOTAL_SEGS(sbi));
+
+ /* buld nm */
+ si->base_mem += sizeof(struct f2fs_nm_info);
+ si->base_mem += __bitmap_size(sbi, NAT_BITMAP);
+
+ /* build gc */
+ si->base_mem += sizeof(struct f2fs_gc_kthread);
+
+get_cache:
+ /* free nids */
+ si->cache_mem = NM_I(sbi)->fcnt;
+ si->cache_mem += NM_I(sbi)->nat_cnt;
+ npages = sbi->node_inode->i_mapping->nrpages;
+ si->cache_mem += npages << PAGE_CACHE_SHIFT;
+ npages = sbi->meta_inode->i_mapping->nrpages;
+ si->cache_mem += npages << PAGE_CACHE_SHIFT;
+ si->cache_mem += sbi->n_orphans * sizeof(struct orphan_inode_entry);
+ si->cache_mem += sbi->n_dirty_dirs * sizeof(struct dir_inode_entry);
+}
+
+static int stat_show(struct seq_file *s, void *v)
+{
+ struct f2fs_stat_info *si, *next;
+ int i = 0;
+ int j;
+
+ list_for_each_entry_safe(si, next, &f2fs_stat_list, stat_list) {
+
+ mutex_lock(&si->stat_lock);
+ if (!si->sbi) {
+ mutex_unlock(&si->stat_lock);
+ continue;
+ }
+ update_general_status(si->sbi);
+
+ seq_printf(s, "\n=====[ partition info. #%d ]=====\n", i++);
+ seq_printf(s, "[SB: 1] [CP: 2] [NAT: %d] [SIT: %d] ",
+ si->nat_area_segs, si->sit_area_segs);
+ seq_printf(s, "[SSA: %d] [MAIN: %d",
+ si->ssa_area_segs, si->main_area_segs);
+ seq_printf(s, "(OverProv:%d Resv:%d)]\n\n",
+ si->overp_segs, si->rsvd_segs);
+ seq_printf(s, "Utilization: %d%% (%d valid blocks)\n",
+ si->utilization, si->valid_count);
+ seq_printf(s, " - Node: %u (Inode: %u, ",
+ si->valid_node_count, si->valid_inode_count);
+ seq_printf(s, "Other: %u)\n - Data: %u\n",
+ si->valid_node_count - si->valid_inode_count,
+ si->valid_count - si->valid_node_count);
+ seq_printf(s, "\nMain area: %d segs, %d secs %d zones\n",
+ si->main_area_segs, si->main_area_sections,
+ si->main_area_zones);
+ seq_printf(s, " - COLD data: %d, %d, %d\n",
+ si->curseg[CURSEG_COLD_DATA],
+ si->cursec[CURSEG_COLD_DATA],
+ si->curzone[CURSEG_COLD_DATA]);
+ seq_printf(s, " - WARM data: %d, %d, %d\n",
+ si->curseg[CURSEG_WARM_DATA],
+ si->cursec[CURSEG_WARM_DATA],
+ si->curzone[CURSEG_WARM_DATA]);
+ seq_printf(s, " - HOT data: %d, %d, %d\n",
+ si->curseg[CURSEG_HOT_DATA],
+ si->cursec[CURSEG_HOT_DATA],
+ si->curzone[CURSEG_HOT_DATA]);
+ seq_printf(s, " - Dir dnode: %d, %d, %d\n",
+ si->curseg[CURSEG_HOT_NODE],
+ si->cursec[CURSEG_HOT_NODE],
+ si->curzone[CURSEG_HOT_NODE]);
+ seq_printf(s, " - File dnode: %d, %d, %d\n",
+ si->curseg[CURSEG_WARM_NODE],
+ si->cursec[CURSEG_WARM_NODE],
+ si->curzone[CURSEG_WARM_NODE]);
+ seq_printf(s, " - Indir nodes: %d, %d, %d\n",
+ si->curseg[CURSEG_COLD_NODE],
+ si->cursec[CURSEG_COLD_NODE],
+ si->curzone[CURSEG_COLD_NODE]);
+ seq_printf(s, "\n - Valid: %d\n - Dirty: %d\n",
+ si->main_area_segs - si->dirty_count -
+ si->prefree_count - si->free_segs,
+ si->dirty_count);
+ seq_printf(s, " - Prefree: %d\n - Free: %d (%d)\n\n",
+ si->prefree_count, si->free_segs, si->free_secs);
+ seq_printf(s, "GC calls: %d (BG: %d)\n",
+ si->call_count, si->bg_gc);
+ seq_printf(s, " - data segments : %d\n", si->data_segs);
+ seq_printf(s, " - node segments : %d\n", si->node_segs);
+ seq_printf(s, "Try to move %d blocks\n", si->tot_blks);
+ seq_printf(s, " - data blocks : %d\n", si->data_blks);
+ seq_printf(s, " - node blocks : %d\n", si->node_blks);
+ seq_printf(s, "\nExtent Hit Ratio: %d / %d\n",
+ si->hit_ext, si->total_ext);
+ seq_printf(s, "\nBalancing F2FS Async:\n");
+ seq_printf(s, " - nodes %4d in %4d\n",
+ si->ndirty_node, si->node_pages);
+ seq_printf(s, " - dents %4d in dirs:%4d\n",
+ si->ndirty_dent, si->ndirty_dirs);
+ seq_printf(s, " - meta %4d in %4d\n",
+ si->ndirty_meta, si->meta_pages);
+ seq_printf(s, " - NATs %5d > %lu\n",
+ si->nats, NM_WOUT_THRESHOLD);
+ seq_printf(s, " - SITs: %5d\n - free_nids: %5d\n",
+ si->sits, si->fnids);
+ seq_printf(s, "\nDistribution of User Blocks:");
+ seq_printf(s, " [ valid | invalid | free ]\n");
+ seq_printf(s, " [");
+
+ for (j = 0; j < si->util_valid; j++)
+ seq_printf(s, "-");
+ seq_printf(s, "|");
+
+ for (j = 0; j < si->util_invalid; j++)
+ seq_printf(s, "-");
+ seq_printf(s, "|");
+
+ for (j = 0; j < si->util_free; j++)
+ seq_printf(s, "-");
+ seq_printf(s, "]\n\n");
+ seq_printf(s, "SSR: %u blocks in %u segments\n",
+ si->block_count[SSR], si->segment_count[SSR]);
+ seq_printf(s, "LFS: %u blocks in %u segments\n",
+ si->block_count[LFS], si->segment_count[LFS]);
+
+ /* segment usage info */
+ update_sit_info(si->sbi);
+ seq_printf(s, "\nBDF: %u, avg. vblocks: %u\n",
+ si->bimodal, si->avg_vblocks);
+
+ /* memory footprint */
+ update_mem_info(si->sbi);
+ seq_printf(s, "\nMemory: %u KB = static: %u + cached: %u\n",
+ (si->base_mem + si->cache_mem) >> 10,
+ si->base_mem >> 10, si->cache_mem >> 10);
+ mutex_unlock(&si->stat_lock);
+ }
+ return 0;
+}
+
+static int stat_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, stat_show, inode->i_private);
+}
+
+static const struct file_operations stat_fops = {
+ .open = stat_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int init_stats(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_super_block *raw_super = F2FS_RAW_SUPER(sbi);
+ struct f2fs_stat_info *si;
+
+ sbi->stat_info = kzalloc(sizeof(struct f2fs_stat_info), GFP_KERNEL);
+ if (!sbi->stat_info)
+ return -ENOMEM;
+
+ si = sbi->stat_info;
+ mutex_init(&si->stat_lock);
+ list_add_tail(&si->stat_list, &f2fs_stat_list);
+
+ si->all_area_segs = le32_to_cpu(raw_super->segment_count);
+ si->sit_area_segs = le32_to_cpu(raw_super->segment_count_sit);
+ si->nat_area_segs = le32_to_cpu(raw_super->segment_count_nat);
+ si->ssa_area_segs = le32_to_cpu(raw_super->segment_count_ssa);
+ si->main_area_segs = le32_to_cpu(raw_super->segment_count_main);
+ si->main_area_sections = le32_to_cpu(raw_super->section_count);
+ si->main_area_zones = si->main_area_sections /
+ le32_to_cpu(raw_super->secs_per_zone);
+ si->sbi = sbi;
+ return 0;
+}
+
+int f2fs_build_stats(struct f2fs_sb_info *sbi)
+{
+ int retval;
+
+ retval = init_stats(sbi);
+ if (retval)
+ return retval;
+
+ if (!debugfs_root)
+ debugfs_root = debugfs_create_dir("f2fs", NULL);
+
+ debugfs_create_file("status", S_IRUGO, debugfs_root, NULL, &stat_fops);
+ return 0;
+}
+
+void f2fs_destroy_stats(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_stat_info *si = sbi->stat_info;
+
+ list_del(&si->stat_list);
+ mutex_lock(&si->stat_lock);
+ si->sbi = NULL;
+ mutex_unlock(&si->stat_lock);
+ kfree(sbi->stat_info);
+}
+
+void destroy_root_stats(void)
+{
+ debugfs_remove_recursive(debugfs_root);
+ debugfs_root = NULL;
+}
--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-10-31 09:50:31

[permalink] [raw]

Subject: [PATCH 17/17] f2fs: update Kconfig and Makefile

This adds Makefile and Kconfig for f2fs, and updates Makefile and Kconfig files
in the fs directory.
In addition, add F2FS_SUPER_MAGIC in magic.h.

Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/Kconfig | 1 +
fs/Makefile | 1 +
fs/f2fs/Kconfig | 52 ++++++++++++++++++++++++++++++++++++++++++++
fs/f2fs/Makefile | 7 ++++++
include/uapi/linux/magic.h | 1 +
5 files changed, 62 insertions(+)
create mode 100644 fs/f2fs/Kconfig
create mode 100644 fs/f2fs/Makefile

diff --git a/fs/Kconfig b/fs/Kconfig
index f95ae3a..e352b37 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -220,6 +220,7 @@ source "fs/pstore/Kconfig"
source "fs/sysv/Kconfig"
source "fs/ufs/Kconfig"
source "fs/exofs/Kconfig"
+source "fs/f2fs/Kconfig"

endif # MISC_FILESYSTEMS

diff --git a/fs/Makefile b/fs/Makefile
index 1d7af79..9d53192 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -123,6 +123,7 @@ obj-$(CONFIG_DEBUG_FS) += debugfs/
obj-$(CONFIG_OCFS2_FS) += ocfs2/
obj-$(CONFIG_BTRFS_FS) += btrfs/
obj-$(CONFIG_GFS2_FS) += gfs2/
+obj-$(CONFIG_F2FS_FS) += f2fs/
obj-y += exofs/ # Multiple modules
obj-$(CONFIG_CEPH_FS) += ceph/
obj-$(CONFIG_PSTORE) += pstore/
diff --git a/fs/f2fs/Kconfig b/fs/f2fs/Kconfig
new file mode 100644
index 0000000..37a6b8e
--- /dev/null
+++ b/fs/f2fs/Kconfig
@@ -0,0 +1,52 @@
+config F2FS_FS
+ tristate "F2FS filesystem support (EXPERIMENTAL)"
+ help
+ F2FS is based on Log-structured File System (LFS), which supports
+ versatile "flash-friendly" features. The design has been focused on
+ addressing the fundamental issues in LFS, which are snowball effect
+ of wandering tree and high cleaning overhead.
+
+ Since flash-based storages show different characteristics according to
+ the internal geometry or flash memory management schemes aka FTL, F2FS
+ and tools support various parameters not only for configuring on-disk
+ layout, but also for selecting allocation and cleaning algorithms.
+
+ If unsure, say N.
+
+config F2FS_STAT_FS
+ bool "F2FS Status Information"
+ depends on F2FS_FS && DEBUG_FS
+ default y
+ help
+ /sys/kernel/debug/f2fs/ contains information about all the partitions
+ mounted as f2fs. Each file shows the whole f2fs information.
+
+ /sys/kernel/debug/f2fs/status includes:
+ - major file system information managed by f2fs currently
+ - average SIT information about whole segments
+ - current memory footprint consumed by f2fs.
+
+config F2FS_FS_XATTR
+ bool "F2FS extended attributes"
+ depends on F2FS_FS
+ default y
+ help
+ Extended attributes are name:value pairs associated with inodes by
+ the kernel or by users (see the attr(5) manual page, or visit
+ <http://acl.bestbits.at/> for details).
+
+ If unsure, say N.
+
+config F2FS_FS_POSIX_ACL
+ bool "F2FS Access Control Lists"
+ depends on F2FS_FS_XATTR
+ select FS_POSIX_ACL
+ default y
+ help
+ Posix Access Control Lists (ACLs) support permissions for users and
+ gourps beyond the owner/group/world scheme.
+
+ To learn more about Access Control Lists, visit the POSIX ACLs for
+ Linux website <http://acl.bestbits.at/>.
+
+ If you don't know what Access Control Lists are, say N
diff --git a/fs/f2fs/Makefile b/fs/f2fs/Makefile
new file mode 100644
index 0000000..27a0820
--- /dev/null
+++ b/fs/f2fs/Makefile
@@ -0,0 +1,7 @@
+obj-$(CONFIG_F2FS_FS) += f2fs.o
+
+f2fs-y := dir.o file.o inode.o namei.o hash.o super.o
+f2fs-y += checkpoint.o gc.o data.o node.o segment.o recovery.o
+f2fs-$(CONFIG_F2FS_STAT_FS) += debug.o
+f2fs-$(CONFIG_F2FS_FS_XATTR) += xattr.o
+f2fs-$(CONFIG_F2FS_FS_POSIX_ACL) += acl.o
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index e15192c..66353ff 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -23,6 +23,7 @@
#define EXT4_SUPER_MAGIC 0xEF53
#define BTRFS_SUPER_MAGIC 0x9123683E
#define NILFS_SUPER_MAGIC 0x3434
+#define F2FS_SUPER_MAGIC 0xF2F52010
#define HPFS_SUPER_MAGIC 0xf995e849
#define ISOFS_SUPER_MAGIC 0x9660
#define JFFS2_SUPER_MAGIC 0x72b6
--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-10-31 09:56:25

[permalink] [raw]

Subject: [PATCH 07/17] f2fs: add segment operations

This adds specific functions not only to manage dirty/free segments, SIT pages,
a cache for SIT entries, and summary entries, but also to allocate free blocks
and write three types of pages: data, node, and meta.

- F2FS maintains three types of bitmaps in memory, which indicate free, prefree,
and dirty segments respectively.

- The key information of an SIT entry consists of a segment number, the number
of valid blocks in the segment, a bitmap to identify there-in valid or invalid
blocks.

- An SIT page is composed of a certain range of SIT entries, which is maintained
by the address space of meta_inode.

- To cache SIT entries, a simple array is used. The index for the array is the
segment number.

- A summary entry for data contains the parent node information. A summary entry
for node contains its node offset from the inode.

- F2FS manages information about six active logs and those summary entries in
memory. Whenever one of them is changed, its summary entries are flushed to
its SIT page maintained by the address space of meta_inode.

- This patch adds a default block allocation function which supports heap-based
allocation policy.

- This patch adds core functions to write data, node, and meta pages. Since LFS
basically produces a series of sequential writes, F2FS merges sequential bios
with a single one as much as possible to reduce the IO scheduling overhead.

Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/segment.c | 1798 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 1798 insertions(+)
create mode 100644 fs/f2fs/segment.c

diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
new file mode 100644
index 0000000..ed7c079
--- /dev/null
+++ b/fs/f2fs/segment.c
@@ -0,0 +1,1798 @@
+/**
+ * fs/f2fs/segment.c
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/fs.h>
+#include <linux/f2fs_fs.h>
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+#include <linux/vmalloc.h>
+
+#include "f2fs.h"
+#include "segment.h"
+#include "node.h"
+
+static int need_to_flush(struct f2fs_sb_info *sbi)
+{
+ unsigned int pages_per_sec = (1 << sbi->log_blocks_per_seg) *
+ sbi->segs_per_sec;
+ int node_secs = ((get_pages(sbi, F2FS_DIRTY_NODES) + pages_per_sec - 1)
+ >> sbi->log_blocks_per_seg) / sbi->segs_per_sec;
+ int dent_secs = ((get_pages(sbi, F2FS_DIRTY_DENTS) + pages_per_sec - 1)
+ >> sbi->log_blocks_per_seg) / sbi->segs_per_sec;
+
+ if (sbi->por_doing)
+ return 0;
+
+ if (free_sections(sbi) <= (node_secs + 2 * dent_secs +
+ reserved_sections(sbi)))
+ return 1;
+ return 0;
+}
+
+/**
+ * This function balances dirty node and dentry pages.
+ * In addition, it controls garbage collection.
+ */
+void f2fs_balance_fs(struct f2fs_sb_info *sbi)
+{
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_ALL,
+ .nr_to_write = LONG_MAX,
+ .for_reclaim = 0,
+ };
+
+ if (sbi->por_doing)
+ return;
+
+ /*
+ * We should do checkpoint when there are so many dirty node pages
+ * with enough free segments. After then, we should do GC.
+ */
+ if (need_to_flush(sbi)) {
+ sync_dirty_dir_inodes(sbi);
+ sync_node_pages(sbi, 0, &wbc);
+ }
+
+ if (has_not_enough_free_secs(sbi)) {
+ mutex_lock(&sbi->gc_mutex);
+ f2fs_gc(sbi, 1);
+ }
+}
+
+static void __locate_dirty_segment(struct f2fs_sb_info *sbi, unsigned int segno,
+ enum dirty_type dirty_type)
+{
+ struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
+
+ /* need not be added */
+ if (IS_CURSEG(sbi, segno))
+ return;
+
+ if (!test_and_set_bit(segno, dirty_i->dirty_segmap[dirty_type]))
+ dirty_i->nr_dirty[dirty_type]++;
+
+ if (dirty_type == DIRTY) {
+ struct seg_entry *sentry = get_seg_entry(sbi, segno);
+ dirty_type = sentry->type;
+ if (!test_and_set_bit(segno, dirty_i->dirty_segmap[dirty_type]))
+ dirty_i->nr_dirty[dirty_type]++;
+ }
+}
+
+static void __remove_dirty_segment(struct f2fs_sb_info *sbi, unsigned int segno,
+ enum dirty_type dirty_type)
+{
+ struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
+
+ if (test_and_clear_bit(segno, dirty_i->dirty_segmap[dirty_type]))
+ dirty_i->nr_dirty[dirty_type]--;
+
+ if (dirty_type == DIRTY) {
+ struct seg_entry *sentry = get_seg_entry(sbi, segno);
+ dirty_type = sentry->type;
+ if (test_and_clear_bit(segno,
+ dirty_i->dirty_segmap[dirty_type]))
+ dirty_i->nr_dirty[dirty_type]--;
+ clear_bit(segno, dirty_i->victim_segmap[FG_GC]);
+ clear_bit(segno, dirty_i->victim_segmap[BG_GC]);
+ }
+}
+
+/**
+ * Should not occur error such as -ENOMEM.
+ * Adding dirty entry into seglist is not critical operation.
+ * If a given segment is one of current working segments, it won't be added.
+ */
+void locate_dirty_segment(struct f2fs_sb_info *sbi, unsigned int segno)
+{
+ struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
+ unsigned short valid_blocks;
+
+ if (segno == NULL_SEGNO || IS_CURSEG(sbi, segno))
+ return;
+
+ mutex_lock(&dirty_i->seglist_lock);
+
+ valid_blocks = get_valid_blocks(sbi, segno, 0);
+
+ if (valid_blocks == 0) {
+ __locate_dirty_segment(sbi, segno, PRE);
+ __remove_dirty_segment(sbi, segno, DIRTY);
+ } else if (valid_blocks < sbi->blocks_per_seg) {
+ __locate_dirty_segment(sbi, segno, DIRTY);
+ } else {
+ /* Recovery routine with SSR needs this */
+ __remove_dirty_segment(sbi, segno, DIRTY);
+ }
+
+ mutex_unlock(&dirty_i->seglist_lock);
+ return;
+}
+
+/**
+ * Should call clear_prefree_segments after checkpoint is done.
+ */
+static void set_prefree_as_free_segments(struct f2fs_sb_info *sbi)
+{
+ struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
+ unsigned int segno, offset = 0;
+ unsigned int total_segs = TOTAL_SEGS(sbi);
+
+ mutex_lock(&dirty_i->seglist_lock);
+ while (1) {
+ segno = find_next_bit(dirty_i->dirty_segmap[PRE], total_segs,
+ offset);
+ if (segno >= total_segs)
+ break;
+ __set_test_and_free(sbi, segno);
+ offset = segno + 1;
+ }
+ mutex_unlock(&dirty_i->seglist_lock);
+}
+
+void clear_prefree_segments(struct f2fs_sb_info *sbi)
+{
+ struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
+ unsigned int segno, offset = 0;
+ unsigned int total_segs = TOTAL_SEGS(sbi);
+
+ mutex_lock(&dirty_i->seglist_lock);
+ while (1) {
+ segno = find_next_bit(dirty_i->dirty_segmap[PRE], total_segs,
+ offset);
+ if (segno >= total_segs)
+ break;
+
+ offset = segno + 1;
+ if (test_and_clear_bit(segno, dirty_i->dirty_segmap[PRE]))
+ dirty_i->nr_dirty[PRE]--;
+
+ /* Let's use trim */
+ if (test_opt(sbi, DISCARD))
+ blkdev_issue_discard(sbi->sb->s_bdev,
+ START_BLOCK(sbi, segno) <<
+ sbi->log_sectors_per_block,
+ 1 << (sbi->log_sectors_per_block +
+ sbi->log_blocks_per_seg),
+ GFP_NOFS, 0);
+ }
+ mutex_unlock(&dirty_i->seglist_lock);
+}
+
+static void __mark_sit_entry_dirty(struct f2fs_sb_info *sbi, unsigned int segno)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ if (!__test_and_set_bit(segno, sit_i->dirty_sentries_bitmap))
+ sit_i->dirty_sentries++;
+}
+
+static void __set_sit_entry_type(struct f2fs_sb_info *sbi, int type,
+ unsigned int segno, int modified)
+{
+ struct seg_entry *se = get_seg_entry(sbi, segno);
+ se->type = type;
+ if (modified)
+ __mark_sit_entry_dirty(sbi, segno);
+}
+
+static void update_sit_entry(struct f2fs_sb_info *sbi, block_t blkaddr, int del)
+{
+ struct seg_entry *se;
+ unsigned int segno, offset;
+ long int new_vblocks;
+
+ segno = GET_SEGNO(sbi, blkaddr);
+
+ se = get_seg_entry(sbi, segno);
+ new_vblocks = se->valid_blocks + del;
+ offset = GET_SEGOFF_FROM_SEG0(sbi, blkaddr) & (sbi->blocks_per_seg - 1);
+
+ BUG_ON((new_vblocks >> (sizeof(unsigned short) << 3) ||
+ (new_vblocks > sbi->blocks_per_seg)));
+
+ se->valid_blocks = new_vblocks;
+ se->mtime = get_mtime(sbi);
+ SIT_I(sbi)->max_mtime = se->mtime;
+
+ /* Update valid block bitmap */
+ if (del > 0) {
+ if (f2fs_set_bit(offset, se->cur_valid_map))
+ BUG();
+ } else {
+ if (!f2fs_clear_bit(offset, se->cur_valid_map))
+ BUG();
+ }
+ if (!f2fs_test_bit(offset, se->ckpt_valid_map))
+ se->ckpt_valid_blocks += del;
+
+ __mark_sit_entry_dirty(sbi, segno);
+
+ /* update total number of valid blocks to be written in ckpt area */
+ SIT_I(sbi)->written_valid_blocks += del;
+
+ if (sbi->segs_per_sec > 1)
+ get_sec_entry(sbi, segno)->valid_blocks += del;
+}
+
+static void refresh_sit_entry(struct f2fs_sb_info *sbi,
+ block_t old_blkaddr, block_t new_blkaddr)
+{
+ update_sit_entry(sbi, new_blkaddr, 1);
+ if (GET_SEGNO(sbi, old_blkaddr) != NULL_SEGNO)
+ update_sit_entry(sbi, old_blkaddr, -1);
+}
+
+void invalidate_blocks(struct f2fs_sb_info *sbi, block_t addr)
+{
+ unsigned int segno = GET_SEGNO(sbi, addr);
+ struct sit_info *sit_i = SIT_I(sbi);
+
+ BUG_ON(addr == NULL_ADDR);
+ if (addr == NEW_ADDR)
+ return;
+
+ /* add it into sit main buffer */
+ mutex_lock(&sit_i->sentry_lock);
+
+ update_sit_entry(sbi, addr, -1);
+
+ /* add it into dirty seglist */
+ locate_dirty_segment(sbi, segno);
+
+ mutex_unlock(&sit_i->sentry_lock);
+}
+
+/**
+ * This function should be resided under the curseg_mutex lock
+ */
+static void __add_sum_entry(struct f2fs_sb_info *sbi, int type,
+ struct f2fs_summary *sum, unsigned short offset)
+{
+ struct curseg_info *curseg = CURSEG_I(sbi, type);
+ void *addr = curseg->sum_blk;
+ addr += offset * sizeof(struct f2fs_summary);
+ memcpy(addr, sum, sizeof(struct f2fs_summary));
+ return;
+}
+
+/**
+ * Calculate the number of current summary pages for writing
+ */
+int npages_for_summary_flush(struct f2fs_sb_info *sbi)
+{
+ int total_size_bytes = 0;
+ int valid_sum_count = 0;
+ int i, sum_space;
+
+ for (i = CURSEG_HOT_DATA; i <= CURSEG_COLD_DATA; i++) {
+ if (sbi->ckpt->alloc_type[i] == SSR)
+ valid_sum_count += sbi->blocks_per_seg;
+ else
+ valid_sum_count += curseg_blkoff(sbi, i);
+ }
+
+ total_size_bytes = valid_sum_count * (SUMMARY_SIZE + 1)
+ + sizeof(struct nat_journal) + 2
+ + sizeof(struct sit_journal) + 2;
+ sum_space = PAGE_CACHE_SIZE - SUM_FOOTER_SIZE;
+ if (total_size_bytes < sum_space)
+ return 1;
+ else if (total_size_bytes < 2 * sum_space)
+ return 2;
+ return 3;
+}
+
+/**
+ * Caller should put this summary page
+ */
+struct page *get_sum_page(struct f2fs_sb_info *sbi, unsigned int segno)
+{
+ return get_meta_page(sbi, GET_SUM_BLOCK(sbi, segno));
+}
+
+static void write_sum_page(struct f2fs_sb_info *sbi,
+ struct f2fs_summary_block *sum_blk, block_t blk_addr)
+{
+ struct page *page = grab_meta_page(sbi, blk_addr);
+ void *kaddr = page_address(page);
+ memcpy(kaddr, sum_blk, PAGE_CACHE_SIZE);
+ set_page_dirty(page);
+ f2fs_put_page(page, 1);
+}
+
+static unsigned int check_prefree_segments(struct f2fs_sb_info *sbi,
+ int ofs_unit, int type)
+{
+ struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
+ unsigned long *prefree_segmap = dirty_i->dirty_segmap[PRE];
+ unsigned int segno, next_segno, i;
+ int ofs = 0;
+
+ /*
+ * If there is not enough reserved sections,
+ * we should not reuse prefree segments.
+ */
+ if (has_not_enough_free_secs(sbi))
+ return NULL_SEGNO;
+
+ /*
+ * NODE page should not reuse prefree segment,
+ * since those information is used for SPOR.
+ */
+ if (IS_NODESEG(type))
+ return NULL_SEGNO;
+next:
+ segno = find_next_bit(prefree_segmap, TOTAL_SEGS(sbi), ofs++);
+ ofs = ((segno / ofs_unit) * ofs_unit) + ofs_unit;
+ if (segno < TOTAL_SEGS(sbi)) {
+ /* skip intermediate segments in a section */
+ if (segno % ofs_unit)
+ goto next;
+
+ /* skip if whole section is not prefree */
+ next_segno = find_next_zero_bit(prefree_segmap,
+ TOTAL_SEGS(sbi), segno + 1);
+ if (next_segno - segno < ofs_unit)
+ goto next;
+
+ /* skip if whole section was not free at the last checkpoint */
+ for (i = 0; i < ofs_unit; i++)
+ if (get_seg_entry(sbi, segno)->ckpt_valid_blocks)
+ goto next;
+ return segno;
+ }
+ return NULL_SEGNO;
+}
+
+/**
+ * Find a new segment from the free segments bitmap to right order
+ * This function should be returned with success, otherwise BUG
+ */
+static void get_new_segment(struct f2fs_sb_info *sbi,
+ unsigned int *newseg, bool new_sec, int dir)
+{
+ struct free_segmap_info *free_i = FREE_I(sbi);
+ unsigned int total_secs = sbi->total_sections;
+ unsigned int segno, secno, zoneno;
+ unsigned int total_zones = sbi->total_sections / sbi->secs_per_zone;
+ unsigned int hint = *newseg / sbi->segs_per_sec;
+ unsigned int old_zoneno = GET_ZONENO_FROM_SEGNO(sbi, *newseg);
+ unsigned int left_start = hint;
+ bool init = true;
+ int go_left = 0;
+ int i;
+
+ write_lock(&free_i->segmap_lock);
+
+ if (!new_sec && ((*newseg + 1) % sbi->segs_per_sec)) {
+ segno = find_next_zero_bit(free_i->free_segmap,
+ TOTAL_SEGS(sbi), *newseg + 1);
+ if (segno < TOTAL_SEGS(sbi))
+ goto got_it;
+ }
+find_other_zone:
+ secno = find_next_zero_bit(free_i->free_secmap, total_secs, hint);
+ if (secno >= total_secs) {
+ if (dir == ALLOC_RIGHT) {
+ secno = find_next_zero_bit(free_i->free_secmap,
+ total_secs, 0);
+ BUG_ON(secno >= total_secs);
+ } else {
+ go_left = 1;
+ left_start = hint - 1;
+ }
+ }
+ if (go_left == 0)
+ goto skip_left;
+
+ while (test_bit(left_start, free_i->free_secmap)) {
+ if (left_start > 0) {
+ left_start--;
+ continue;
+ }
+ left_start = find_next_zero_bit(free_i->free_secmap,
+ total_secs, 0);
+ BUG_ON(left_start >= total_secs);
+ break;
+ }
+ secno = left_start;
+skip_left:
+ hint = secno;
+ segno = secno * sbi->segs_per_sec;
+ zoneno = secno / sbi->secs_per_zone;
+
+ /* give up on finding another zone */
+ if (!init)
+ goto got_it;
+ if (sbi->secs_per_zone == 1)
+ goto got_it;
+ if (zoneno == old_zoneno)
+ goto got_it;
+ if (dir == ALLOC_LEFT) {
+ if (!go_left && zoneno + 1 >= total_zones)
+ goto got_it;
+ if (go_left && zoneno == 0)
+ goto got_it;
+ }
+ for (i = 0; i < NR_CURSEG_TYPE; i++)
+ if (CURSEG_I(sbi, i)->zone == zoneno)
+ break;
+
+ if (i < NR_CURSEG_TYPE) {
+ /* zone is in user, try another */
+ if (go_left)
+ hint = zoneno * sbi->secs_per_zone - 1;
+ else if (zoneno + 1 >= total_zones)
+ hint = 0;
+ else
+ hint = (zoneno + 1) * sbi->secs_per_zone;
+ init = false;
+ goto find_other_zone;
+ }
+got_it:
+ /* set it as dirty segment in free segmap */
+ BUG_ON(test_bit(segno, free_i->free_segmap));
+ __set_inuse(sbi, segno);
+ *newseg = segno;
+ write_unlock(&free_i->segmap_lock);
+}
+
+static void reset_curseg(struct f2fs_sb_info *sbi, int type, int modified)
+{
+ struct curseg_info *curseg = CURSEG_I(sbi, type);
+ struct summary_footer *sum_footer;
+
+ curseg->segno = curseg->next_segno;
+ curseg->zone = GET_ZONENO_FROM_SEGNO(sbi, curseg->segno);
+ curseg->next_blkoff = 0;
+ curseg->next_segno = NULL_SEGNO;
+
+ sum_footer = &(curseg->sum_blk->footer);
+ memset(sum_footer, 0, sizeof(struct summary_footer));
+ if (IS_DATASEG(type))
+ SET_SUM_TYPE(sum_footer, SUM_TYPE_DATA);
+ if (IS_NODESEG(type))
+ SET_SUM_TYPE(sum_footer, SUM_TYPE_NODE);
+ __set_sit_entry_type(sbi, type, curseg->segno, modified);
+}
+
+/**
+ * Allocate a current working segment.
+ * This function always allocates a free segment in LFS manner.
+ */
+static void new_curseg(struct f2fs_sb_info *sbi, int type, bool new_sec)
+{
+ struct curseg_info *curseg = CURSEG_I(sbi, type);
+ unsigned int segno = curseg->segno;
+ int dir = ALLOC_LEFT;
+
+ write_sum_page(sbi, curseg->sum_blk,
+ GET_SUM_BLOCK(sbi, curseg->segno));
+ if (type == CURSEG_WARM_DATA || type == CURSEG_COLD_DATA)
+ dir = ALLOC_RIGHT;
+
+ if (test_opt(sbi, NOHEAP))
+ dir = ALLOC_RIGHT;
+
+ get_new_segment(sbi, &segno, new_sec, dir);
+ curseg->next_segno = segno;
+ reset_curseg(sbi, type, 1);
+ curseg->alloc_type = LFS;
+}
+
+static void __next_free_blkoff(struct f2fs_sb_info *sbi,
+ struct curseg_info *seg, block_t start)
+{
+ struct seg_entry *se = get_seg_entry(sbi, seg->segno);
+ block_t ofs;
+ for (ofs = start; ofs < sbi->blocks_per_seg; ofs++) {
+ if (!f2fs_test_bit(ofs, se->ckpt_valid_map)
+ && !f2fs_test_bit(ofs, se->cur_valid_map))
+ break;
+ }
+ seg->next_blkoff = ofs;
+}
+
+/**
+ * If a segment is written by LFS manner, next block offset is just obtained
+ * by increasing the current block offset. However, if a segment is written by
+ * SSR manner, next block offset obtained by calling __next_free_blkoff
+ */
+static void __refresh_next_blkoff(struct f2fs_sb_info *sbi,
+ struct curseg_info *seg)
+{
+ if (seg->alloc_type == SSR)
+ __next_free_blkoff(sbi, seg, seg->next_blkoff + 1);
+ else
+ seg->next_blkoff++;
+}
+
+/**
+ * This function always allocates a used segment (from dirty seglist) by SSR
+ * manner, so it should recover the existing segment information of valid blocks
+ */
+static void change_curseg(struct f2fs_sb_info *sbi, int type, bool reuse)
+{
+ struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
+ struct curseg_info *curseg = CURSEG_I(sbi, type);
+ unsigned int new_segno = curseg->next_segno;
+ struct f2fs_summary_block *sum_node;
+ struct page *sum_page;
+
+ write_sum_page(sbi, curseg->sum_blk,
+ GET_SUM_BLOCK(sbi, curseg->segno));
+ __set_test_and_inuse(sbi, new_segno);
+
+ mutex_lock(&dirty_i->seglist_lock);
+ __remove_dirty_segment(sbi, new_segno, PRE);
+ __remove_dirty_segment(sbi, new_segno, DIRTY);
+ mutex_unlock(&dirty_i->seglist_lock);
+
+ reset_curseg(sbi, type, 1);
+ curseg->alloc_type = SSR;
+ __next_free_blkoff(sbi, curseg, 0);
+
+ if (reuse) {
+ sum_page = get_sum_page(sbi, new_segno);
+ sum_node = (struct f2fs_summary_block *)page_address(sum_page);
+ memcpy(curseg->sum_blk, sum_node, SUM_ENTRY_SIZE);
+ f2fs_put_page(sum_page, 1);
+ }
+}
+
+/*
+ * flush out current segment and replace it with new segment
+ * This function should be returned with success, otherwise BUG
+ */
+static void allocate_segment_by_default(struct f2fs_sb_info *sbi,
+ int type, bool force)
+{
+ struct curseg_info *curseg = CURSEG_I(sbi, type);
+ unsigned int ofs_unit;
+
+ if (force) {
+ new_curseg(sbi, type, true);
+ goto out;
+ }
+
+ ofs_unit = need_SSR(sbi) ? 1 : sbi->segs_per_sec;
+ curseg->next_segno = check_prefree_segments(sbi, ofs_unit, type);
+
+ if (curseg->next_segno != NULL_SEGNO)
+ change_curseg(sbi, type, false);
+ else if (type == CURSEG_WARM_NODE)
+ new_curseg(sbi, type, false);
+ else if (need_SSR(sbi) && get_ssr_segment(sbi, type))
+ change_curseg(sbi, type, true);
+ else
+ new_curseg(sbi, type, false);
+out:
+ sbi->segment_count[curseg->alloc_type]++;
+}
+
+void allocate_new_segments(struct f2fs_sb_info *sbi)
+{
+ struct curseg_info *curseg;
+ unsigned int old_curseg;
+ int i;
+
+ for (i = CURSEG_HOT_DATA; i <= CURSEG_COLD_DATA; i++) {
+ curseg = CURSEG_I(sbi, i);
+ old_curseg = curseg->segno;
+ SIT_I(sbi)->s_ops->allocate_segment(sbi, i, true);
+ locate_dirty_segment(sbi, old_curseg);
+ }
+}
+
+static const struct segment_allocation default_salloc_ops = {
+ .allocate_segment = allocate_segment_by_default,
+};
+
+static void f2fs_end_io_write(struct bio *bio, int err)
+{
+ const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+ struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
+ struct bio_private *p = bio->bi_private;
+
+ do {
+ struct page *page = bvec->bv_page;
+
+ if (--bvec >= bio->bi_io_vec)
+ prefetchw(&bvec->bv_page->flags);
+ if (!uptodate) {
+ SetPageError(page);
+ if (page->mapping)
+ set_bit(AS_EIO, &page->mapping->flags);
+ p->sbi->ckpt->ckpt_flags |= CP_ERROR_FLAG;
+ set_page_dirty(page);
+ }
+ end_page_writeback(page);
+ dec_page_count(p->sbi, F2FS_WRITEBACK);
+ } while (bvec >= bio->bi_io_vec);
+
+ if (p->is_sync)
+ complete(p->wait);
+ kfree(p);
+ bio_put(bio);
+}
+
+struct bio *f2fs_bio_alloc(struct block_device *bdev, sector_t first_sector,
+ int nr_vecs, gfp_t gfp_flags)
+{
+ struct bio *bio;
+repeat:
+ /* allocate new bio */
+ bio = bio_alloc(gfp_flags, nr_vecs);
+
+ if (bio == NULL && (current->flags & PF_MEMALLOC)) {
+ while (!bio && (nr_vecs /= 2))
+ bio = bio_alloc(gfp_flags, nr_vecs);
+ }
+ if (bio) {
+ bio->bi_bdev = bdev;
+ bio->bi_sector = first_sector;
+retry:
+ bio->bi_private = kmalloc(sizeof(struct bio_private),
+ GFP_NOFS | __GFP_HIGH);
+ if (!bio->bi_private) {
+ cond_resched();
+ goto retry;
+ }
+ }
+ if (bio == NULL) {
+ cond_resched();
+ goto repeat;
+ }
+ return bio;
+}
+
+static void do_submit_bio(struct f2fs_sb_info *sbi,
+ enum page_type type, bool sync)
+{
+ int rw = sync ? WRITE_SYNC : WRITE;
+ enum page_type btype = type > META ? META : type;
+
+ if (type >= META_FLUSH)
+ rw = WRITE_FLUSH_FUA;
+
+ if (sbi->bio[btype]) {
+ struct bio_private *p = sbi->bio[btype]->bi_private;
+ p->sbi = sbi;
+ sbi->bio[btype]->bi_end_io = f2fs_end_io_write;
+ if (type == META_FLUSH) {
+ DECLARE_COMPLETION_ONSTACK(wait);
+ p->is_sync = true;
+ p->wait = &wait;
+ submit_bio(rw, sbi->bio[btype]);
+ wait_for_completion(&wait);
+ } else {
+ p->is_sync = false;
+ submit_bio(rw, sbi->bio[btype]);
+ }
+ sbi->bio[btype] = NULL;
+ }
+}
+
+void f2fs_submit_bio(struct f2fs_sb_info *sbi, enum page_type type, bool sync)
+{
+ down_write(&sbi->bio_sem);
+ do_submit_bio(sbi, type, sync);
+ up_write(&sbi->bio_sem);
+}
+
+static void submit_write_page(struct f2fs_sb_info *sbi, struct page *page,
+ block_t blk_addr, enum page_type type)
+{
+ struct block_device *bdev = sbi->sb->s_bdev;
+
+ verify_block_addr(sbi, blk_addr);
+
+ down_write(&sbi->bio_sem);
+
+ inc_page_count(sbi, F2FS_WRITEBACK);
+
+ if (sbi->bio[type] && sbi->last_block_in_bio[type] != blk_addr - 1)
+ do_submit_bio(sbi, type, false);
+alloc_new:
+ if (sbi->bio[type] == NULL)
+ sbi->bio[type] = f2fs_bio_alloc(bdev,
+ blk_addr << (sbi->log_blocksize - 9),
+ bio_get_nr_vecs(bdev), GFP_NOFS | __GFP_HIGH);
+
+ if (bio_add_page(sbi->bio[type], page, PAGE_CACHE_SIZE, 0) <
+ PAGE_CACHE_SIZE) {
+ do_submit_bio(sbi, type, false);
+ goto alloc_new;
+ }
+
+ sbi->last_block_in_bio[type] = blk_addr;
+
+ up_write(&sbi->bio_sem);
+}
+
+static bool __has_curseg_space(struct f2fs_sb_info *sbi, int type)
+{
+ struct curseg_info *curseg = CURSEG_I(sbi, type);
+ if (curseg->next_blkoff < sbi->blocks_per_seg)
+ return true;
+ return false;
+}
+
+static int __get_segment_type_2(struct page *page, enum page_type p_type)
+{
+ if (p_type == DATA)
+ return CURSEG_HOT_DATA;
+ else
+ return CURSEG_HOT_NODE;
+}
+
+static int __get_segment_type_4(struct page *page, enum page_type p_type)
+{
+ if (p_type == DATA) {
+ struct inode *inode = page->mapping->host;
+
+ if (S_ISDIR(inode->i_mode))
+ return CURSEG_HOT_DATA;
+ else
+ return CURSEG_COLD_DATA;
+ } else {
+ if (IS_DNODE(page) && !is_cold_node(page))
+ return CURSEG_HOT_NODE;
+ else
+ return CURSEG_COLD_NODE;
+ }
+}
+
+static int __get_segment_type_6(struct page *page, enum page_type p_type)
+{
+ if (p_type == DATA) {
+ struct inode *inode = page->mapping->host;
+
+ if (S_ISDIR(inode->i_mode))
+ return CURSEG_HOT_DATA;
+ else if (is_cold_data(page) || is_cold_file(inode))
+ return CURSEG_COLD_DATA;
+ else
+ return CURSEG_WARM_DATA;
+ } else {
+ if (IS_DNODE(page))
+ return is_cold_node(page) ? CURSEG_WARM_NODE :
+ CURSEG_HOT_NODE;
+ else
+ return CURSEG_COLD_NODE;
+ }
+}
+
+static int __get_segment_type(struct page *page, enum page_type p_type)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(page->mapping->host->i_sb);
+ switch (sbi->active_logs) {
+ case 2:
+ return __get_segment_type_2(page, p_type);
+ case 4:
+ return __get_segment_type_4(page, p_type);
+ case 6:
+ return __get_segment_type_6(page, p_type);
+ default:
+ BUG();
+ }
+}
+
+static void do_write_page(struct f2fs_sb_info *sbi, struct page *page,
+ block_t old_blkaddr, block_t *new_blkaddr,
+ struct f2fs_summary *sum, enum page_type p_type)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ struct curseg_info *curseg;
+ unsigned int old_cursegno;
+ int type;
+
+ type = __get_segment_type(page, p_type);
+ curseg = CURSEG_I(sbi, type);
+
+ mutex_lock(&curseg->curseg_mutex);
+
+ *new_blkaddr = NEXT_FREE_BLKADDR(sbi, curseg);
+ old_cursegno = curseg->segno;
+
+ /*
+ * __add_sum_entry should be resided under the curseg_mutex
+ * because, this function updates a summary entry in the
+ * current summary block.
+ */
+ __add_sum_entry(sbi, type, sum, curseg->next_blkoff);
+
+ mutex_lock(&sit_i->sentry_lock);
+ __refresh_next_blkoff(sbi, curseg);
+ sbi->block_count[curseg->alloc_type]++;
+
+ /*
+ * SIT information should be updated before segment allocation,
+ * since SSR needs latest valid block information.
+ */
+ refresh_sit_entry(sbi, old_blkaddr, *new_blkaddr);
+
+ if (!__has_curseg_space(sbi, type))
+ sit_i->s_ops->allocate_segment(sbi, type, false);
+
+ locate_dirty_segment(sbi, old_cursegno);
+ locate_dirty_segment(sbi, GET_SEGNO(sbi, old_blkaddr));
+ mutex_unlock(&sit_i->sentry_lock);
+
+ if (p_type == NODE)
+ fill_node_footer_blkaddr(page, NEXT_FREE_BLKADDR(sbi, curseg));
+
+ /* writeout dirty page into bdev */
+ submit_write_page(sbi, page, *new_blkaddr, p_type);
+
+ mutex_unlock(&curseg->curseg_mutex);
+}
+
+int write_meta_page(struct f2fs_sb_info *sbi, struct page *page,
+ struct writeback_control *wbc)
+{
+ if (wbc->for_reclaim)
+ return AOP_WRITEPAGE_ACTIVATE;
+
+ set_page_writeback(page);
+ submit_write_page(sbi, page, page->index, META);
+ return 0;
+}
+
+void write_node_page(struct f2fs_sb_info *sbi, struct page *page,
+ unsigned int nid, block_t old_blkaddr, block_t *new_blkaddr)
+{
+ struct f2fs_summary sum;
+ set_summary(&sum, nid, 0, 0);
+ do_write_page(sbi, page, old_blkaddr, new_blkaddr, &sum, NODE);
+}
+
+void write_data_page(struct inode *inode, struct page *page,
+ struct dnode_of_data *dn, block_t old_blkaddr,
+ block_t *new_blkaddr)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ struct f2fs_summary sum;
+ struct node_info ni;
+
+ BUG_ON(old_blkaddr == NULL_ADDR);
+ get_node_info(sbi, dn->nid, &ni);
+ set_summary(&sum, dn->nid, dn->ofs_in_node, ni.version);
+
+ do_write_page(sbi, page, old_blkaddr,
+ new_blkaddr, &sum, DATA);
+}
+
+void rewrite_data_page(struct f2fs_sb_info *sbi, struct page *page,
+ block_t old_blk_addr)
+{
+ submit_write_page(sbi, page, old_blk_addr, DATA);
+}
+
+void recover_data_page(struct f2fs_sb_info *sbi,
+ struct page *page, struct f2fs_summary *sum,
+ block_t old_blkaddr, block_t new_blkaddr)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ struct curseg_info *curseg;
+ unsigned int segno, old_cursegno;
+ struct seg_entry *se;
+ int type;
+
+ segno = GET_SEGNO(sbi, new_blkaddr);
+ se = get_seg_entry(sbi, segno);
+ type = se->type;
+
+ if (se->valid_blocks == 0 && !IS_CURSEG(sbi, segno)) {
+ if (old_blkaddr == NULL_ADDR)
+ type = CURSEG_COLD_DATA;
+ else
+ type = CURSEG_WARM_DATA;
+ }
+ curseg = CURSEG_I(sbi, type);
+
+ mutex_lock(&curseg->curseg_mutex);
+ mutex_lock(&sit_i->sentry_lock);
+
+ old_cursegno = curseg->segno;
+
+ /* change the current segment */
+ if (segno != curseg->segno) {
+ curseg->next_segno = segno;
+ change_curseg(sbi, type, true);
+ }
+
+ curseg->next_blkoff = GET_SEGOFF_FROM_SEG0(sbi, new_blkaddr) &
+ (sbi->blocks_per_seg - 1);
+ __add_sum_entry(sbi, type, sum, curseg->next_blkoff);
+
+ refresh_sit_entry(sbi, old_blkaddr, new_blkaddr);
+
+ locate_dirty_segment(sbi, old_cursegno);
+ locate_dirty_segment(sbi, GET_SEGNO(sbi, old_blkaddr));
+
+ mutex_unlock(&sit_i->sentry_lock);
+ mutex_unlock(&curseg->curseg_mutex);
+}
+
+void rewrite_node_page(struct f2fs_sb_info *sbi,
+ struct page *page, struct f2fs_summary *sum,
+ block_t old_blkaddr, block_t new_blkaddr)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ int type = CURSEG_WARM_NODE;
+ struct curseg_info *curseg;
+ unsigned int segno, old_cursegno;
+ block_t next_blkaddr = next_blkaddr_of_node(page);
+ unsigned int next_segno = GET_SEGNO(sbi, next_blkaddr);
+
+ curseg = CURSEG_I(sbi, type);
+
+ mutex_lock(&curseg->curseg_mutex);
+ mutex_lock(&sit_i->sentry_lock);
+
+ segno = GET_SEGNO(sbi, new_blkaddr);
+ old_cursegno = curseg->segno;
+
+ /* change the current segment */
+ if (segno != curseg->segno) {
+ curseg->next_segno = segno;
+ change_curseg(sbi, type, true);
+ }
+ curseg->next_blkoff = GET_SEGOFF_FROM_SEG0(sbi, new_blkaddr) &
+ (sbi->blocks_per_seg - 1);
+ __add_sum_entry(sbi, type, sum, curseg->next_blkoff);
+
+ /* change the current log to the next block addr in advance */
+ if (next_segno != segno) {
+ curseg->next_segno = next_segno;
+ change_curseg(sbi, type, true);
+ }
+ curseg->next_blkoff = GET_SEGOFF_FROM_SEG0(sbi, next_blkaddr) &
+ (sbi->blocks_per_seg - 1);
+
+ /* rewrite node page */
+ set_page_writeback(page);
+ submit_write_page(sbi, page, new_blkaddr, NODE);
+ f2fs_submit_bio(sbi, NODE, true);
+ refresh_sit_entry(sbi, old_blkaddr, new_blkaddr);
+
+ locate_dirty_segment(sbi, old_cursegno);
+ locate_dirty_segment(sbi, GET_SEGNO(sbi, old_blkaddr));
+
+ mutex_unlock(&sit_i->sentry_lock);
+ mutex_unlock(&curseg->curseg_mutex);
+}
+
+static int read_compacted_summaries(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
+ struct curseg_info *seg_i;
+ unsigned char *kaddr;
+ struct page *page;
+ block_t start;
+ int i, j, offset;
+
+ start = start_sum_block(sbi);
+
+ page = get_meta_page(sbi, start++);
+ kaddr = (unsigned char *)page_address(page);
+
+ /* Step 1: restore nat cache */
+ seg_i = CURSEG_I(sbi, CURSEG_HOT_DATA);
+ memcpy(&seg_i->sum_blk->n_nats, kaddr, SUM_JOURNAL_SIZE);
+
+ /* Step 2: restore sit cache */
+ seg_i = CURSEG_I(sbi, CURSEG_COLD_DATA);
+ memcpy(&seg_i->sum_blk->n_sits, kaddr + SUM_JOURNAL_SIZE,
+ SUM_JOURNAL_SIZE);
+ offset = 2 * SUM_JOURNAL_SIZE;
+
+ /* Step 3: restore summary entries */
+ for (i = CURSEG_HOT_DATA; i <= CURSEG_COLD_DATA; i++) {
+ unsigned short blk_off;
+ unsigned int segno;
+
+ seg_i = CURSEG_I(sbi, i);
+ segno = le32_to_cpu(ckpt->cur_data_segno[i]);
+ blk_off = le16_to_cpu(ckpt->cur_data_blkoff[i]);
+ seg_i->next_segno = segno;
+ reset_curseg(sbi, i, 0);
+ seg_i->alloc_type = ckpt->alloc_type[i];
+ seg_i->next_blkoff = blk_off;
+
+ if (seg_i->alloc_type == SSR)
+ blk_off = sbi->blocks_per_seg;
+
+ for (j = 0; j < blk_off; j++) {
+ struct f2fs_summary *s;
+ s = (struct f2fs_summary *)(kaddr + offset);
+ seg_i->sum_blk->entries[j] = *s;
+ offset += SUMMARY_SIZE;
+ if (offset + SUMMARY_SIZE <= PAGE_CACHE_SIZE -
+ SUM_FOOTER_SIZE)
+ continue;
+
+ f2fs_put_page(page, 1);
+ page = NULL;
+
+ page = get_meta_page(sbi, start++);
+ kaddr = (unsigned char *)page_address(page);
+ offset = 0;
+ }
+ }
+ f2fs_put_page(page, 1);
+ return 0;
+}
+
+static int read_normal_summaries(struct f2fs_sb_info *sbi, int type)
+{
+ struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
+ struct f2fs_summary_block *sum;
+ struct curseg_info *curseg;
+ struct page *new;
+ unsigned short blk_off;
+ unsigned int segno = 0;
+ block_t blk_addr = 0;
+
+ /* get segment number and block addr */
+ if (IS_DATASEG(type)) {
+ segno = le32_to_cpu(ckpt->cur_data_segno[type]);
+ blk_off = le16_to_cpu(ckpt->cur_data_blkoff[type -
+ CURSEG_HOT_DATA]);
+ if (ckpt->ckpt_flags & CP_UMOUNT_FLAG)
+ blk_addr = sum_blk_addr(sbi, NR_CURSEG_TYPE, type);
+ else
+ blk_addr = sum_blk_addr(sbi, NR_CURSEG_DATA_TYPE, type);
+ } else {
+ segno = le32_to_cpu(ckpt->cur_node_segno[type -
+ CURSEG_HOT_NODE]);
+ blk_off = le16_to_cpu(ckpt->cur_node_blkoff[type -
+ CURSEG_HOT_NODE]);
+ if (ckpt->ckpt_flags & CP_UMOUNT_FLAG)
+ blk_addr = sum_blk_addr(sbi, NR_CURSEG_NODE_TYPE,
+ type - CURSEG_HOT_NODE);
+ else
+ blk_addr = GET_SUM_BLOCK(sbi, segno);
+ }
+
+ new = get_meta_page(sbi, blk_addr);
+ sum = (struct f2fs_summary_block *)page_address(new);
+
+ if (IS_NODESEG(type)) {
+ if (ckpt->ckpt_flags & CP_UMOUNT_FLAG) {
+ struct f2fs_summary *ns = &sum->entries[0];
+ int i;
+ for (i = 0; i < sbi->blocks_per_seg; i++, ns++) {
+ ns->version = 0;
+ ns->ofs_in_node = 0;
+ }
+ } else {
+ if (restore_node_summary(sbi, segno, sum)) {
+ f2fs_put_page(new, 1);
+ return -EINVAL;
+ }
+ }
+ }
+
+ /* set uncompleted segment to curseg */
+ curseg = CURSEG_I(sbi, type);
+ mutex_lock(&curseg->curseg_mutex);
+ memcpy(curseg->sum_blk, sum, PAGE_CACHE_SIZE);
+ curseg->next_segno = segno;
+ reset_curseg(sbi, type, 0);
+ curseg->alloc_type = ckpt->alloc_type[type];
+ curseg->next_blkoff = blk_off;
+ mutex_unlock(&curseg->curseg_mutex);
+ f2fs_put_page(new, 1);
+ return 0;
+}
+
+static int restore_curseg_summaries(struct f2fs_sb_info *sbi)
+{
+ int type = CURSEG_HOT_DATA;
+
+ if (sbi->ckpt->ckpt_flags & CP_COMPACT_SUM_FLAG) {
+ /* restore for compacted data summary */
+ if (read_compacted_summaries(sbi))
+ return -EINVAL;
+ type = CURSEG_HOT_NODE;
+ }
+
+ for (; type <= CURSEG_COLD_NODE; type++)
+ if (read_normal_summaries(sbi, type))
+ return -EINVAL;
+ return 0;
+}
+
+static void write_compacted_summaries(struct f2fs_sb_info *sbi, block_t blkaddr)
+{
+ struct page *page;
+ unsigned char *kaddr;
+ struct f2fs_summary *summary;
+ struct curseg_info *seg_i;
+ int written_size = 0;
+ int i, j;
+
+ page = grab_meta_page(sbi, blkaddr++);
+ kaddr = (unsigned char *)page_address(page);
+
+ /* Step 1: write nat cache */
+ seg_i = CURSEG_I(sbi, CURSEG_HOT_DATA);
+ memcpy(kaddr, &seg_i->sum_blk->n_nats, SUM_JOURNAL_SIZE);
+ written_size += SUM_JOURNAL_SIZE;
+
+ /* Step 2: write sit cache */
+ seg_i = CURSEG_I(sbi, CURSEG_COLD_DATA);
+ memcpy(kaddr + written_size, &seg_i->sum_blk->n_sits,
+ SUM_JOURNAL_SIZE);
+ written_size += SUM_JOURNAL_SIZE;
+
+ set_page_dirty(page);
+
+ /* Step 3: write summary entries */
+ for (i = CURSEG_HOT_DATA; i <= CURSEG_COLD_DATA; i++) {
+ unsigned short blkoff;
+ seg_i = CURSEG_I(sbi, i);
+ if (sbi->ckpt->alloc_type[i] == SSR)
+ blkoff = sbi->blocks_per_seg;
+ else
+ blkoff = curseg_blkoff(sbi, i);
+
+ for (j = 0; j < blkoff; j++) {
+ if (!page) {
+ page = grab_meta_page(sbi, blkaddr++);
+ kaddr = (unsigned char *)page_address(page);
+ written_size = 0;
+ }
+ summary = (struct f2fs_summary *)(kaddr + written_size);
+ *summary = seg_i->sum_blk->entries[j];
+ written_size += SUMMARY_SIZE;
+ set_page_dirty(page);
+
+ if (written_size + SUMMARY_SIZE <= PAGE_CACHE_SIZE -
+ SUM_FOOTER_SIZE)
+ continue;
+
+ f2fs_put_page(page, 1);
+ page = NULL;
+ }
+ }
+ if (page)
+ f2fs_put_page(page, 1);
+}
+
+static void write_normal_summaries(struct f2fs_sb_info *sbi,
+ block_t blkaddr, int type)
+{
+ int i, end;
+ if (IS_DATASEG(type))
+ end = type + NR_CURSEG_DATA_TYPE;
+ else
+ end = type + NR_CURSEG_NODE_TYPE;
+
+ for (i = type; i < end; i++) {
+ struct curseg_info *sum = CURSEG_I(sbi, i);
+ mutex_lock(&sum->curseg_mutex);
+ write_sum_page(sbi, sum->sum_blk, blkaddr + (i - type));
+ mutex_unlock(&sum->curseg_mutex);
+ }
+}
+
+void write_data_summaries(struct f2fs_sb_info *sbi, block_t start_blk)
+{
+ if (sbi->ckpt->ckpt_flags & CP_COMPACT_SUM_FLAG)
+ write_compacted_summaries(sbi, start_blk);
+ else
+ write_normal_summaries(sbi, start_blk, CURSEG_HOT_DATA);
+}
+
+void write_node_summaries(struct f2fs_sb_info *sbi, block_t start_blk)
+{
+ if (sbi->ckpt->ckpt_flags & CP_UMOUNT_FLAG)
+ write_normal_summaries(sbi, start_blk, CURSEG_HOT_NODE);
+ return;
+}
+
+int lookup_journal_in_cursum(struct f2fs_summary_block *sum, int type,
+ unsigned int val, int alloc)
+{
+ int i;
+
+ if (type == NAT_JOURNAL) {
+ for (i = 0; i < nats_in_cursum(sum); i++) {
+ if (le32_to_cpu(nid_in_journal(sum, i)) == val)
+ return i;
+ }
+ if (alloc && nats_in_cursum(sum) < NAT_JOURNAL_ENTRIES)
+ return update_nats_in_cursum(sum, 1);
+ } else if (type == SIT_JOURNAL) {
+ for (i = 0; i < sits_in_cursum(sum); i++)
+ if (le32_to_cpu(segno_in_journal(sum, i)) == val)
+ return i;
+ if (alloc && sits_in_cursum(sum) < SIT_JOURNAL_ENTRIES)
+ return update_sits_in_cursum(sum, 1);
+ }
+ return -1;
+}
+
+static struct page *get_current_sit_page(struct f2fs_sb_info *sbi,
+ unsigned int segno)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ unsigned int offset = SIT_BLOCK_OFFSET(sit_i, segno);
+ block_t blk_addr = sit_i->sit_base_addr + offset;
+
+ check_seg_range(sbi, segno);
+
+ /* calculate sit block address */
+ if (f2fs_test_bit(offset, sit_i->sit_bitmap))
+ blk_addr += sit_i->sit_blocks;
+
+ return get_meta_page(sbi, blk_addr);
+}
+
+static struct page *get_next_sit_page(struct f2fs_sb_info *sbi,
+ unsigned int start)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ struct page *src_page, *dst_page;
+ pgoff_t src_off, dst_off;
+ void *src_addr, *dst_addr;
+
+ src_off = current_sit_addr(sbi, start);
+ dst_off = next_sit_addr(sbi, src_off);
+
+ /* get current sit block page without lock */
+ src_page = get_meta_page(sbi, src_off);
+ dst_page = grab_meta_page(sbi, dst_off);
+ BUG_ON(PageDirty(src_page));
+
+ src_addr = page_address(src_page);
+ dst_addr = page_address(dst_page);
+ memcpy(dst_addr, src_addr, PAGE_CACHE_SIZE);
+
+ set_page_dirty(dst_page);
+ f2fs_put_page(src_page, 1);
+
+ set_to_next_sit(sit_i, start);
+
+ return dst_page;
+}
+
+static bool flush_sits_in_journal(struct f2fs_sb_info *sbi)
+{
+ struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_COLD_DATA);
+ struct f2fs_summary_block *sum = curseg->sum_blk;
+ int i;
+
+ /*
+ * If the journal area in the current summary is full of sit entries,
+ * all the sit entries will be flushed. Otherwise the sit entries
+ * are not able to replace with newly hot sit entries.
+ */
+ if (sits_in_cursum(sum) >= SIT_JOURNAL_ENTRIES) {
+ for (i = sits_in_cursum(sum) - 1; i >= 0; i--) {
+ unsigned int segno;
+ segno = le32_to_cpu(segno_in_journal(sum, i));
+ __mark_sit_entry_dirty(sbi, segno);
+ }
+ update_sits_in_cursum(sum, -sits_in_cursum(sum));
+ return 1;
+ }
+ return 0;
+}
+
+/**
+ * CP calls this function, which flushes SIT entries including sit_journal,
+ * and moves prefree segs to free segs.
+ */
+void flush_sit_entries(struct f2fs_sb_info *sbi)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ unsigned long *bitmap = sit_i->dirty_sentries_bitmap;
+ struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_COLD_DATA);
+ struct f2fs_summary_block *sum = curseg->sum_blk;
+ unsigned long nsegs = TOTAL_SEGS(sbi);
+ struct page *page = NULL;
+ struct f2fs_sit_block *raw_sit = NULL;
+ unsigned int start = 0, end = 0;
+ unsigned int segno = -1;
+ bool flushed;
+
+ mutex_lock(&curseg->curseg_mutex);
+ mutex_lock(&sit_i->sentry_lock);
+
+ /*
+ * "flushed" indicates whether sit entries in journal are flushed
+ * to the SIT area or not.
+ */
+ flushed = flush_sits_in_journal(sbi);
+
+ while ((segno = find_next_bit(bitmap, nsegs, segno + 1)) < nsegs) {
+ struct seg_entry *se = get_seg_entry(sbi, segno);
+ int sit_offset, offset;
+
+ sit_offset = SIT_ENTRY_OFFSET(sit_i, segno);
+
+ if (flushed)
+ goto to_sit_page;
+
+ offset = lookup_journal_in_cursum(sum, SIT_JOURNAL, segno, 1);
+ if (offset >= 0) {
+ segno_in_journal(sum, offset) = cpu_to_le32(segno);
+ seg_info_to_raw_sit(se, &sit_in_journal(sum, offset));
+ goto flush_done;
+ }
+to_sit_page:
+ if (!page || (start > segno) || (segno > end)) {
+ if (page) {
+ f2fs_put_page(page, 1);
+ page = NULL;
+ }
+
+ start = START_SEGNO(sit_i, segno);
+ end = start + SIT_ENTRY_PER_BLOCK - 1;
+
+ /* read sit block that will be updated */
+ page = get_next_sit_page(sbi, start);
+ raw_sit = page_address(page);
+ }
+
+ /* udpate entry in SIT block */
+ seg_info_to_raw_sit(se, &raw_sit->entries[sit_offset]);
+flush_done:
+ __clear_bit(segno, bitmap);
+ sit_i->dirty_sentries--;
+ }
+ mutex_unlock(&sit_i->sentry_lock);
+ mutex_unlock(&curseg->curseg_mutex);
+
+ /* writeout last modified SIT block */
+ f2fs_put_page(page, 1);
+
+ set_prefree_as_free_segments(sbi);
+}
+
+static int build_sit_info(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_super_block *raw_super = F2FS_RAW_SUPER(sbi);
+ struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
+ struct sit_info *sit_i;
+ unsigned int sit_segs, start;
+ char *src_bitmap, *dst_bitmap;
+ unsigned int bitmap_size;
+
+ /* allocate memory for SIT information */
+ sit_i = kzalloc(sizeof(struct sit_info), GFP_KERNEL);
+ if (!sit_i)
+ return -ENOMEM;
+
+ SM_I(sbi)->sit_info = sit_i;
+
+ sit_i->sentries = vzalloc(TOTAL_SEGS(sbi) * sizeof(struct seg_entry));
+ if (!sit_i->sentries)
+ return -ENOMEM;
+
+ bitmap_size = f2fs_bitmap_size(TOTAL_SEGS(sbi));
+ sit_i->dirty_sentries_bitmap = kzalloc(bitmap_size, GFP_KERNEL);
+ if (!sit_i->dirty_sentries_bitmap)
+ return -ENOMEM;
+
+ for (start = 0; start < TOTAL_SEGS(sbi); start++) {
+ sit_i->sentries[start].cur_valid_map
+ = kzalloc(SIT_VBLOCK_MAP_SIZE, GFP_KERNEL);
+ sit_i->sentries[start].ckpt_valid_map
+ = kzalloc(SIT_VBLOCK_MAP_SIZE, GFP_KERNEL);
+ if (!sit_i->sentries[start].cur_valid_map
+ || !sit_i->sentries[start].ckpt_valid_map)
+ return -ENOMEM;
+ }
+
+ if (sbi->segs_per_sec > 1) {
+ sit_i->sec_entries = vzalloc(sbi->total_sections *
+ sizeof(struct sec_entry));
+ if (!sit_i->sec_entries)
+ return -ENOMEM;
+ }
+
+ /* get information related with SIT */
+ sit_segs = le32_to_cpu(raw_super->segment_count_sit) >> 1;
+
+ /* setup SIT bitmap from ckeckpoint pack */
+ bitmap_size = __bitmap_size(sbi, SIT_BITMAP);
+ src_bitmap = __bitmap_ptr(sbi, SIT_BITMAP);
+
+ dst_bitmap = kzalloc(bitmap_size, GFP_KERNEL);
+ if (!dst_bitmap)
+ return -ENOMEM;
+ memcpy(dst_bitmap, src_bitmap, bitmap_size);
+
+ /* init SIT information */
+ sit_i->s_ops = &default_salloc_ops;
+
+ sit_i->sit_base_addr = le32_to_cpu(raw_super->sit_blkaddr);
+ sit_i->sit_blocks = sit_segs << sbi->log_blocks_per_seg;
+ sit_i->written_valid_blocks = le64_to_cpu(ckpt->valid_block_count);
+ sit_i->sit_bitmap = dst_bitmap;
+ sit_i->bitmap_size = bitmap_size;
+ sit_i->dirty_sentries = 0;
+ sit_i->sents_per_block = SIT_ENTRY_PER_BLOCK;
+ sit_i->elapsed_time = le64_to_cpu(sbi->ckpt->elapsed_time);
+ sit_i->mounted_time = CURRENT_TIME_SEC.tv_sec;
+ mutex_init(&sit_i->sentry_lock);
+ return 0;
+}
+
+static int build_free_segmap(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_sm_info *sm_info = SM_I(sbi);
+ struct free_segmap_info *free_i;
+ unsigned int bitmap_size, sec_bitmap_size;
+
+ /* allocate memory for free segmap information */
+ free_i = kzalloc(sizeof(struct free_segmap_info), GFP_KERNEL);
+ if (!free_i)
+ return -ENOMEM;
+
+ SM_I(sbi)->free_info = free_i;
+
+ bitmap_size = f2fs_bitmap_size(TOTAL_SEGS(sbi));
+ free_i->free_segmap = kmalloc(bitmap_size, GFP_KERNEL);
+ if (!free_i->free_segmap)
+ return -ENOMEM;
+
+ sec_bitmap_size = f2fs_bitmap_size(sbi->total_sections);
+ free_i->free_secmap = kmalloc(sec_bitmap_size, GFP_KERNEL);
+ if (!free_i->free_secmap)
+ return -ENOMEM;
+
+ /* set all segments as dirty temporarily */
+ memset(free_i->free_segmap, 0xff, bitmap_size);
+ memset(free_i->free_secmap, 0xff, sec_bitmap_size);
+
+ /* init free segmap information */
+ free_i->start_segno =
+ (unsigned int) GET_SEGNO_FROM_SEG0(sbi, sm_info->main_blkaddr);
+ free_i->free_segments = 0;
+ free_i->free_sections = 0;
+ rwlock_init(&free_i->segmap_lock);
+ return 0;
+}
+
+static int build_curseg(struct f2fs_sb_info *sbi)
+{
+ struct curseg_info *array = NULL;
+ int i;
+
+ array = kzalloc(sizeof(*array) * NR_CURSEG_TYPE, GFP_KERNEL);
+ if (!array)
+ return -ENOMEM;
+
+ SM_I(sbi)->curseg_array = array;
+
+ for (i = 0; i < NR_CURSEG_TYPE; i++) {
+ mutex_init(&array[i].curseg_mutex);
+ array[i].sum_blk = kzalloc(PAGE_CACHE_SIZE, GFP_KERNEL);
+ if (!array[i].sum_blk)
+ return -ENOMEM;
+ array[i].segno = NULL_SEGNO;
+ array[i].next_blkoff = 0;
+ }
+ return restore_curseg_summaries(sbi);
+}
+
+static void build_sit_entries(struct f2fs_sb_info *sbi)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_COLD_DATA);
+ struct f2fs_summary_block *sum = curseg->sum_blk;
+ unsigned int start;
+
+ for (start = 0; start < TOTAL_SEGS(sbi); start++) {
+ struct seg_entry *se = &sit_i->sentries[start];
+ struct f2fs_sit_block *sit_blk;
+ struct f2fs_sit_entry sit;
+ struct page *page;
+ int i;
+
+ mutex_lock(&curseg->curseg_mutex);
+ for (i = 0; i < sits_in_cursum(sum); i++) {
+ if (le32_to_cpu(segno_in_journal(sum, i)) == start) {
+ sit = sit_in_journal(sum, i);
+ mutex_unlock(&curseg->curseg_mutex);
+ goto got_it;
+ }
+ }
+ mutex_unlock(&curseg->curseg_mutex);
+ page = get_current_sit_page(sbi, start);
+ sit_blk = (struct f2fs_sit_block *)page_address(page);
+ sit = sit_blk->entries[SIT_ENTRY_OFFSET(sit_i, start)];
+ f2fs_put_page(page, 1);
+got_it:
+ check_block_count(sbi, start, &sit);
+ seg_info_from_raw_sit(se, &sit);
+ if (sbi->segs_per_sec > 1) {
+ struct sec_entry *e = get_sec_entry(sbi, start);
+ e->valid_blocks += se->valid_blocks;
+ }
+ }
+}
+
+static void init_free_segmap(struct f2fs_sb_info *sbi)
+{
+ unsigned int start;
+ int type;
+
+ for (start = 0; start < TOTAL_SEGS(sbi); start++) {
+ struct seg_entry *sentry = get_seg_entry(sbi, start);
+ if (!sentry->valid_blocks)
+ __set_free(sbi, start);
+ }
+
+ /* set use the current segments */
+ for (type = CURSEG_HOT_DATA; type <= CURSEG_COLD_NODE; type++) {
+ struct curseg_info *curseg_t = CURSEG_I(sbi, type);
+ __set_test_and_inuse(sbi, curseg_t->segno);
+ }
+}
+
+static void init_dirty_segmap(struct f2fs_sb_info *sbi)
+{
+ struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
+ struct free_segmap_info *free_i = FREE_I(sbi);
+ unsigned int segno = 0, offset = 0;
+ unsigned short valid_blocks;
+
+ while (segno < TOTAL_SEGS(sbi)) {
+ /* find dirty segment based on free segmap */
+ segno = find_next_inuse(free_i, TOTAL_SEGS(sbi), offset);
+ if (segno >= TOTAL_SEGS(sbi))
+ break;
+ offset = segno + 1;
+ valid_blocks = get_valid_blocks(sbi, segno, 0);
+ if (valid_blocks >= sbi->blocks_per_seg || !valid_blocks)
+ continue;
+ mutex_lock(&dirty_i->seglist_lock);
+ __locate_dirty_segment(sbi, segno, DIRTY);
+ mutex_unlock(&dirty_i->seglist_lock);
+ }
+}
+
+static int init_victim_segmap(struct f2fs_sb_info *sbi)
+{
+ struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
+ unsigned int bitmap_size = f2fs_bitmap_size(TOTAL_SEGS(sbi));
+
+ dirty_i->victim_segmap[FG_GC] = kzalloc(bitmap_size, GFP_KERNEL);
+ dirty_i->victim_segmap[BG_GC] = kzalloc(bitmap_size, GFP_KERNEL);
+ if (!dirty_i->victim_segmap[FG_GC] || !dirty_i->victim_segmap[BG_GC])
+ return -ENOMEM;
+ return 0;
+}
+
+static int build_dirty_segmap(struct f2fs_sb_info *sbi)
+{
+ struct dirty_seglist_info *dirty_i;
+ unsigned int bitmap_size, i;
+
+ /* allocate memory for dirty segments list information */
+ dirty_i = kzalloc(sizeof(struct dirty_seglist_info), GFP_KERNEL);
+ if (!dirty_i)
+ return -ENOMEM;
+
+ SM_I(sbi)->dirty_info = dirty_i;
+ mutex_init(&dirty_i->seglist_lock);
+
+ bitmap_size = f2fs_bitmap_size(TOTAL_SEGS(sbi));
+
+ for (i = 0; i < NR_DIRTY_TYPE; i++) {
+ dirty_i->dirty_segmap[i] = kzalloc(bitmap_size, GFP_KERNEL);
+ dirty_i->nr_dirty[i] = 0;
+ if (!dirty_i->dirty_segmap[i])
+ return -ENOMEM;
+ }
+
+ init_dirty_segmap(sbi);
+ return init_victim_segmap(sbi);
+}
+
+/**
+ * Update min, max modified time for cost-benefit GC algorithm
+ */
+static void init_min_max_mtime(struct f2fs_sb_info *sbi)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ unsigned int segno;
+
+ mutex_lock(&sit_i->sentry_lock);
+
+ sit_i->min_mtime = LLONG_MAX;
+
+ for (segno = 0; segno < TOTAL_SEGS(sbi); segno += sbi->segs_per_sec) {
+ unsigned int i;
+ unsigned long long mtime = 0;
+
+ for (i = 0; i < sbi->segs_per_sec; i++)
+ mtime += get_seg_entry(sbi, segno + i)->mtime;
+
+ mtime = div_u64(mtime, sbi->segs_per_sec);
+
+ if (sit_i->min_mtime > mtime)
+ sit_i->min_mtime = mtime;
+ }
+ sit_i->max_mtime = get_mtime(sbi);
+ mutex_unlock(&sit_i->sentry_lock);
+}
+
+int build_segment_manager(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_super_block *raw_super = F2FS_RAW_SUPER(sbi);
+ struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
+ struct f2fs_sm_info *sm_info = NULL;
+ int err;
+
+ sm_info = kzalloc(sizeof(struct f2fs_sm_info), GFP_KERNEL);
+ if (!sm_info)
+ return -ENOMEM;
+
+ /* init sm info */
+ sbi->sm_info = sm_info;
+ INIT_LIST_HEAD(&sm_info->wblist_head);
+ spin_lock_init(&sm_info->wblist_lock);
+ sm_info->seg0_blkaddr = le32_to_cpu(raw_super->segment0_blkaddr);
+ sm_info->main_blkaddr = le32_to_cpu(raw_super->main_blkaddr);
+ sm_info->segment_count = le32_to_cpu(raw_super->segment_count);
+ sm_info->reserved_segments = le32_to_cpu(ckpt->rsvd_segment_count);
+ sm_info->ovp_segments = le32_to_cpu(ckpt->overprov_segment_count);
+ sm_info->main_segments = le32_to_cpu(raw_super->segment_count_main);
+ sm_info->ssa_blkaddr = le32_to_cpu(raw_super->ssa_blkaddr);
+
+ err = build_sit_info(sbi);
+ if (err)
+ return err;
+ err = build_free_segmap(sbi);
+ if (err)
+ return err;
+ err = build_curseg(sbi);
+ if (err)
+ return err;
+
+ /* reinit free segmap based on SIT */
+ build_sit_entries(sbi);
+
+ init_free_segmap(sbi);
+ err = build_dirty_segmap(sbi);
+ if (err)
+ return err;
+
+ init_min_max_mtime(sbi);
+ return 0;
+}
+
+static void discard_dirty_segmap(struct f2fs_sb_info *sbi,
+ enum dirty_type dirty_type)
+{
+ struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
+
+ mutex_lock(&dirty_i->seglist_lock);
+ kfree(dirty_i->dirty_segmap[dirty_type]);
+ dirty_i->nr_dirty[dirty_type] = 0;
+ mutex_unlock(&dirty_i->seglist_lock);
+}
+
+void reset_victim_segmap(struct f2fs_sb_info *sbi)
+{
+ unsigned int bitmap_size = f2fs_bitmap_size(TOTAL_SEGS(sbi));
+ memset(DIRTY_I(sbi)->victim_segmap[FG_GC], 0, bitmap_size);
+}
+
+static void destroy_victim_segmap(struct f2fs_sb_info *sbi)
+{
+ struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
+
+ kfree(dirty_i->victim_segmap[FG_GC]);
+ kfree(dirty_i->victim_segmap[BG_GC]);
+}
+
+static void destroy_dirty_segmap(struct f2fs_sb_info *sbi)
+{
+ struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
+ int i;
+
+ if (!dirty_i)
+ return;
+
+ /* discard pre-free/dirty segments list */
+ for (i = 0; i < NR_DIRTY_TYPE; i++)
+ discard_dirty_segmap(sbi, i);
+
+ destroy_victim_segmap(sbi);
+ SM_I(sbi)->dirty_info = NULL;
+ kfree(dirty_i);
+}
+
+static void destroy_curseg(struct f2fs_sb_info *sbi)
+{
+ struct curseg_info *array = SM_I(sbi)->curseg_array;
+ int i;
+
+ if (!array)
+ return;
+ SM_I(sbi)->curseg_array = NULL;
+ for (i = 0; i < NR_CURSEG_TYPE; i++)
+ kfree(array[i].sum_blk);
+ kfree(array);
+}
+
+static void destroy_free_segmap(struct f2fs_sb_info *sbi)
+{
+ struct free_segmap_info *free_i = SM_I(sbi)->free_info;
+ if (!free_i)
+ return;
+ SM_I(sbi)->free_info = NULL;
+ kfree(free_i->free_segmap);
+ kfree(free_i->free_secmap);
+ kfree(free_i);
+}
+
+static void destroy_sit_info(struct f2fs_sb_info *sbi)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ unsigned int start;
+
+ if (!sit_i)
+ return;
+
+ if (sit_i->sentries) {
+ for (start = 0; start < TOTAL_SEGS(sbi); start++) {
+ kfree(sit_i->sentries[start].cur_valid_map);
+ kfree(sit_i->sentries[start].ckpt_valid_map);
+ }
+ }
+ vfree(sit_i->sentries);
+ vfree(sit_i->sec_entries);
+ kfree(sit_i->dirty_sentries_bitmap);
+
+ SM_I(sbi)->sit_info = NULL;
+ kfree(sit_i->sit_bitmap);
+ kfree(sit_i);
+}
+
+void destroy_segment_manager(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_sm_info *sm_info = SM_I(sbi);
+ destroy_dirty_segmap(sbi);
+ destroy_curseg(sbi);
+ destroy_free_segmap(sbi);
+ destroy_sit_info(sbi);
+ sbi->sm_info = NULL;
+ kfree(sm_info);
+}
--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-10-31 09:59:05

[permalink] [raw]

Subject: RE: [PATCH 03/17 v2] f2fs: add superblock and major in-memory structure

Change log from v1:
o Resolve checkpatch.pl errors

This adds the following major in-memory structures in f2fs.

- f2fs_sb_info:
contains f2fs-specific information, two special inode pointers for node and
meta address spaces, and orphan inode management.

- f2fs_inode_info:
contains vfs_inode and other fs-specific information.

- f2fs_nm_info:
contains node manager information such as NAT entry cache, free nid list,
and NAT page management.

- f2fs_node_info:
represents a node as node id, inode number, block address, and its version.

- f2fs_sm_info:
contains segment manager information such as SIT entry cache, free segment
map, current active logs, dirty segment management, and segment utilization.
The specific structures are sit_info, free_segmap_info, dirty_seglist_info,
curseg_info.

Signed-off-by: Chul Lee <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/f2fs.h | 1062 +++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/f2fs/node.h | 353 ++++++++++++++++++
fs/f2fs/segment.h | 615 +++++++++++++++++++++++++++++++
3 files changed, 2030 insertions(+)
create mode 100644 fs/f2fs/f2fs.h
create mode 100644 fs/f2fs/node.h
create mode 100644 fs/f2fs/segment.h

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
new file mode 100644
index 0000000..7aa70b5
--- /dev/null
+++ b/fs/f2fs/f2fs.h
@@ -0,0 +1,1062 @@
+/**
+ * fs/f2fs/f2fs.h
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef _LINUX_F2FS_H
+#define _LINUX_F2FS_H
+
+#include <linux/types.h>
+#include <linux/page-flags.h>
+#include <linux/buffer_head.h>
+#include <linux/version.h>
+#include <linux/slab.h>
+#include <linux/crc32.h>
+#include <linux/magic.h>
+
+/*
+ * For mount options
+ */
+#define F2FS_MOUNT_BG_GC 0x00000001
+#define F2FS_MOUNT_DISABLE_ROLL_FORWARD 0x00000002
+#define F2FS_MOUNT_DISCARD 0x00000004
+#define F2FS_MOUNT_NOHEAP 0x00000008
+#define F2FS_MOUNT_XATTR_USER 0x00000010
+#define F2FS_MOUNT_POSIX_ACL 0x00000020
+#define F2FS_MOUNT_DISABLE_EXT_IDENTIFY 0x00000040
+
+#define clear_opt(sbi, option) (sbi->mount_opt.opt &= ~F2FS_MOUNT_##option)
+#define set_opt(sbi, option) (sbi->mount_opt.opt |= F2FS_MOUNT_##option)
+#define test_opt(sbi, option) (sbi->mount_opt.opt & F2FS_MOUNT_##option)
+
+#define ver_after(a, b) (typecheck(unsigned long long, a) && \
+ typecheck(unsigned long long, b) && \
+ ((long long)((a) - (b)) > 0))
+
+typedef u64 block_t;
+typedef u32 nid_t;
+
+struct f2fs_mount_info {
+ unsigned int opt;
+};
+
+static inline __u32 f2fs_crc32(void *buff, size_t len)
+{
+ return crc32_le(F2FS_SUPER_MAGIC, buff, len);
+}
+
+static inline bool f2fs_crc_valid(__u32 blk_crc, void *buff, size_t buff_size)
+{
+ return f2fs_crc32(buff, buff_size) == blk_crc;
+}
+
+/*
+ * For checkpoint manager
+ */
+enum {
+ NAT_BITMAP,
+ SIT_BITMAP
+};
+
+/* for the list of orphan inodes */
+struct orphan_inode_entry {
+ struct list_head list; /* list head */
+ nid_t ino; /* inode number */
+};
+
+/* for the list of directory inodes */
+struct dir_inode_entry {
+ struct list_head list; /* list head */
+ struct inode *inode; /* vfs inode pointer */
+};
+
+/* for the list of fsync inodes, used only during recovery */
+struct fsync_inode_entry {
+ struct list_head list; /* list head */
+ struct inode *inode; /* vfs inode pointer */
+ block_t blkaddr; /* block address locating the last inode */
+};
+
+#define nats_in_cursum(sum) (le16_to_cpu(sum->n_nats))
+#define sits_in_cursum(sum) (le16_to_cpu(sum->n_sits))
+
+#define nat_in_journal(sum, i) (sum->nat_j.entries[i].ne)
+#define nid_in_journal(sum, i) (sum->nat_j.entries[i].nid)
+#define sit_in_journal(sum, i) (sum->sit_j.entries[i].se)
+#define segno_in_journal(sum, i) (sum->sit_j.entries[i].segno)
+
+static inline int update_nats_in_cursum(struct f2fs_summary_block *rs, int i)
+{
+ int before = nats_in_cursum(rs);
+ rs->n_nats = cpu_to_le16(before + i);
+ return before;
+}
+
+static inline int update_sits_in_cursum(struct f2fs_summary_block *rs, int i)
+{
+ int before = sits_in_cursum(rs);
+ rs->n_sits = cpu_to_le16(before + i);
+ return before;
+}
+
+/*
+ * For INODE and NODE manager
+ */
+#define XATTR_NODE_OFFSET (-1) /*
+ * store xattrs to one node block per
+ * file keeping -1 as its node offset to
+ * distinguish from index node blocks.
+ */
+#define RDONLY_NODE 1 /*
+ * specify a read-only mode when getting
+ * a node block. 0 is read-write mode.
+ * used by get_dnode_of_data().
+ */
+#define F2FS_LINK_MAX 32000 /* maximum link count per file */
+
+/* for in-memory extent cache entry */
+struct extent_info {
+ rwlock_t ext_lock; /* rwlock for consistency */
+ unsigned int fofs; /* start offset in a file */
+ u32 blk_addr; /* start block address of the extent */
+ unsigned int len; /* lenth of the extent */
+};
+
+/*
+ * i_advise uses FADVISE_XXX_BIT. We can add additional hints later.
+ */
+#define FADVISE_COLD_BIT 0x01
+
+struct f2fs_inode_info {
+ struct inode vfs_inode; /* serve a vfs inode */
+ unsigned long i_flags; /* keep an inode flags for ioctl */
+ unsigned char i_advise; /* use to give file attribute hints */
+ unsigned int i_current_depth; /* use only in directory structure */
+ umode_t i_acl_mode; /* keep file acl mode temporarily */
+
+ /* Use below internally in f2fs*/
+ unsigned long flags; /* use to pass per-file flags */
+ unsigned long long data_version;/* lastes version of data for fsync */
+ atomic_t dirty_dents; /* # of dirty dentry pages */
+ f2fs_hash_t chash; /* hash value of given file name */
+ unsigned int clevel; /* maximum level of given file name */
+ nid_t i_xattr_nid; /* node id that contains xattrs */
+ struct extent_info ext; /* in-memory extent cache entry */
+};
+
+static inline void get_extent_info(struct extent_info *ext,
+ struct f2fs_extent i_ext)
+{
+ write_lock(&ext->ext_lock);
+ ext->fofs = le32_to_cpu(i_ext.fofs);
+ ext->blk_addr = le32_to_cpu(i_ext.blk_addr);
+ ext->len = le32_to_cpu(i_ext.len);
+ write_unlock(&ext->ext_lock);
+}
+
+static inline void set_raw_extent(struct extent_info *ext,
+ struct f2fs_extent *i_ext)
+{
+ read_lock(&ext->ext_lock);
+ i_ext->fofs = cpu_to_le32(ext->fofs);
+ i_ext->blk_addr = cpu_to_le32(ext->blk_addr);
+ i_ext->len = cpu_to_le32(ext->len);
+ read_unlock(&ext->ext_lock);
+}
+
+struct f2fs_nm_info {
+ block_t nat_blkaddr; /* base disk address of NAT */
+ nid_t max_nid; /* maximum possible node ids */
+ nid_t init_scan_nid; /* the first nid to be scanned */
+ nid_t next_scan_nid; /* the next nid to be scanned */
+
+ /* NAT cache management */
+ struct radix_tree_root nat_root;/* root of the nat entry cache */
+ rwlock_t nat_tree_lock; /* protect nat_tree_lock */
+ unsigned int nat_cnt; /* the # of cached nat entries */
+ struct list_head nat_entries; /* cached nat entry list (clean) */
+ struct list_head dirty_nat_entries; /* cached nat entry list (dirty) */
+
+ /* free node ids management */
+ struct list_head free_nid_list; /* a list for free nids */
+ spinlock_t free_nid_list_lock; /* protect free nid list */
+ unsigned int fcnt; /* the number of free node id */
+ struct mutex build_lock; /* lock for build free nids */
+
+ /* for checkpoint */
+ char *nat_bitmap; /* NAT bitmap pointer */
+ int bitmap_size; /* bitmap size */
+};
+
+/*
+ * this structure is used as one of function parameters.
+ * all the information are dedicated to a given direct node block determined
+ * by the data offset in a file.
+ */
+struct dnode_of_data {
+ struct inode *inode; /* vfs inode pointer */
+ struct page *inode_page; /* its inode page, NULL is possible */
+ struct page *node_page; /* cached direct node page */
+ nid_t nid; /* node id of the direct node block */
+ unsigned int ofs_in_node; /* data offset in the node page */
+ bool inode_page_locked; /* inode page is locked or not */
+ block_t data_blkaddr; /* block address of the node block */
+};
+
+static inline void set_new_dnode(struct dnode_of_data *dn, struct inode *inode,
+ struct page *ipage, struct page *npage, nid_t nid)
+{
+ dn->inode = inode;
+ dn->inode_page = ipage;
+ dn->node_page = npage;
+ dn->nid = nid;
+ dn->inode_page_locked = 0;
+}
+
+/*
+ * For SIT manager
+ *
+ * By default, there are 6 active log areas across the whole main area.
+ * When considering hot and cold data separation to reduce cleaning overhead,
+ * we split 3 for data logs and 3 for node logs as hot, warm, and cold types,
+ * respectively.
+ * In the current design, you should not change the numbers intentionally.
+ * Instead, as a mount option such as active_logs=x, you can use 2, 4, and 6
+ * logs individually according to the underlying devices. (default: 6)
+ * Just in case, on-disk layout covers maximum 16 logs that consist of 8 for
+ * data and 8 for node logs.
+ */
+#define NR_CURSEG_DATA_TYPE (3)
+#define NR_CURSEG_NODE_TYPE (3)
+#define NR_CURSEG_TYPE (NR_CURSEG_DATA_TYPE + NR_CURSEG_NODE_TYPE)
+
+enum {
+ CURSEG_HOT_DATA = 0, /* directory entry blocks */
+ CURSEG_WARM_DATA, /* data blocks */
+ CURSEG_COLD_DATA, /* multimedia or GCed data blocks */
+ CURSEG_HOT_NODE, /* direct node blocks of directory files */
+ CURSEG_WARM_NODE, /* direct node blocks of normal files */
+ CURSEG_COLD_NODE, /* indirect node blocks */
+ NO_CHECK_TYPE
+};
+
+struct f2fs_sm_info {
+ struct sit_info *sit_info; /* whole segment information */
+ struct free_segmap_info *free_info; /* free segment information */
+ struct dirty_seglist_info *dirty_info; /* dirty segment information */
+ struct curseg_info *curseg_array; /* active segment information */
+
+ struct list_head wblist_head; /* list of under-writeback pages */
+ spinlock_t wblist_lock; /* lock for checkpoint */
+
+ block_t seg0_blkaddr; /* block address of 0'th segment */
+ block_t main_blkaddr; /* start block address of main area */
+ block_t ssa_blkaddr; /* start block address of SSA area */
+
+ unsigned int segment_count; /* total # of segments */
+ unsigned int main_segments; /* # of segments in main area */
+ unsigned int reserved_segments; /* # of reserved segments */
+ unsigned int ovp_segments; /* # of overprovision segments */
+};
+
+/*
+ * For directory operation
+ */
+#define NODE_DIR1_BLOCK (ADDRS_PER_INODE + 1)
+#define NODE_DIR2_BLOCK (ADDRS_PER_INODE + 2)
+#define NODE_IND1_BLOCK (ADDRS_PER_INODE + 3)
+#define NODE_IND2_BLOCK (ADDRS_PER_INODE + 4)
+#define NODE_DIND_BLOCK (ADDRS_PER_INODE + 5)
+
+/*
+ * For superblock
+ */
+/*
+ * COUNT_TYPE for monitoring
+ *
+ * f2fs monitors the number of several block types such as on-writeback,
+ * dirty dentry blocks, dirty node blocks, and dirty meta blocks.
+ */
+enum count_type {
+ F2FS_WRITEBACK,
+ F2FS_DIRTY_DENTS,
+ F2FS_DIRTY_NODES,
+ F2FS_DIRTY_META,
+ NR_COUNT_TYPE,
+};
+
+/*
+ * FS_LOCK nesting subclasses for the lock validator:
+ *
+ * The locking order between these classes is
+ * RENAME -> DENTRY_OPS -> DATA_WRITE -> DATA_NEW
+ * -> DATA_TRUNC -> NODE_WRITE -> NODE_NEW -> NODE_TRUNC
+ */
+enum lock_type {
+ RENAME, /* for renaming operations */
+ DENTRY_OPS, /* for directory operations */
+ DATA_WRITE, /* for data write */
+ DATA_NEW, /* for data allocation */
+ DATA_TRUNC, /* for data truncate */
+ NODE_NEW, /* for node allocation */
+ NODE_TRUNC, /* for node truncate */
+ NODE_WRITE, /* for node write */
+ NR_LOCK_TYPE,
+};
+
+/*
+ * The below are the page types of bios used in submti_bio().
+ * The available types are:
+ * DATA User data pages. It operates as async mode.
+ * NODE Node pages. It operates as async mode.
+ * META FS metadata pages such as SIT, NAT, CP.
+ * NR_PAGE_TYPE The number of page types.
+ * META_FLUSH Make sure the previous pages are written
+ * with waiting the bio's completion
+ * ... Only can be used with META.
+ */
+enum page_type {
+ DATA,
+ NODE,
+ META,
+ NR_PAGE_TYPE,
+ META_FLUSH,
+};
+
+struct f2fs_sb_info {
+ struct super_block *sb; /* pointer to VFS super block */
+ struct buffer_head *raw_super_buf; /* buffer head of raw sb */
+ struct f2fs_super_block *raw_super; /* raw super block pointer */
+ int s_dirty; /* dirty flag for checkpoint */
+
+ /* for node-related operations */
+ struct f2fs_nm_info *nm_info; /* node manager */
+ struct inode *node_inode; /* cache node blocks */
+
+ /* for segment-related operations */
+ struct f2fs_sm_info *sm_info; /* segment manager */
+ struct bio *bio[NR_PAGE_TYPE]; /* bios to merge */
+ sector_t last_block_in_bio[NR_PAGE_TYPE]; /* last block number */
+ struct rw_semaphore bio_sem; /* IO semaphore */
+
+ /* for checkpoint */
+ struct f2fs_checkpoint *ckpt; /* raw checkpoint pointer */
+ struct inode *meta_inode; /* cache meta blocks */
+ struct mutex cp_mutex; /* for checkpoint procedure */
+ struct mutex fs_lock[NR_LOCK_TYPE]; /* for blocking FS operations */
+ struct mutex write_inode; /* mutex for write inode */
+ struct mutex writepages; /* mutex for writepages() */
+ int por_doing; /* recovery is doing or not */
+
+ /* for orphan inode management */
+ struct list_head orphan_inode_list; /* orphan inode list */
+ struct mutex orphan_inode_mutex; /* for orphan inode list */
+ unsigned int n_orphans; /* # of orphan inodes */
+
+ /* for directory inode management */
+ struct list_head dir_inode_list; /* dir inode list */
+ spinlock_t dir_inode_lock; /* for dir inode list lock */
+ unsigned int n_dirty_dirs; /* # of dir inodes */
+
+ /* basic file system units */
+ unsigned int log_sectors_per_block; /* log2 sectors per block */
+ unsigned int log_blocksize; /* log2 block size */
+ unsigned int blocksize; /* block size */
+ unsigned int root_ino_num; /* root inode number*/
+ unsigned int node_ino_num; /* node inode number*/
+ unsigned int meta_ino_num; /* meta inode number*/
+ unsigned int log_blocks_per_seg; /* log2 blocks per segment */
+ unsigned int blocks_per_seg; /* blocks per segment */
+ unsigned int segs_per_sec; /* segments per section */
+ unsigned int secs_per_zone; /* sections per zone */
+ unsigned int total_sections; /* total section count */
+ unsigned int total_node_count; /* total node block count */
+ unsigned int total_valid_node_count; /* valid node block count */
+ unsigned int total_valid_inode_count; /* valid inode count */
+ int active_logs; /* # of active logs */
+
+ block_t user_block_count; /* # of user blocks */
+ block_t total_valid_block_count; /* # of valid blocks */
+ block_t alloc_valid_block_count; /* # of allocated blocks */
+ block_t last_valid_block_count; /* for recovery */
+ u32 s_next_generation; /* for NFS support */
+ atomic_t nr_pages[NR_COUNT_TYPE]; /* # of pages, see count_type */
+
+ struct f2fs_mount_info mount_opt; /* mount options */
+
+ /* for cleaning operations */
+ struct mutex gc_mutex; /* mutex for GC */
+ struct f2fs_gc_kthread *gc_thread; /* GC thread */
+
+ /*
+ * for stat information.
+ * one is for the LFS mode, and the other is for the SSR mode.
+ */
+ struct f2fs_stat_info *stat_info; /* FS status information */
+ unsigned int segment_count[2]; /* # of allocated segments */
+ unsigned int block_count[2]; /* # of allocated blocks */
+ unsigned int last_victim[2]; /* last victim segment # */
+ int total_hit_ext, read_hit_ext; /* extent cache hit ratio */
+ int bg_gc; /* background gc calls */
+ spinlock_t stat_lock; /* lock for stat operations */
+};
+
+/*
+ * Inline functions
+ */
+static inline struct f2fs_inode_info *F2FS_I(struct inode *inode)
+{
+ return container_of(inode, struct f2fs_inode_info, vfs_inode);
+}
+
+static inline struct f2fs_sb_info *F2FS_SB(struct super_block *sb)
+{
+ return sb->s_fs_info;
+}
+
+static inline struct f2fs_super_block *F2FS_RAW_SUPER(struct f2fs_sb_info *sbi)
+{
+ return (struct f2fs_super_block *)(sbi->raw_super);
+}
+
+static inline struct f2fs_checkpoint *F2FS_CKPT(struct f2fs_sb_info *sbi)
+{
+ return (struct f2fs_checkpoint *)(sbi->ckpt);
+}
+
+static inline struct f2fs_nm_info *NM_I(struct f2fs_sb_info *sbi)
+{
+ return (struct f2fs_nm_info *)(sbi->nm_info);
+}
+
+static inline struct f2fs_sm_info *SM_I(struct f2fs_sb_info *sbi)
+{
+ return (struct f2fs_sm_info *)(sbi->sm_info);
+}
+
+static inline struct sit_info *SIT_I(struct f2fs_sb_info *sbi)
+{
+ return (struct sit_info *)(SM_I(sbi)->sit_info);
+}
+
+static inline struct free_segmap_info *FREE_I(struct f2fs_sb_info *sbi)
+{
+ return (struct free_segmap_info *)(SM_I(sbi)->free_info);
+}
+
+static inline struct dirty_seglist_info *DIRTY_I(struct f2fs_sb_info *sbi)
+{
+ return (struct dirty_seglist_info *)(SM_I(sbi)->dirty_info);
+}
+
+static inline void F2FS_SET_SB_DIRT(struct f2fs_sb_info *sbi)
+{
+ sbi->s_dirty = 1;
+}
+
+static inline void F2FS_RESET_SB_DIRT(struct f2fs_sb_info *sbi)
+{
+ sbi->s_dirty = 0;
+}
+
+static inline void mutex_lock_op(struct f2fs_sb_info *sbi, enum lock_type t)
+{
+ mutex_lock_nested(&sbi->fs_lock[t], t);
+}
+
+static inline void mutex_unlock_op(struct f2fs_sb_info *sbi, enum lock_type t)
+{
+ mutex_unlock(&sbi->fs_lock[t]);
+}
+
+/*
+ * Check whether the given nid is within node id range.
+ */
+static inline void check_nid_range(struct f2fs_sb_info *sbi, nid_t nid)
+{
+ BUG_ON((nid >= NM_I(sbi)->max_nid));
+}
+
+#define F2FS_DEFAULT_ALLOCATED_BLOCKS 1
+
+/*
+ * Check whether the inode has blocks or not
+ */
+static inline int F2FS_HAS_BLOCKS(struct inode *inode)
+{
+ if (F2FS_I(inode)->i_xattr_nid)
+ return (inode->i_blocks > F2FS_DEFAULT_ALLOCATED_BLOCKS + 1);
+ else
+ return (inode->i_blocks > F2FS_DEFAULT_ALLOCATED_BLOCKS);
+}
+
+static inline bool inc_valid_block_count(struct f2fs_sb_info *sbi,
+ struct inode *inode, blkcnt_t count)
+{
+ block_t valid_block_count;
+
+ spin_lock(&sbi->stat_lock);
+ valid_block_count =
+ sbi->total_valid_block_count + (block_t)count;
+ if (valid_block_count > sbi->user_block_count) {
+ spin_unlock(&sbi->stat_lock);
+ return false;
+ }
+ inode->i_blocks += count;
+ sbi->total_valid_block_count = valid_block_count;
+ sbi->alloc_valid_block_count += (block_t)count;
+ spin_unlock(&sbi->stat_lock);
+ return true;
+}
+
+static inline int dec_valid_block_count(struct f2fs_sb_info *sbi,
+ struct inode *inode,
+ blkcnt_t count)
+{
+ spin_lock(&sbi->stat_lock);
+ BUG_ON(sbi->total_valid_block_count < (block_t) count);
+ BUG_ON(inode->i_blocks < count);
+ inode->i_blocks -= count;
+ sbi->total_valid_block_count -= (block_t)count;
+ spin_unlock(&sbi->stat_lock);
+ return 0;
+}
+
+static inline void inc_page_count(struct f2fs_sb_info *sbi, int count_type)
+{
+ atomic_inc(&sbi->nr_pages[count_type]);
+ F2FS_SET_SB_DIRT(sbi);
+}
+
+static inline void inode_inc_dirty_dents(struct inode *inode)
+{
+ atomic_inc(&F2FS_I(inode)->dirty_dents);
+}
+
+static inline void dec_page_count(struct f2fs_sb_info *sbi, int count_type)
+{
+ atomic_dec(&sbi->nr_pages[count_type]);
+}
+
+static inline void inode_dec_dirty_dents(struct inode *inode)
+{
+ atomic_dec(&F2FS_I(inode)->dirty_dents);
+}
+
+static inline int get_pages(struct f2fs_sb_info *sbi, int count_type)
+{
+ return atomic_read(&sbi->nr_pages[count_type]);
+}
+
+static inline block_t valid_user_blocks(struct f2fs_sb_info *sbi)
+{
+ block_t ret;
+ spin_lock(&sbi->stat_lock);
+ ret = sbi->total_valid_block_count;
+ spin_unlock(&sbi->stat_lock);
+ return ret;
+}
+
+static inline unsigned long __bitmap_size(struct f2fs_sb_info *sbi, int flag)
+{
+ struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
+
+ /* return NAT or SIT bitmap */
+ if (flag == NAT_BITMAP)
+ return le32_to_cpu(ckpt->nat_ver_bitmap_bytesize);
+ else if (flag == SIT_BITMAP)
+ return le32_to_cpu(ckpt->sit_ver_bitmap_bytesize);
+
+ return 0;
+}
+
+static inline void *__bitmap_ptr(struct f2fs_sb_info *sbi, int flag)
+{
+ struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
+ int offset = (flag == NAT_BITMAP) ? ckpt->sit_ver_bitmap_bytesize : 0;
+ return &ckpt->sit_nat_version_bitmap + offset;
+}
+
+static inline block_t __start_cp_addr(struct f2fs_sb_info *sbi)
+{
+ block_t start_addr;
+ struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
+ unsigned long long ckpt_version = le64_to_cpu(ckpt->checkpoint_ver);
+
+ start_addr = le64_to_cpu(F2FS_RAW_SUPER(sbi)->cp_blkaddr);
+
+ /*
+ * odd numbered checkpoint should at cp segment 0
+ * and even segent must be at cp segment 1
+ */
+ if (!(ckpt_version & 1))
+ start_addr += sbi->blocks_per_seg;
+
+ return start_addr;
+}
+
+static inline block_t __start_sum_addr(struct f2fs_sb_info *sbi)
+{
+ return le32_to_cpu(F2FS_CKPT(sbi)->cp_pack_start_sum);
+}
+
+static inline bool inc_valid_node_count(struct f2fs_sb_info *sbi,
+ struct inode *inode,
+ unsigned int count)
+{
+ block_t valid_block_count;
+ unsigned int valid_node_count;
+
+ spin_lock(&sbi->stat_lock);
+
+ valid_block_count = sbi->total_valid_block_count + (block_t)count;
+ sbi->alloc_valid_block_count += (block_t)count;
+ valid_node_count = sbi->total_valid_node_count + count;
+
+ if (valid_block_count > sbi->user_block_count) {
+ spin_unlock(&sbi->stat_lock);
+ return false;
+ }
+
+ if (valid_node_count > sbi->total_node_count) {
+ spin_unlock(&sbi->stat_lock);
+ return false;
+ }
+
+ if (inode)
+ inode->i_blocks += count;
+ sbi->total_valid_node_count = valid_node_count;
+ sbi->total_valid_block_count = valid_block_count;
+ spin_unlock(&sbi->stat_lock);
+
+ return true;
+}
+
+static inline void dec_valid_node_count(struct f2fs_sb_info *sbi,
+ struct inode *inode,
+ unsigned int count)
+{
+ spin_lock(&sbi->stat_lock);
+
+ BUG_ON(sbi->total_valid_block_count < count);
+ BUG_ON(sbi->total_valid_node_count < count);
+ BUG_ON(inode->i_blocks < count);
+
+ inode->i_blocks -= count;
+ sbi->total_valid_node_count -= count;
+ sbi->total_valid_block_count -= (block_t)count;
+
+ spin_unlock(&sbi->stat_lock);
+}
+
+static inline unsigned int valid_node_count(struct f2fs_sb_info *sbi)
+{
+ unsigned int ret;
+ spin_lock(&sbi->stat_lock);
+ ret = sbi->total_valid_node_count;
+ spin_unlock(&sbi->stat_lock);
+ return ret;
+}
+
+static inline void inc_valid_inode_count(struct f2fs_sb_info *sbi)
+{
+ spin_lock(&sbi->stat_lock);
+ BUG_ON(sbi->total_valid_inode_count == sbi->total_node_count);
+ sbi->total_valid_inode_count++;
+ spin_unlock(&sbi->stat_lock);
+}
+
+static inline int dec_valid_inode_count(struct f2fs_sb_info *sbi)
+{
+ spin_lock(&sbi->stat_lock);
+ BUG_ON(!sbi->total_valid_inode_count);
+ sbi->total_valid_inode_count--;
+ spin_unlock(&sbi->stat_lock);
+ return 0;
+}
+
+static inline unsigned int valid_inode_count(struct f2fs_sb_info *sbi)
+{
+ unsigned int ret;
+ spin_lock(&sbi->stat_lock);
+ ret = sbi->total_valid_inode_count;
+ spin_unlock(&sbi->stat_lock);
+ return ret;
+}
+
+static inline void f2fs_put_page(struct page *page, int unlock)
+{
+ if (!page || IS_ERR(page))
+ return;
+
+ if (unlock) {
+ BUG_ON(!PageLocked(page));
+ unlock_page(page);
+ }
+ page_cache_release(page);
+}
+
+static inline void f2fs_put_dnode(struct dnode_of_data *dn)
+{
+ if (dn->node_page)
+ f2fs_put_page(dn->node_page, 1);
+ if (dn->inode_page && dn->node_page != dn->inode_page)
+ f2fs_put_page(dn->inode_page, 0);
+ dn->node_page = NULL;
+ dn->inode_page = NULL;
+}
+
+static inline struct kmem_cache *f2fs_kmem_cache_create(const char *name,
+ size_t size, void (*ctor)(void *))
+{
+ return kmem_cache_create(name, size, 0, SLAB_RECLAIM_ACCOUNT, ctor);
+}
+
+#define RAW_IS_INODE(p) ((p)->footer.nid == (p)->footer.ino)
+
+static inline bool IS_INODE(struct page *page)
+{
+ struct f2fs_node *p = (struct f2fs_node *)page_address(page);
+ return RAW_IS_INODE(p);
+}
+
+static inline __le32 *blkaddr_in_node(struct f2fs_node *node)
+{
+ return RAW_IS_INODE(node) ? node->i.i_addr : node->dn.addr;
+}
+
+static inline block_t datablock_addr(struct page *node_page,
+ unsigned int offset)
+{
+ struct f2fs_node *raw_node;
+ __le32 *addr_array;
+ raw_node = (struct f2fs_node *)page_address(node_page);
+ addr_array = blkaddr_in_node(raw_node);
+ return le32_to_cpu(addr_array[offset]);
+}
+
+static inline int f2fs_test_bit(unsigned int nr, char *addr)
+{
+ int mask;
+
+ addr += (nr >> 3);
+ mask = 1 << (7 - (nr & 0x07));
+ return mask & *addr;
+}
+
+static inline int f2fs_set_bit(unsigned int nr, char *addr)
+{
+ int mask;
+ int ret;
+
+ addr += (nr >> 3);
+ mask = 1 << (7 - (nr & 0x07));
+ ret = mask & *addr;
+ *addr |= mask;
+ return ret;
+}
+
+static inline int f2fs_clear_bit(unsigned int nr, char *addr)
+{
+ int mask;
+ int ret;
+
+ addr += (nr >> 3);
+ mask = 1 << (7 - (nr & 0x07));
+ ret = mask & *addr;
+ *addr &= ~mask;
+ return ret;
+}
+
+/* used for f2fs_inode_info->flags */
+enum {
+ FI_NEW_INODE, /* indicate newly allocated inode */
+ FI_NEED_CP, /* need to do checkpoint during fsync */
+ FI_INC_LINK, /* need to increment i_nlink */
+ FI_ACL_MODE, /* indicate acl mode */
+ FI_NO_ALLOC, /* should not allocate any blocks */
+};
+
+static inline void set_inode_flag(struct f2fs_inode_info *fi, int flag)
+{
+ set_bit(flag, &fi->flags);
+}
+
+static inline int is_inode_flag_set(struct f2fs_inode_info *fi, int flag)
+{
+ return test_bit(flag, &fi->flags);
+}
+
+static inline void clear_inode_flag(struct f2fs_inode_info *fi, int flag)
+{
+ clear_bit(flag, &fi->flags);
+}
+
+static inline void set_acl_inode(struct f2fs_inode_info *fi, umode_t mode)
+{
+ fi->i_acl_mode = mode;
+ set_inode_flag(fi, FI_ACL_MODE);
+}
+
+static inline int cond_clear_inode_flag(struct f2fs_inode_info *fi, int flag)
+{
+ if (is_inode_flag_set(fi, FI_ACL_MODE)) {
+ clear_inode_flag(fi, FI_ACL_MODE);
+ return 1;
+ }
+ return 0;
+}
+
+/*
+ * file.c
+ */
+int f2fs_sync_file(struct file *, loff_t, loff_t, int);
+void truncate_data_blocks(struct dnode_of_data *);
+void f2fs_truncate(struct inode *);
+int f2fs_setattr(struct dentry *, struct iattr *);
+int truncate_hole(struct inode *, pgoff_t, pgoff_t);
+long f2fs_ioctl(struct file *, unsigned int, unsigned long);
+
+/*
+ * inode.c
+ */
+void f2fs_set_inode_flags(struct inode *);
+struct inode *f2fs_iget_nowait(struct super_block *, unsigned long);
+struct inode *f2fs_iget(struct super_block *, unsigned long);
+void update_inode(struct inode *, struct page *);
+int f2fs_write_inode(struct inode *, struct writeback_control *);
+void f2fs_evict_inode(struct inode *);
+
+/*
+ * namei.c
+ */
+struct dentry *f2fs_get_parent(struct dentry *child);
+
+/*
+ * dir.c
+ */
+struct f2fs_dir_entry *f2fs_find_entry(struct inode *, struct qstr *,
+ struct page **);
+struct f2fs_dir_entry *f2fs_parent_dir(struct inode *, struct page **);
+ino_t f2fs_inode_by_name(struct inode *, struct qstr *);
+void f2fs_set_link(struct inode *, struct f2fs_dir_entry *,
+ struct page *, struct inode *);
+void init_dent_inode(struct dentry *, struct page *);
+int f2fs_add_link(struct dentry *, struct inode *);
+void f2fs_delete_entry(struct f2fs_dir_entry *, struct page *, struct inode *);
+int f2fs_make_empty(struct inode *, struct inode *);
+bool f2fs_empty_dir(struct inode *);
+
+/*
+ * super.c
+ */
+int f2fs_sync_fs(struct super_block *, int);
+
+/*
+ * hash.c
+ */
+f2fs_hash_t f2fs_dentry_hash(const char *, int);
+
+/*
+ * node.c
+ */
+struct dnode_of_data;
+struct node_info;
+
+int is_checkpointed_node(struct f2fs_sb_info *, nid_t);
+void get_node_info(struct f2fs_sb_info *, nid_t, struct node_info *);
+int get_dnode_of_data(struct dnode_of_data *, pgoff_t, int);
+int truncate_inode_blocks(struct inode *, pgoff_t);
+int remove_inode_page(struct inode *);
+int new_inode_page(struct inode *, struct dentry *);
+struct page *new_node_page(struct dnode_of_data *, unsigned int);
+void ra_node_page(struct f2fs_sb_info *, nid_t);
+struct page *get_node_page(struct f2fs_sb_info *, pgoff_t);
+struct page *get_node_page_ra(struct page *, int);
+void sync_inode_page(struct dnode_of_data *);
+int sync_node_pages(struct f2fs_sb_info *, nid_t, struct writeback_control *);
+bool alloc_nid(struct f2fs_sb_info *, nid_t *);
+void alloc_nid_done(struct f2fs_sb_info *, nid_t);
+void alloc_nid_failed(struct f2fs_sb_info *, nid_t);
+void recover_node_page(struct f2fs_sb_info *, struct page *,
+ struct f2fs_summary *, struct node_info *, block_t);
+int recover_inode_page(struct f2fs_sb_info *, struct page *);
+int restore_node_summary(struct f2fs_sb_info *, unsigned int,
+ struct f2fs_summary_block *);
+void flush_nat_entries(struct f2fs_sb_info *);
+int build_node_manager(struct f2fs_sb_info *);
+void destroy_node_manager(struct f2fs_sb_info *);
+int create_node_manager_caches(void);
+void destroy_node_manager_caches(void);
+
+/*
+ * segment.c
+ */
+void f2fs_balance_fs(struct f2fs_sb_info *);
+void invalidate_blocks(struct f2fs_sb_info *, block_t);
+void locate_dirty_segment(struct f2fs_sb_info *, unsigned int);
+void clear_prefree_segments(struct f2fs_sb_info *);
+int npages_for_summary_flush(struct f2fs_sb_info *);
+void allocate_new_segments(struct f2fs_sb_info *);
+struct page *get_sum_page(struct f2fs_sb_info *, unsigned int);
+struct bio *f2fs_bio_alloc(struct block_device *, sector_t, int, gfp_t);
+void f2fs_submit_bio(struct f2fs_sb_info *, enum page_type, bool sync);
+int write_meta_page(struct f2fs_sb_info *, struct page *,
+ struct writeback_control *);
+void write_node_page(struct f2fs_sb_info *, struct page *, unsigned int,
+ block_t, block_t *);
+void write_data_page(struct inode *, struct page *, struct dnode_of_data*,
+ block_t, block_t *);
+void rewrite_data_page(struct f2fs_sb_info *, struct page *, block_t);
+void recover_data_page(struct f2fs_sb_info *, struct page *,
+ struct f2fs_summary *, block_t, block_t);
+void rewrite_node_page(struct f2fs_sb_info *, struct page *,
+ struct f2fs_summary *, block_t, block_t);
+void write_data_summaries(struct f2fs_sb_info *, block_t);
+void write_node_summaries(struct f2fs_sb_info *, block_t);
+int lookup_journal_in_cursum(struct f2fs_summary_block *,
+ int, unsigned int, int);
+void flush_sit_entries(struct f2fs_sb_info *);
+int build_segment_manager(struct f2fs_sb_info *);
+void reset_victim_segmap(struct f2fs_sb_info *);
+void destroy_segment_manager(struct f2fs_sb_info *);
+
+/*
+ * checkpoint.c
+ */
+struct page *grab_meta_page(struct f2fs_sb_info *, pgoff_t);
+struct page *get_meta_page(struct f2fs_sb_info *, pgoff_t);
+long sync_meta_pages(struct f2fs_sb_info *, enum page_type, long);
+int check_orphan_space(struct f2fs_sb_info *);
+void add_orphan_inode(struct f2fs_sb_info *, nid_t);
+void remove_orphan_inode(struct f2fs_sb_info *, nid_t);
+int recover_orphan_inodes(struct f2fs_sb_info *);
+int get_valid_checkpoint(struct f2fs_sb_info *);
+void set_dirty_dir_page(struct inode *, struct page *);
+void remove_dirty_dir_inode(struct inode *);
+void sync_dirty_dir_inodes(struct f2fs_sb_info *);
+void block_operations(struct f2fs_sb_info *);
+void write_checkpoint(struct f2fs_sb_info *, bool, bool);
+void init_orphan_info(struct f2fs_sb_info *);
+int create_checkpoint_caches(void);
+void destroy_checkpoint_caches(void);
+
+/*
+ * data.c
+ */
+int reserve_new_block(struct dnode_of_data *);
+void update_extent_cache(block_t, struct dnode_of_data *);
+struct page *find_data_page(struct inode *, pgoff_t);
+struct page *get_lock_data_page(struct inode *, pgoff_t);
+struct page *get_new_data_page(struct inode *, pgoff_t, bool);
+int f2fs_readpage(struct f2fs_sb_info *, struct page *, block_t, int);
+int do_write_data_page(struct page *);
+
+/*
+ * gc.c
+ */
+int start_gc_thread(struct f2fs_sb_info *);
+void stop_gc_thread(struct f2fs_sb_info *);
+block_t start_bidx_of_node(unsigned int);
+int f2fs_gc(struct f2fs_sb_info *, int);
+void build_gc_manager(struct f2fs_sb_info *);
+int create_gc_caches(void);
+void destroy_gc_caches(void);
+
+/*
+ * recovery.c
+ */
+void recover_fsync_data(struct f2fs_sb_info *);
+bool space_for_roll_forward(struct f2fs_sb_info *);
+
+/*
+ * debug.c
+ */
+#ifdef CONFIG_F2FS_STAT_FS
+struct f2fs_stat_info {
+ struct list_head stat_list;
+ struct f2fs_sb_info *sbi;
+ struct mutex stat_lock;
+ int all_area_segs, sit_area_segs, nat_area_segs, ssa_area_segs;
+ int main_area_segs, main_area_sections, main_area_zones;
+ int hit_ext, total_ext;
+ int ndirty_node, ndirty_dent, ndirty_dirs, ndirty_meta;
+ int nats, sits, fnids;
+ int total_count, utilization;
+ int bg_gc;
+ unsigned int valid_count, valid_node_count, valid_inode_count;
+ unsigned int bimodal, avg_vblocks;
+ int util_free, util_valid, util_invalid;
+ int rsvd_segs, overp_segs;
+ int dirty_count, node_pages, meta_pages;
+ int prefree_count, call_count;
+ int tot_segs, node_segs, data_segs, free_segs, free_secs;
+ int tot_blks, data_blks, node_blks;
+ int curseg[NR_CURSEG_TYPE];
+ int cursec[NR_CURSEG_TYPE];
+ int curzone[NR_CURSEG_TYPE];
+
+ unsigned int segment_count[2];
+ unsigned int block_count[2];
+ unsigned base_mem, cache_mem;
+};
+
+#define stat_inc_call_count(si) ((si)->call_count++)
+
+#define stat_inc_seg_count(sbi, type) \
+ do { \
+ struct f2fs_stat_info *si = sbi->stat_info; \
+ (si)->tot_segs++; \
+ if (type == SUM_TYPE_DATA) \
+ si->data_segs++; \
+ else \
+ si->node_segs++; \
+ } while (0)
+
+#define stat_inc_tot_blk_count(si, blks) \
+ (si->tot_blks += (blks))
+
+#define stat_inc_data_blk_count(sbi, blks) \
+ do { \
+ struct f2fs_stat_info *si = sbi->stat_info; \
+ stat_inc_tot_blk_count(si, blks); \
+ si->data_blks += (blks); \
+ } while (0)
+
+#define stat_inc_node_blk_count(sbi, blks) \
+ do { \
+ struct f2fs_stat_info *si = sbi->stat_info; \
+ stat_inc_tot_blk_count(si, blks); \
+ si->node_blks += (blks); \
+ } while (0)
+
+int f2fs_build_stats(struct f2fs_sb_info *);
+void f2fs_destroy_stats(struct f2fs_sb_info *);
+void destroy_root_stats(void);
+#else
+#define stat_inc_call_count(si)
+#define stat_inc_seg_count(si, type)
+#define stat_inc_tot_blk_count(si, blks)
+#define stat_inc_data_blk_count(si, blks)
+#define stat_inc_node_blk_count(sbi, blks)
+
+static inline int f2fs_build_stats(struct f2fs_sb_info *sbi) { return 0; }
+static inline void f2fs_destroy_stats(struct f2fs_sb_info *sbi) { }
+static inline void destroy_root_stats(void) { }
+#endif
+
+extern const struct file_operations f2fs_dir_operations;
+extern const struct file_operations f2fs_file_operations;
+extern const struct inode_operations f2fs_file_inode_operations;
+extern const struct address_space_operations f2fs_dblock_aops;
+extern const struct address_space_operations f2fs_node_aops;
+extern const struct address_space_operations f2fs_meta_aops;
+extern const struct inode_operations f2fs_dir_inode_operations;
+extern const struct inode_operations f2fs_symlink_inode_operations;
+extern const struct inode_operations f2fs_special_inode_operations;
+#endif
diff --git a/fs/f2fs/node.h b/fs/f2fs/node.h
new file mode 100644
index 0000000..5d525ed
--- /dev/null
+++ b/fs/f2fs/node.h
@@ -0,0 +1,353 @@
+/**
+ * fs/f2fs/node.h
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+/* start node id of a node block dedicated to the given node id */
+#define START_NID(nid) ((nid / NAT_ENTRY_PER_BLOCK) * NAT_ENTRY_PER_BLOCK)
+
+/* node block offset on the NAT area dedicated to the given start node id */
+#define NAT_BLOCK_OFFSET(start_nid) (start_nid / NAT_ENTRY_PER_BLOCK)
+
+/* # of pages to perform readahead before building free nids */
+#define FREE_NID_PAGES 4
+
+/* maximum # of free node ids to produce during build_free_nids */
+#define MAX_FREE_NIDS (NAT_ENTRY_PER_BLOCK * FREE_NID_PAGES)
+
+/* maximum readahead size for node during getting data blocks */
+#define MAX_RA_NODE 128
+
+/* maximum cached nat entries to manage memory footprint */
+#define NM_WOUT_THRESHOLD (64 * NAT_ENTRY_PER_BLOCK)
+
+/* vector size for gang look-up from nat cache that consists of radix tree */
+#define NATVEC_SIZE 64
+
+/*
+ * For node information
+ */
+struct node_info {
+ nid_t nid; /* node id */
+ nid_t ino; /* inode number of the node's owner */
+ block_t blk_addr; /* block address of the node */
+ unsigned char version; /* version of the node */
+};
+
+struct nat_entry {
+ struct list_head list; /* for clean or dirty nat list */
+ bool checkpointed; /* whether it is checkpointed or not */
+ struct node_info ni; /* in-memory node information */
+};
+
+#define nat_get_nid(nat) (nat->ni.nid)
+#define nat_set_nid(nat, n) (nat->ni.nid = n)
+#define nat_get_blkaddr(nat) (nat->ni.blk_addr)
+#define nat_set_blkaddr(nat, b) (nat->ni.blk_addr = b)
+#define nat_get_ino(nat) (nat->ni.ino)
+#define nat_set_ino(nat, i) (nat->ni.ino = i)
+#define nat_get_version(nat) (nat->ni.version)
+#define nat_set_version(nat, v) (nat->ni.version = v)
+
+#define __set_nat_cache_dirty(nm_i, ne) \
+ list_move_tail(&ne->list, &nm_i->dirty_nat_entries);
+#define __clear_nat_cache_dirty(nm_i, ne) \
+ list_move_tail(&ne->list, &nm_i->nat_entries);
+#define inc_node_version(version) (++version)
+
+static inline void node_info_from_raw_nat(struct node_info *ni,
+ struct f2fs_nat_entry *raw_ne)
+{
+ ni->ino = le32_to_cpu(raw_ne->ino);
+ ni->blk_addr = le32_to_cpu(raw_ne->block_addr);
+ ni->version = raw_ne->version;
+}
+
+/*
+ * For free nid mangement
+ */
+enum nid_state {
+ NID_NEW, /* newly added to free nid list */
+ NID_ALLOC /* it is allocated */
+};
+
+struct free_nid {
+ struct list_head list; /* for free node id list */
+ nid_t nid; /* node id */
+ int state; /* in use or not: NID_NEW or NID_ALLOC */
+};
+
+static inline int next_free_nid(struct f2fs_sb_info *sbi, nid_t *nid)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+ struct free_nid *fnid;
+
+ if (nm_i->fcnt <= 0)
+ return -1;
+ spin_lock(&nm_i->free_nid_list_lock);
+ fnid = list_entry(nm_i->free_nid_list.next, struct free_nid, list);
+ *nid = fnid->nid;
+ spin_unlock(&nm_i->free_nid_list_lock);
+ return 0;
+}
+
+/*
+ * inline functions
+ */
+static inline void get_nat_bitmap(struct f2fs_sb_info *sbi, void *addr)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+ memcpy(addr, nm_i->nat_bitmap, nm_i->bitmap_size);
+}
+
+static inline pgoff_t current_nat_addr(struct f2fs_sb_info *sbi, nid_t start)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+ pgoff_t block_off;
+ pgoff_t block_addr;
+ int seg_off;
+
+ block_off = NAT_BLOCK_OFFSET(start);
+ seg_off = block_off >> sbi->log_blocks_per_seg;
+
+ block_addr = (pgoff_t)(nm_i->nat_blkaddr +
+ (seg_off << sbi->log_blocks_per_seg << 1) +
+ (block_off & ((1 << sbi->log_blocks_per_seg) - 1)));
+
+ if (f2fs_test_bit(block_off, nm_i->nat_bitmap))
+ block_addr += sbi->blocks_per_seg;
+
+ return block_addr;
+}
+
+static inline pgoff_t next_nat_addr(struct f2fs_sb_info *sbi,
+ pgoff_t block_addr)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+
+ block_addr -= nm_i->nat_blkaddr;
+ if ((block_addr >> sbi->log_blocks_per_seg) % 2)
+ block_addr -= sbi->blocks_per_seg;
+ else
+ block_addr += sbi->blocks_per_seg;
+
+ return block_addr + nm_i->nat_blkaddr;
+}
+
+static inline void set_to_next_nat(struct f2fs_nm_info *nm_i, nid_t start_nid)
+{
+ unsigned int block_off = NAT_BLOCK_OFFSET(start_nid);
+
+ if (f2fs_test_bit(block_off, nm_i->nat_bitmap))
+ f2fs_clear_bit(block_off, nm_i->nat_bitmap);
+ else
+ f2fs_set_bit(block_off, nm_i->nat_bitmap);
+}
+
+static inline void fill_node_footer(struct page *page, nid_t nid,
+ nid_t ino, unsigned int ofs, bool reset)
+{
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ if (reset)
+ memset(rn, 0, sizeof(*rn));
+ rn->footer.nid = cpu_to_le32(nid);
+ rn->footer.ino = cpu_to_le32(ino);
+ rn->footer.flag = cpu_to_le32(ofs << OFFSET_BIT_SHIFT);
+}
+
+static inline void copy_node_footer(struct page *dst, struct page *src)
+{
+ void *src_addr = page_address(src);
+ void *dst_addr = page_address(dst);
+ struct f2fs_node *src_rn = (struct f2fs_node *)src_addr;
+ struct f2fs_node *dst_rn = (struct f2fs_node *)dst_addr;
+ memcpy(&dst_rn->footer, &src_rn->footer, sizeof(struct node_footer));
+}
+
+static inline void fill_node_footer_blkaddr(struct page *page, block_t blkaddr)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(page->mapping->host->i_sb);
+ struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ rn->footer.cp_ver = ckpt->checkpoint_ver;
+ rn->footer.next_blkaddr = blkaddr;
+}
+
+static inline nid_t ino_of_node(struct page *node_page)
+{
+ void *kaddr = page_address(node_page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ return le32_to_cpu(rn->footer.ino);
+}
+
+static inline nid_t nid_of_node(struct page *node_page)
+{
+ void *kaddr = page_address(node_page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ return le32_to_cpu(rn->footer.nid);
+}
+
+static inline unsigned int ofs_of_node(struct page *node_page)
+{
+ void *kaddr = page_address(node_page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ unsigned flag = le32_to_cpu(rn->footer.flag);
+ return flag >> OFFSET_BIT_SHIFT;
+}
+
+static inline unsigned long long cpver_of_node(struct page *node_page)
+{
+ void *kaddr = page_address(node_page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ return le64_to_cpu(rn->footer.cp_ver);
+}
+
+static inline block_t next_blkaddr_of_node(struct page *node_page)
+{
+ void *kaddr = page_address(node_page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ return le32_to_cpu(rn->footer.next_blkaddr);
+}
+
+/*
+ * f2fs assigns the following node offsets described as (num).
+ * N = NIDS_PER_BLOCK
+ *
+ * Inode block (0)
+ * |- direct node (1)
+ * |- direct node (2)
+ * |- indirect node (3)
+ * | `- direct node (4 => 4 + N - 1)
+ * |- indirect node (4 + N)
+ * | `- direct node (5 + N => 5 + 2N - 1)
+ * `- double indirect node (5 + 2N)
+ * `- indirect node (6 + 2N)
+ * `- direct node (x(N + 1))
+ */
+static inline bool IS_DNODE(struct page *node_page)
+{
+ unsigned int ofs = ofs_of_node(node_page);
+ if (ofs == 3 || ofs == 4 + NIDS_PER_BLOCK ||
+ ofs == 5 + 2 * NIDS_PER_BLOCK)
+ return false;
+ if (ofs >= 6 + 2 * NIDS_PER_BLOCK) {
+ ofs -= 6 + 2 * NIDS_PER_BLOCK;
+ if ((long int)ofs % (NIDS_PER_BLOCK + 1))
+ return false;
+ }
+ return true;
+}
+
+static inline void set_nid(struct page *p, int off, nid_t nid, bool i)
+{
+ struct f2fs_node *rn = (struct f2fs_node *)page_address(p);
+
+ wait_on_page_writeback(p);
+
+ if (i)
+ rn->i.i_nid[off - NODE_DIR1_BLOCK] = cpu_to_le32(nid);
+ else
+ rn->in.nid[off] = cpu_to_le32(nid);
+ set_page_dirty(p);
+}
+
+static inline nid_t get_nid(struct page *p, int off, bool i)
+{
+ struct f2fs_node *rn = (struct f2fs_node *)page_address(p);
+ if (i)
+ return le32_to_cpu(rn->i.i_nid[off - NODE_DIR1_BLOCK]);
+ return le32_to_cpu(rn->in.nid[off]);
+}
+
+/*
+ * Coldness identification:
+ * - Mark cold files in f2fs_inode_info
+ * - Mark cold node blocks in their node footer
+ * - Mark cold data pages in page cache
+ */
+static inline int is_cold_file(struct inode *inode)
+{
+ return F2FS_I(inode)->i_advise & FADVISE_COLD_BIT;
+}
+
+static inline int is_cold_data(struct page *page)
+{
+ return PageChecked(page);
+}
+
+static inline void set_cold_data(struct page *page)
+{
+ SetPageChecked(page);
+}
+
+static inline void clear_cold_data(struct page *page)
+{
+ ClearPageChecked(page);
+}
+
+static inline int is_cold_node(struct page *page)
+{
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ unsigned int flag = le32_to_cpu(rn->footer.flag);
+ return flag & (0x1 << COLD_BIT_SHIFT);
+}
+
+static inline unsigned char is_fsync_dnode(struct page *page)
+{
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ unsigned int flag = le32_to_cpu(rn->footer.flag);
+ return flag & (0x1 << FSYNC_BIT_SHIFT);
+}
+
+static inline unsigned char is_dent_dnode(struct page *page)
+{
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ unsigned int flag = le32_to_cpu(rn->footer.flag);
+ return flag & (0x1 << DENT_BIT_SHIFT);
+}
+
+static inline void set_cold_node(struct inode *inode, struct page *page)
+{
+ struct f2fs_node *rn = (struct f2fs_node *)page_address(page);
+ unsigned int flag = le32_to_cpu(rn->footer.flag);
+
+ if (S_ISDIR(inode->i_mode))
+ flag &= ~(0x1 << COLD_BIT_SHIFT);
+ else
+ flag |= (0x1 << COLD_BIT_SHIFT);
+ rn->footer.flag = cpu_to_le32(flag);
+}
+
+static inline void set_fsync_mark(struct page *page, int mark)
+{
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ unsigned int flag = le32_to_cpu(rn->footer.flag);
+ if (mark)
+ flag |= (0x1 << FSYNC_BIT_SHIFT);
+ else
+ flag &= ~(0x1 << FSYNC_BIT_SHIFT);
+ rn->footer.flag = cpu_to_le32(flag);
+}
+
+static inline void set_dentry_mark(struct page *page, int mark)
+{
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ unsigned int flag = le32_to_cpu(rn->footer.flag);
+ if (mark)
+ flag |= (0x1 << DENT_BIT_SHIFT);
+ else
+ flag &= ~(0x1 << DENT_BIT_SHIFT);
+ rn->footer.flag = cpu_to_le32(flag);
+}
diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h
new file mode 100644
index 0000000..e380a8e
--- /dev/null
+++ b/fs/f2fs/segment.h
@@ -0,0 +1,615 @@
+/**
+ * fs/f2fs/segment.h
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+/* constant macro */
+#define NULL_SEGNO ((unsigned int)(~0))
+
+/* V: Logical segment # in volume, R: Relative segment # in main area */
+#define GET_L2R_SEGNO(free_i, segno) (segno - free_i->start_segno)
+#define GET_R2L_SEGNO(free_i, segno) (segno + free_i->start_segno)
+
+#define IS_DATASEG(t) \
+ ((t == CURSEG_HOT_DATA) || (t == CURSEG_COLD_DATA) || \
+ (t == CURSEG_WARM_DATA))
+
+#define IS_NODESEG(t) \
+ ((t == CURSEG_HOT_NODE) || (t == CURSEG_COLD_NODE) || \
+ (t == CURSEG_WARM_NODE))
+
+#define IS_CURSEG(sbi, segno) \
+ ((segno == CURSEG_I(sbi, CURSEG_HOT_DATA)->segno) || \
+ (segno == CURSEG_I(sbi, CURSEG_WARM_DATA)->segno) || \
+ (segno == CURSEG_I(sbi, CURSEG_COLD_DATA)->segno) || \
+ (segno == CURSEG_I(sbi, CURSEG_HOT_NODE)->segno) || \
+ (segno == CURSEG_I(sbi, CURSEG_WARM_NODE)->segno) || \
+ (segno == CURSEG_I(sbi, CURSEG_COLD_NODE)->segno))
+
+#define IS_CURSEC(sbi, secno) \
+ ((secno == CURSEG_I(sbi, CURSEG_HOT_DATA)->segno / \
+ sbi->segs_per_sec) || \
+ (secno == CURSEG_I(sbi, CURSEG_WARM_DATA)->segno / \
+ sbi->segs_per_sec) || \
+ (secno == CURSEG_I(sbi, CURSEG_COLD_DATA)->segno / \
+ sbi->segs_per_sec) || \
+ (secno == CURSEG_I(sbi, CURSEG_HOT_NODE)->segno / \
+ sbi->segs_per_sec) || \
+ (secno == CURSEG_I(sbi, CURSEG_WARM_NODE)->segno / \
+ sbi->segs_per_sec) || \
+ (secno == CURSEG_I(sbi, CURSEG_COLD_NODE)->segno / \
+ sbi->segs_per_sec)) \
+
+#define START_BLOCK(sbi, segno) \
+ (SM_I(sbi)->seg0_blkaddr + \
+ (GET_R2L_SEGNO(FREE_I(sbi), segno) << sbi->log_blocks_per_seg))
+#define NEXT_FREE_BLKADDR(sbi, curseg) \
+ (START_BLOCK(sbi, curseg->segno) + curseg->next_blkoff)
+
+#define MAIN_BASE_BLOCK(sbi) (SM_I(sbi)->main_blkaddr)
+
+#define GET_SEGOFF_FROM_SEG0(sbi, blk_addr) \
+ ((blk_addr) - SM_I(sbi)->seg0_blkaddr)
+#define GET_SEGNO_FROM_SEG0(sbi, blk_addr) \
+ (GET_SEGOFF_FROM_SEG0(sbi, blk_addr) >> sbi->log_blocks_per_seg)
+#define GET_SEGNO(sbi, blk_addr) \
+ (((blk_addr == NULL_ADDR) || (blk_addr == NEW_ADDR)) ? \
+ NULL_SEGNO : GET_L2R_SEGNO(FREE_I(sbi), \
+ GET_SEGNO_FROM_SEG0(sbi, blk_addr)))
+#define GET_SECNO(sbi, segno) \
+ ((segno) / sbi->segs_per_sec)
+#define GET_ZONENO_FROM_SEGNO(sbi, segno) \
+ ((segno / sbi->segs_per_sec) / sbi->secs_per_zone)
+
+#define GET_SUM_BLOCK(sbi, segno) \
+ ((sbi->sm_info->ssa_blkaddr) + segno)
+
+#define GET_SUM_TYPE(footer) ((footer)->entry_type)
+#define SET_SUM_TYPE(footer, type) ((footer)->entry_type = type)
+
+#define SIT_ENTRY_OFFSET(sit_i, segno) \
+ (segno % sit_i->sents_per_block)
+#define SIT_BLOCK_OFFSET(sit_i, segno) \
+ (segno / SIT_ENTRY_PER_BLOCK)
+#define START_SEGNO(sit_i, segno) \
+ (SIT_BLOCK_OFFSET(sit_i, segno) * SIT_ENTRY_PER_BLOCK)
+#define f2fs_bitmap_size(nr) \
+ (BITS_TO_LONGS(nr) * sizeof(unsigned long))
+#define TOTAL_SEGS(sbi) (SM_I(sbi)->main_segments)
+
+/* during checkpoint, bio_private is used to synchronize the last bio */
+struct bio_private {
+ struct f2fs_sb_info *sbi;
+ bool is_sync;
+ void *wait;
+};
+
+/*
+ * indicate a block allocation direction: RIGHT and LEFT.
+ * RIGHT means allocating new sections towards the end of volume.
+ * LEFT means the opposite direction.
+ */
+enum {
+ ALLOC_RIGHT = 0,
+ ALLOC_LEFT
+};
+
+/*
+ * In the victim_sel_policy->alloc_mode, there are two block allocation modes.
+ * LFS writes data sequentially with cleaning operations.
+ * SSR (Slack Space Recycle) reuses obsolete space without cleaning operations.
+ */
+enum {
+ LFS = 0,
+ SSR
+};
+
+/*
+ * In the victim_sel_policy->gc_mode, there are two gc, aka cleaning, modes.
+ * GC_CB is based on cost-benefit algorithm.
+ * GC_GREEDY is based on greedy algorithm.
+ */
+enum {
+ GC_CB = 0,
+ GC_GREEDY
+};
+
+/*
+ * BG_GC means the background cleaning job.
+ * FG_GC means the on-demand cleaning job.
+ */
+enum {
+ BG_GC = 0,
+ FG_GC
+};
+
+/* for a function parameter to select a victim segment */
+struct victim_sel_policy {
+ int alloc_mode; /* LFS or SSR */
+ int gc_mode; /* GC_CB or GC_GREEDY */
+ unsigned long *dirty_segmap; /* dirty segment bitmap */
+ unsigned int offset; /* last scanned bitmap offset */
+ unsigned int ofs_unit; /* bitmap search unit */
+ unsigned int min_cost; /* minimum cost */
+ unsigned int min_segno; /* segment # having min. cost */
+};
+
+struct seg_entry {
+ unsigned short valid_blocks; /* # of valid blocks */
+ unsigned char *cur_valid_map; /* validity bitmap of blocks */
+ /*
+ * # of valid blocks and the validity bitmap stored in the the last
+ * checkpoint pack. This information is used by the SSR mode.
+ */
+ unsigned short ckpt_valid_blocks;
+ unsigned char *ckpt_valid_map;
+ unsigned char type; /* segment type like CURSEG_XXX_TYPE */
+ unsigned long long mtime; /* modification time of the segment */
+};
+
+struct sec_entry {
+ unsigned int valid_blocks; /* # of valid blocks in a section */
+};
+
+struct segment_allocation {
+ void (*allocate_segment)(struct f2fs_sb_info *, int, bool);
+};
+
+struct sit_info {
+ const struct segment_allocation *s_ops;
+
+ block_t sit_base_addr; /* start block address of SIT area */
+ block_t sit_blocks; /* # of blocks used by SIT area */
+ block_t written_valid_blocks; /* # of valid blocks in main area */
+ char *sit_bitmap; /* SIT bitmap pointer */
+ unsigned int bitmap_size; /* SIT bitmap size */
+
+ unsigned long *dirty_sentries_bitmap; /* bitmap for dirty sentries */
+ unsigned int dirty_sentries; /* # of dirty sentries */
+ unsigned int sents_per_block; /* # of SIT entries per block */
+ struct mutex sentry_lock; /* to protect SIT cache */
+ struct seg_entry *sentries; /* SIT segment-level cache */
+ struct sec_entry *sec_entries; /* SIT section-level cache */
+
+ /* for cost-benefit algorithm in cleaning procedure */
+ unsigned long long elapsed_time; /* elapsed time after mount */
+ unsigned long long mounted_time; /* mount time */
+ unsigned long long min_mtime; /* min. modification time */
+ unsigned long long max_mtime; /* max. modification time */
+};
+
+struct free_segmap_info {
+ unsigned int start_segno; /* start segment number logically */
+ unsigned int free_segments; /* # of free segments */
+ unsigned int free_sections; /* # of free sections */
+ rwlock_t segmap_lock; /* free segmap lock */
+ unsigned long *free_segmap; /* free segment bitmap */
+ unsigned long *free_secmap; /* free section bitmap */
+};
+
+/* Notice: The order of dirty type is same with CURSEG_XXX in f2fs.h */
+enum dirty_type {
+ DIRTY_HOT_DATA, /* dirty segments assigned as hot data logs */
+ DIRTY_WARM_DATA, /* dirty segments assigned as warm data logs */
+ DIRTY_COLD_DATA, /* dirty segments assigned as cold data logs */
+ DIRTY_HOT_NODE, /* dirty segments assigned as hot node logs */
+ DIRTY_WARM_NODE, /* dirty segments assigned as warm node logs */
+ DIRTY_COLD_NODE, /* dirty segments assigned as cold node logs */
+ DIRTY, /* to count # of dirty segments */
+ PRE, /* to count # of entirely obsolete segments */
+ NR_DIRTY_TYPE
+};
+
+struct dirty_seglist_info {
+ const struct victim_selection *v_ops; /* victim selction operation */
+ unsigned long *dirty_segmap[NR_DIRTY_TYPE];
+ struct mutex seglist_lock; /* lock for segment bitmaps */
+ int nr_dirty[NR_DIRTY_TYPE]; /* # of dirty segments */
+ unsigned long *victim_segmap[2]; /* BG_GC, FG_GC */
+};
+
+/* victim selection function for cleaning and SSR */
+struct victim_selection {
+ int (*get_victim)(struct f2fs_sb_info *, unsigned int *,
+ int, int, char);
+};
+
+/* for active log information */
+struct curseg_info {
+ struct mutex curseg_mutex; /* lock for consistency */
+ struct f2fs_summary_block *sum_blk; /* cached summary block */
+ unsigned char alloc_type; /* current allocation type */
+ unsigned int segno; /* current segment number */
+ unsigned short next_blkoff; /* next block offset to write */
+ unsigned int zone; /* current zone number */
+ unsigned int next_segno; /* preallocated segment */
+};
+
+/*
+ * inline functions
+ */
+static inline struct curseg_info *CURSEG_I(struct f2fs_sb_info *sbi, int type)
+{
+ return (struct curseg_info *)(SM_I(sbi)->curseg_array + type);
+}
+
+static inline struct seg_entry *get_seg_entry(struct f2fs_sb_info *sbi,
+ unsigned int segno)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ return &sit_i->sentries[segno];
+}
+
+static inline struct sec_entry *get_sec_entry(struct f2fs_sb_info *sbi,
+ unsigned int segno)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ return &sit_i->sec_entries[GET_SECNO(sbi, segno)];
+}
+
+static inline unsigned int get_valid_blocks(struct f2fs_sb_info *sbi,
+ unsigned int segno, int section)
+{
+ /*
+ * In order to get # of valid blocks in a section instantly from many
+ * segments, f2fs manages two counting structures separately.
+ */
+ if (section > 1)
+ return get_sec_entry(sbi, segno)->valid_blocks;
+ else
+ return get_seg_entry(sbi, segno)->valid_blocks;
+}
+
+static inline void seg_info_from_raw_sit(struct seg_entry *se,
+ struct f2fs_sit_entry *rs)
+{
+ se->valid_blocks = GET_SIT_VBLOCKS(rs);
+ se->ckpt_valid_blocks = GET_SIT_VBLOCKS(rs);
+ memcpy(se->cur_valid_map, rs->valid_map, SIT_VBLOCK_MAP_SIZE);
+ memcpy(se->ckpt_valid_map, rs->valid_map, SIT_VBLOCK_MAP_SIZE);
+ se->type = GET_SIT_TYPE(rs);
+ se->mtime = le64_to_cpu(rs->mtime);
+}
+
+static inline void seg_info_to_raw_sit(struct seg_entry *se,
+ struct f2fs_sit_entry *rs)
+{
+ unsigned short raw_vblocks = (se->type << SIT_VBLOCKS_SHIFT) |
+ se->valid_blocks;
+ rs->vblocks = cpu_to_le16(raw_vblocks);
+ memcpy(rs->valid_map, se->cur_valid_map, SIT_VBLOCK_MAP_SIZE);
+ memcpy(se->ckpt_valid_map, rs->valid_map, SIT_VBLOCK_MAP_SIZE);
+ se->ckpt_valid_blocks = se->valid_blocks;
+ rs->mtime = cpu_to_le64(se->mtime);
+}
+
+static inline unsigned int find_next_inuse(struct free_segmap_info *free_i,
+ unsigned int max, unsigned int segno)
+{
+ unsigned int ret;
+ read_lock(&free_i->segmap_lock);
+ ret = find_next_bit(free_i->free_segmap, max, segno);
+ read_unlock(&free_i->segmap_lock);
+ return ret;
+}
+
+static inline void __set_free(struct f2fs_sb_info *sbi, unsigned int segno)
+{
+ struct free_segmap_info *free_i = FREE_I(sbi);
+ unsigned int secno = segno / sbi->segs_per_sec;
+ unsigned int start_segno = secno * sbi->segs_per_sec;
+ unsigned int next;
+
+ write_lock(&free_i->segmap_lock);
+ clear_bit(segno, free_i->free_segmap);
+ free_i->free_segments++;
+
+ next = find_next_bit(free_i->free_segmap, TOTAL_SEGS(sbi), start_segno);
+ if (next >= start_segno + sbi->segs_per_sec) {
+ clear_bit(secno, free_i->free_secmap);
+ free_i->free_sections++;
+ }
+ write_unlock(&free_i->segmap_lock);
+}
+
+static inline void __set_inuse(struct f2fs_sb_info *sbi,
+ unsigned int segno)
+{
+ struct free_segmap_info *free_i = FREE_I(sbi);
+ unsigned int secno = segno / sbi->segs_per_sec;
+ set_bit(segno, free_i->free_segmap);
+ free_i->free_segments--;
+ if (!test_and_set_bit(secno, free_i->free_secmap))
+ free_i->free_sections--;
+}
+
+static inline void __set_test_and_free(struct f2fs_sb_info *sbi,
+ unsigned int segno)
+{
+ struct free_segmap_info *free_i = FREE_I(sbi);
+ unsigned int secno = segno / sbi->segs_per_sec;
+ unsigned int start_segno = secno * sbi->segs_per_sec;
+ unsigned int next;
+
+ write_lock(&free_i->segmap_lock);
+ if (test_and_clear_bit(segno, free_i->free_segmap)) {
+ free_i->free_segments++;
+
+ next = find_next_bit(free_i->free_segmap, TOTAL_SEGS(sbi),
+ start_segno);
+ if (next >= start_segno + sbi->segs_per_sec) {
+ if (test_and_clear_bit(secno, free_i->free_secmap))
+ free_i->free_sections++;
+ }
+ }
+ write_unlock(&free_i->segmap_lock);
+}
+
+static inline void __set_test_and_inuse(struct f2fs_sb_info *sbi,
+ unsigned int segno)
+{
+ struct free_segmap_info *free_i = FREE_I(sbi);
+ unsigned int secno = segno / sbi->segs_per_sec;
+ write_lock(&free_i->segmap_lock);
+ if (!test_and_set_bit(segno, free_i->free_segmap)) {
+ free_i->free_segments--;
+ if (!test_and_set_bit(secno, free_i->free_secmap))
+ free_i->free_sections--;
+ }
+ write_unlock(&free_i->segmap_lock);
+}
+
+static inline void get_sit_bitmap(struct f2fs_sb_info *sbi,
+ void *dst_addr)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ memcpy(dst_addr, sit_i->sit_bitmap, sit_i->bitmap_size);
+}
+
+static inline block_t written_block_count(struct f2fs_sb_info *sbi)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ block_t vblocks;
+
+ mutex_lock(&sit_i->sentry_lock);
+ vblocks = sit_i->written_valid_blocks;
+ mutex_unlock(&sit_i->sentry_lock);
+
+ return vblocks;
+}
+
+static inline unsigned int free_segments(struct f2fs_sb_info *sbi)
+{
+ struct free_segmap_info *free_i = FREE_I(sbi);
+ unsigned int free_segs;
+
+ read_lock(&free_i->segmap_lock);
+ free_segs = free_i->free_segments;
+ read_unlock(&free_i->segmap_lock);
+
+ return free_segs;
+}
+
+static inline int reserved_segments(struct f2fs_sb_info *sbi)
+{
+ return SM_I(sbi)->reserved_segments;
+}
+
+static inline unsigned int free_sections(struct f2fs_sb_info *sbi)
+{
+ struct free_segmap_info *free_i = FREE_I(sbi);
+ unsigned int free_secs;
+
+ read_lock(&free_i->segmap_lock);
+ free_secs = free_i->free_sections;
+ read_unlock(&free_i->segmap_lock);
+
+ return free_secs;
+}
+
+static inline unsigned int prefree_segments(struct f2fs_sb_info *sbi)
+{
+ return DIRTY_I(sbi)->nr_dirty[PRE];
+}
+
+static inline unsigned int dirty_segments(struct f2fs_sb_info *sbi)
+{
+ return DIRTY_I(sbi)->nr_dirty[DIRTY_HOT_DATA] +
+ DIRTY_I(sbi)->nr_dirty[DIRTY_WARM_DATA] +
+ DIRTY_I(sbi)->nr_dirty[DIRTY_COLD_DATA] +
+ DIRTY_I(sbi)->nr_dirty[DIRTY_HOT_NODE] +
+ DIRTY_I(sbi)->nr_dirty[DIRTY_WARM_NODE] +
+ DIRTY_I(sbi)->nr_dirty[DIRTY_COLD_NODE];
+}
+
+static inline int overprovision_segments(struct f2fs_sb_info *sbi)
+{
+ return SM_I(sbi)->ovp_segments;
+}
+
+static inline int overprovision_sections(struct f2fs_sb_info *sbi)
+{
+ return ((unsigned int) overprovision_segments(sbi)) / sbi->segs_per_sec;
+}
+
+static inline int reserved_sections(struct f2fs_sb_info *sbi)
+{
+ return ((unsigned int) reserved_segments(sbi)) / sbi->segs_per_sec;
+}
+
+static inline bool need_SSR(struct f2fs_sb_info *sbi)
+{
+ return (free_sections(sbi) < overprovision_sections(sbi));
+}
+
+static inline int get_ssr_segment(struct f2fs_sb_info *sbi, int type)
+{
+ struct curseg_info *curseg = CURSEG_I(sbi, type);
+ return DIRTY_I(sbi)->v_ops->get_victim(sbi,
+ &(curseg)->next_segno, BG_GC, type, SSR);
+}
+
+static inline bool has_not_enough_free_secs(struct f2fs_sb_info *sbi)
+{
+ return free_sections(sbi) <= reserved_sections(sbi);
+}
+
+static inline int utilization(struct f2fs_sb_info *sbi)
+{
+ return (long int)valid_user_blocks(sbi) * 100 /
+ (long int)sbi->user_block_count;
+}
+
+/*
+ * Sometimes f2fs may be better to drop out-of-place update policy.
+ * So, if fs utilization is over MIN_IPU_UTIL, then f2fs tries to write
+ * data in the original place likewise other traditional file systems.
+ * But, currently set 100 in percentage, which means it is disabled.
+ * See below need_inplace_update().
+ */
+#define MIN_IPU_UTIL 100
+static inline bool need_inplace_update(struct inode *inode)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ if (S_ISDIR(inode->i_mode))
+ return false;
+ if (need_SSR(sbi) && utilization(sbi) > MIN_IPU_UTIL)
+ return true;
+ return false;
+}
+
+static inline unsigned int curseg_segno(struct f2fs_sb_info *sbi,
+ int type)
+{
+ struct curseg_info *curseg = CURSEG_I(sbi, type);
+ return curseg->segno;
+}
+
+static inline unsigned char curseg_alloc_type(struct f2fs_sb_info *sbi,
+ int type)
+{
+ struct curseg_info *curseg = CURSEG_I(sbi, type);
+ return curseg->alloc_type;
+}
+
+static inline unsigned short curseg_blkoff(struct f2fs_sb_info *sbi, int type)
+{
+ struct curseg_info *curseg = CURSEG_I(sbi, type);
+ return curseg->next_blkoff;
+}
+
+static inline void check_seg_range(struct f2fs_sb_info *sbi, unsigned int segno)
+{
+ unsigned int end_segno = SM_I(sbi)->segment_count - 1;
+ BUG_ON(segno > end_segno);
+}
+
+/*
+ * This function is used for only debugging.
+ * NOTE: In future, we have to remove this function.
+ */
+static inline void verify_block_addr(struct f2fs_sb_info *sbi, block_t blk_addr)
+{
+ struct f2fs_sm_info *sm_info = SM_I(sbi);
+ block_t total_blks = sm_info->segment_count << sbi->log_blocks_per_seg;
+ block_t start_addr = sm_info->seg0_blkaddr;
+ block_t end_addr = start_addr + total_blks - 1;
+ BUG_ON(blk_addr < start_addr);
+ BUG_ON(blk_addr > end_addr);
+}
+
+/*
+ * Summary block is always treated as invalid block
+ */
+static inline void check_block_count(struct f2fs_sb_info *sbi,
+ int segno, struct f2fs_sit_entry *raw_sit)
+{
+ struct f2fs_sm_info *sm_info = SM_I(sbi);
+ unsigned int end_segno = sm_info->segment_count - 1;
+ int valid_blocks = 0;
+ int i;
+
+ /* check segment usage */
+ BUG_ON(GET_SIT_VBLOCKS(raw_sit) > sbi->blocks_per_seg);
+
+ /* check boundary of a given segment number */
+ BUG_ON(segno > end_segno);
+
+ /* check bitmap with valid block count */
+ for (i = 0; i < sbi->blocks_per_seg; i++)
+ if (f2fs_test_bit(i, raw_sit->valid_map))
+ valid_blocks++;
+ BUG_ON(GET_SIT_VBLOCKS(raw_sit) != valid_blocks);
+}
+
+static inline pgoff_t current_sit_addr(struct f2fs_sb_info *sbi,
+ unsigned int start)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ unsigned int offset = SIT_BLOCK_OFFSET(sit_i, start);
+ block_t blk_addr = sit_i->sit_base_addr + offset;
+
+ check_seg_range(sbi, start);
+
+ /* calculate sit block address */
+ if (f2fs_test_bit(offset, sit_i->sit_bitmap))
+ blk_addr += sit_i->sit_blocks;
+
+ return blk_addr;
+}
+
+static inline pgoff_t next_sit_addr(struct f2fs_sb_info *sbi,
+ pgoff_t block_addr)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ block_addr -= sit_i->sit_base_addr;
+ if (block_addr < sit_i->sit_blocks)
+ block_addr += sit_i->sit_blocks;
+ else
+ block_addr -= sit_i->sit_blocks;
+
+ return block_addr + sit_i->sit_base_addr;
+}
+
+static inline void set_to_next_sit(struct sit_info *sit_i, unsigned int start)
+{
+ unsigned int block_off = SIT_BLOCK_OFFSET(sit_i, start);
+
+ if (f2fs_test_bit(block_off, sit_i->sit_bitmap))
+ f2fs_clear_bit(block_off, sit_i->sit_bitmap);
+ else
+ f2fs_set_bit(block_off, sit_i->sit_bitmap);
+}
+
+static inline unsigned long long get_mtime(struct f2fs_sb_info *sbi)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ return sit_i->elapsed_time + CURRENT_TIME_SEC.tv_sec -
+ sit_i->mounted_time;
+}
+
+static inline void set_summary(struct f2fs_summary *sum, nid_t nid,
+ unsigned int ofs_in_node, unsigned char version)
+{
+ sum->nid = cpu_to_le32(nid);
+ sum->ofs_in_node = cpu_to_le16(ofs_in_node);
+ sum->version = version;
+}
+
+static inline block_t start_sum_block(struct f2fs_sb_info *sbi)
+{
+ return __start_cp_addr(sbi) +
+ le32_to_cpu(F2FS_CKPT(sbi)->cp_pack_start_sum);
+}
+
+static inline block_t sum_blk_addr(struct f2fs_sb_info *sbi, int base, int type)
+{
+ return __start_cp_addr(sbi) +
+ le32_to_cpu(F2FS_CKPT(sbi)->cp_pack_total_block_count)
+ - (base + 1) + type;
+}
--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-10-31 15:51:46

by Greg Kroah-Hartman

[permalink] [raw]

Subject: Re: [PATCH 16/17] f2fs: move proc files to debugfs

On Wed, Oct 31, 2012 at 06:49:32PM +0900, Jaegeuk Kim wrote:
> This moves all of the f2fs debugging files into debugfs. The files are
> located in /sys/kernel/debug/f2fs/
>
> Note, I think we are generating all of the same information in each of
> the files for every unique f2fs filesystem in the machine. This copies
> the functionality that was present in the proc files, but this should be
> fixed up in the future.
>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
> Acked-by: Jaegeuk Kim <[email protected]>

As I wrote this, you need to keep the "From: Greg..." as the first line
in the changelog so it properly gets attributed when it gets applied.

See the file Documentation/SubmittingPatches for more details of this.

thanks,

greg k-h

2012-10-31 21:48:51

[permalink] [raw]

Subject: RE: [PATCH 16/17] f2fs: move proc files to debugfs

My apologies...
I should have examined the patch itself more carefully.
I'll resend.
Thanks,

---
Jaegeuk Kim
Samsung

> -----Original Message-----
> From: Greg KH [mailto:[email protected]]
> Sent: Thursday, November 01, 2012 12:52 AM
> To: Jaegeuk Kim
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH 16/17] f2fs: move proc files to debugfs
> Importance: High
>
> On Wed, Oct 31, 2012 at 06:49:32PM +0900, Jaegeuk Kim wrote:
> > This moves all of the f2fs debugging files into debugfs. The files are
> > located in /sys/kernel/debug/f2fs/
> >
> > Note, I think we are generating all of the same information in each of
> > the files for every unique f2fs filesystem in the machine. This copies
> > the functionality that was present in the proc files, but this should be
> > fixed up in the future.
> >
> > Signed-off-by: Greg Kroah-Hartman <[email protected]>
> > Acked-by: Jaegeuk Kim <[email protected]>
>
> As I wrote this, you need to keep the "From: Greg..." as the first line
> in the changelog so it properly gets attributed when it gets applied.
>
> See the file Documentation/SubmittingPatches for more details of this.
>
> thanks,
>
> greg k-h

2012-10-31 22:39:11

[permalink] [raw]

Subject: [PATCH 16/17 v2] f2fs: move proc files to debugfs

From: Greg Kroah-Hartman <[email protected]>

This moves all of the f2fs debugging files into debugfs. The files are
located in /sys/kernel/debug/f2fs/

Note, I think we are generating all of the same information in each of
the files for every unique f2fs filesystem in the machine. This copies
the functionality that was present in the proc files, but this should be
fixed up in the future.

Signed-off-by: Greg Kroah-Hartman <[email protected]>
[[email protected]: merged 3 debugfs entries into a *status* entry]
Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/debug.c | 361 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 361 insertions(+)
create mode 100644 fs/f2fs/debug.c

diff --git a/fs/f2fs/debug.c b/fs/f2fs/debug.c
new file mode 100644
index 0000000..a56181c
--- /dev/null
+++ b/fs/f2fs/debug.c
@@ -0,0 +1,361 @@
+/**
+ * f2fs debugging statistics
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ * Copyright (c) 2012 Linux Foundation
+ * Copyright (c) 2012 Greg Kroah-Hartman <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/backing-dev.h>
+#include <linux/proc_fs.h>
+#include <linux/f2fs_fs.h>
+#include <linux/blkdev.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+
+#include "f2fs.h"
+#include "node.h"
+#include "segment.h"
+#include "gc.h"
+
+static LIST_HEAD(f2fs_stat_list);
+static struct dentry *debugfs_root;
+
+void update_general_status(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_stat_info *si = sbi->stat_info;
+ int i;
+
+ /* valid check of the segment numbers */
+ si->hit_ext = sbi->read_hit_ext;
+ si->total_ext = sbi->total_hit_ext;
+ si->ndirty_node = get_pages(sbi, F2FS_DIRTY_NODES);
+ si->ndirty_dent = get_pages(sbi, F2FS_DIRTY_DENTS);
+ si->ndirty_dirs = sbi->n_dirty_dirs;
+ si->ndirty_meta = get_pages(sbi, F2FS_DIRTY_META);
+ si->total_count = (int)sbi->user_block_count / sbi->blocks_per_seg;
+ si->rsvd_segs = reserved_segments(sbi);
+ si->overp_segs = overprovision_segments(sbi);
+ si->valid_count = valid_user_blocks(sbi);
+ si->valid_node_count = valid_node_count(sbi);
+ si->valid_inode_count = valid_inode_count(sbi);
+ si->utilization = utilization(sbi);
+
+ si->free_segs = free_segments(sbi);
+ si->free_secs = free_sections(sbi);
+ si->prefree_count = prefree_segments(sbi);
+ si->dirty_count = dirty_segments(sbi);
+ si->node_pages = sbi->node_inode->i_mapping->nrpages;
+ si->meta_pages = sbi->meta_inode->i_mapping->nrpages;
+ si->nats = NM_I(sbi)->nat_cnt;
+ si->sits = SIT_I(sbi)->dirty_sentries;
+ si->fnids = NM_I(sbi)->fcnt;
+ si->bg_gc = sbi->bg_gc;
+ si->util_free = (int)(free_user_blocks(sbi) >> sbi->log_blocks_per_seg)
+ * 100 / (int)(sbi->user_block_count >> sbi->log_blocks_per_seg)
+ / 2;
+ si->util_valid = (int)(written_block_count(sbi) >>
+ sbi->log_blocks_per_seg)
+ * 100 / (int)(sbi->user_block_count >> sbi->log_blocks_per_seg)
+ / 2;
+ si->util_invalid = 50 - si->util_free - si->util_valid;
+ for (i = CURSEG_HOT_DATA; i <= CURSEG_COLD_NODE; i++) {
+ struct curseg_info *curseg = CURSEG_I(sbi, i);
+ si->curseg[i] = curseg->segno;
+ si->cursec[i] = curseg->segno / sbi->segs_per_sec;
+ si->curzone[i] = si->cursec[i] / sbi->secs_per_zone;
+ }
+
+ for (i = 0; i < 2; i++) {
+ si->segment_count[i] = sbi->segment_count[i];
+ si->block_count[i] = sbi->block_count[i];
+ }
+}
+
+/**
+ * This function calculates BDF of every segments
+ */
+static void update_sit_info(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_stat_info *si = sbi->stat_info;
+ unsigned int blks_per_sec, hblks_per_sec, total_vblocks, bimodal, dist;
+ struct sit_info *sit_i = SIT_I(sbi);
+ unsigned int segno, vblocks;
+ int ndirty = 0;
+
+ bimodal = 0;
+ total_vblocks = 0;
+ blks_per_sec = sbi->segs_per_sec * (1 << sbi->log_blocks_per_seg);
+ hblks_per_sec = blks_per_sec / 2;
+ mutex_lock(&sit_i->sentry_lock);
+ for (segno = 0; segno < TOTAL_SEGS(sbi); segno += sbi->segs_per_sec) {
+ vblocks = get_valid_blocks(sbi, segno, sbi->segs_per_sec);
+ dist = abs(vblocks - hblks_per_sec);
+ bimodal += dist * dist;
+
+ if (vblocks > 0 && vblocks < blks_per_sec) {
+ total_vblocks += vblocks;
+ ndirty++;
+ }
+ }
+ mutex_unlock(&sit_i->sentry_lock);
+ dist = sbi->total_sections * hblks_per_sec * hblks_per_sec / 100;
+ si->bimodal = bimodal / dist;
+ if (si->dirty_count)
+ si->avg_vblocks = total_vblocks / ndirty;
+ else
+ si->avg_vblocks = 0;
+}
+
+/**
+ * This function calculates memory footprint.
+ */
+static void update_mem_info(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_stat_info *si = sbi->stat_info;
+ unsigned npages;
+
+ if (si->base_mem)
+ goto get_cache;
+
+ si->base_mem = sizeof(struct f2fs_sb_info) + sbi->sb->s_blocksize;
+ si->base_mem += 2 * sizeof(struct f2fs_inode_info);
+ si->base_mem += sizeof(*sbi->ckpt);
+
+ /* build sm */
+ si->base_mem += sizeof(struct f2fs_sm_info);
+
+ /* build sit */
+ si->base_mem += sizeof(struct sit_info);
+ si->base_mem += TOTAL_SEGS(sbi) * sizeof(struct seg_entry);
+ si->base_mem += f2fs_bitmap_size(TOTAL_SEGS(sbi));
+ si->base_mem += 2 * SIT_VBLOCK_MAP_SIZE * TOTAL_SEGS(sbi);
+ if (sbi->segs_per_sec > 1)
+ si->base_mem += sbi->total_sections *
+ sizeof(struct sec_entry);
+ si->base_mem += __bitmap_size(sbi, SIT_BITMAP);
+
+ /* build free segmap */
+ si->base_mem += sizeof(struct free_segmap_info);
+ si->base_mem += f2fs_bitmap_size(TOTAL_SEGS(sbi));
+ si->base_mem += f2fs_bitmap_size(sbi->total_sections);
+
+ /* build curseg */
+ si->base_mem += sizeof(struct curseg_info) * NR_CURSEG_TYPE;
+ si->base_mem += PAGE_CACHE_SIZE * NR_CURSEG_TYPE;
+
+ /* build dirty segmap */
+ si->base_mem += sizeof(struct dirty_seglist_info);
+ si->base_mem += NR_DIRTY_TYPE * f2fs_bitmap_size(TOTAL_SEGS(sbi));
+ si->base_mem += 2 * f2fs_bitmap_size(TOTAL_SEGS(sbi));
+
+ /* buld nm */
+ si->base_mem += sizeof(struct f2fs_nm_info);
+ si->base_mem += __bitmap_size(sbi, NAT_BITMAP);
+
+ /* build gc */
+ si->base_mem += sizeof(struct f2fs_gc_kthread);
+
+get_cache:
+ /* free nids */
+ si->cache_mem = NM_I(sbi)->fcnt;
+ si->cache_mem += NM_I(sbi)->nat_cnt;
+ npages = sbi->node_inode->i_mapping->nrpages;
+ si->cache_mem += npages << PAGE_CACHE_SHIFT;
+ npages = sbi->meta_inode->i_mapping->nrpages;
+ si->cache_mem += npages << PAGE_CACHE_SHIFT;
+ si->cache_mem += sbi->n_orphans * sizeof(struct orphan_inode_entry);
+ si->cache_mem += sbi->n_dirty_dirs * sizeof(struct dir_inode_entry);
+}
+
+static int stat_show(struct seq_file *s, void *v)
+{
+ struct f2fs_stat_info *si, *next;
+ int i = 0;
+ int j;
+
+ list_for_each_entry_safe(si, next, &f2fs_stat_list, stat_list) {
+
+ mutex_lock(&si->stat_lock);
+ if (!si->sbi) {
+ mutex_unlock(&si->stat_lock);
+ continue;
+ }
+ update_general_status(si->sbi);
+
+ seq_printf(s, "\n=====[ partition info. #%d ]=====\n", i++);
+ seq_printf(s, "[SB: 1] [CP: 2] [NAT: %d] [SIT: %d] ",
+ si->nat_area_segs, si->sit_area_segs);
+ seq_printf(s, "[SSA: %d] [MAIN: %d",
+ si->ssa_area_segs, si->main_area_segs);
+ seq_printf(s, "(OverProv:%d Resv:%d)]\n\n",
+ si->overp_segs, si->rsvd_segs);
+ seq_printf(s, "Utilization: %d%% (%d valid blocks)\n",
+ si->utilization, si->valid_count);
+ seq_printf(s, " - Node: %u (Inode: %u, ",
+ si->valid_node_count, si->valid_inode_count);
+ seq_printf(s, "Other: %u)\n - Data: %u\n",
+ si->valid_node_count - si->valid_inode_count,
+ si->valid_count - si->valid_node_count);
+ seq_printf(s, "\nMain area: %d segs, %d secs %d zones\n",
+ si->main_area_segs, si->main_area_sections,
+ si->main_area_zones);
+ seq_printf(s, " - COLD data: %d, %d, %d\n",
+ si->curseg[CURSEG_COLD_DATA],
+ si->cursec[CURSEG_COLD_DATA],
+ si->curzone[CURSEG_COLD_DATA]);
+ seq_printf(s, " - WARM data: %d, %d, %d\n",
+ si->curseg[CURSEG_WARM_DATA],
+ si->cursec[CURSEG_WARM_DATA],
+ si->curzone[CURSEG_WARM_DATA]);
+ seq_printf(s, " - HOT data: %d, %d, %d\n",
+ si->curseg[CURSEG_HOT_DATA],
+ si->cursec[CURSEG_HOT_DATA],
+ si->curzone[CURSEG_HOT_DATA]);
+ seq_printf(s, " - Dir dnode: %d, %d, %d\n",
+ si->curseg[CURSEG_HOT_NODE],
+ si->cursec[CURSEG_HOT_NODE],
+ si->curzone[CURSEG_HOT_NODE]);
+ seq_printf(s, " - File dnode: %d, %d, %d\n",
+ si->curseg[CURSEG_WARM_NODE],
+ si->cursec[CURSEG_WARM_NODE],
+ si->curzone[CURSEG_WARM_NODE]);
+ seq_printf(s, " - Indir nodes: %d, %d, %d\n",
+ si->curseg[CURSEG_COLD_NODE],
+ si->cursec[CURSEG_COLD_NODE],
+ si->curzone[CURSEG_COLD_NODE]);
+ seq_printf(s, "\n - Valid: %d\n - Dirty: %d\n",
+ si->main_area_segs - si->dirty_count -
+ si->prefree_count - si->free_segs,
+ si->dirty_count);
+ seq_printf(s, " - Prefree: %d\n - Free: %d (%d)\n\n",
+ si->prefree_count, si->free_segs, si->free_secs);
+ seq_printf(s, "GC calls: %d (BG: %d)\n",
+ si->call_count, si->bg_gc);
+ seq_printf(s, " - data segments : %d\n", si->data_segs);
+ seq_printf(s, " - node segments : %d\n", si->node_segs);
+ seq_printf(s, "Try to move %d blocks\n", si->tot_blks);
+ seq_printf(s, " - data blocks : %d\n", si->data_blks);
+ seq_printf(s, " - node blocks : %d\n", si->node_blks);
+ seq_printf(s, "\nExtent Hit Ratio: %d / %d\n",
+ si->hit_ext, si->total_ext);
+ seq_printf(s, "\nBalancing F2FS Async:\n");
+ seq_printf(s, " - nodes %4d in %4d\n",
+ si->ndirty_node, si->node_pages);
+ seq_printf(s, " - dents %4d in dirs:%4d\n",
+ si->ndirty_dent, si->ndirty_dirs);
+ seq_printf(s, " - meta %4d in %4d\n",
+ si->ndirty_meta, si->meta_pages);
+ seq_printf(s, " - NATs %5d > %lu\n",
+ si->nats, NM_WOUT_THRESHOLD);
+ seq_printf(s, " - SITs: %5d\n - free_nids: %5d\n",
+ si->sits, si->fnids);
+ seq_printf(s, "\nDistribution of User Blocks:");
+ seq_printf(s, " [ valid | invalid | free ]\n");
+ seq_printf(s, " [");
+
+ for (j = 0; j < si->util_valid; j++)
+ seq_printf(s, "-");
+ seq_printf(s, "|");
+
+ for (j = 0; j < si->util_invalid; j++)
+ seq_printf(s, "-");
+ seq_printf(s, "|");
+
+ for (j = 0; j < si->util_free; j++)
+ seq_printf(s, "-");
+ seq_printf(s, "]\n\n");
+ seq_printf(s, "SSR: %u blocks in %u segments\n",
+ si->block_count[SSR], si->segment_count[SSR]);
+ seq_printf(s, "LFS: %u blocks in %u segments\n",
+ si->block_count[LFS], si->segment_count[LFS]);
+
+ /* segment usage info */
+ update_sit_info(si->sbi);
+ seq_printf(s, "\nBDF: %u, avg. vblocks: %u\n",
+ si->bimodal, si->avg_vblocks);
+
+ /* memory footprint */
+ update_mem_info(si->sbi);
+ seq_printf(s, "\nMemory: %u KB = static: %u + cached: %u\n",
+ (si->base_mem + si->cache_mem) >> 10,
+ si->base_mem >> 10, si->cache_mem >> 10);
+ mutex_unlock(&si->stat_lock);
+ }
+ return 0;
+}
+
+static int stat_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, stat_show, inode->i_private);
+}
+
+static const struct file_operations stat_fops = {
+ .open = stat_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int init_stats(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_super_block *raw_super = F2FS_RAW_SUPER(sbi);
+ struct f2fs_stat_info *si;
+
+ sbi->stat_info = kzalloc(sizeof(struct f2fs_stat_info), GFP_KERNEL);
+ if (!sbi->stat_info)
+ return -ENOMEM;
+
+ si = sbi->stat_info;
+ mutex_init(&si->stat_lock);
+ list_add_tail(&si->stat_list, &f2fs_stat_list);
+
+ si->all_area_segs = le32_to_cpu(raw_super->segment_count);
+ si->sit_area_segs = le32_to_cpu(raw_super->segment_count_sit);
+ si->nat_area_segs = le32_to_cpu(raw_super->segment_count_nat);
+ si->ssa_area_segs = le32_to_cpu(raw_super->segment_count_ssa);
+ si->main_area_segs = le32_to_cpu(raw_super->segment_count_main);
+ si->main_area_sections = le32_to_cpu(raw_super->section_count);
+ si->main_area_zones = si->main_area_sections /
+ le32_to_cpu(raw_super->secs_per_zone);
+ si->sbi = sbi;
+ return 0;
+}
+
+int f2fs_build_stats(struct f2fs_sb_info *sbi)
+{
+ int retval;
+
+ retval = init_stats(sbi);
+ if (retval)
+ return retval;
+
+ if (!debugfs_root)
+ debugfs_root = debugfs_create_dir("f2fs", NULL);
+
+ debugfs_create_file("status", S_IRUGO, debugfs_root, NULL, &stat_fops);
+ return 0;
+}
+
+void f2fs_destroy_stats(struct f2fs_sb_info *sbi)
+{
+ struct f2fs_stat_info *si = sbi->stat_info;
+
+ list_del(&si->stat_list);
+ mutex_lock(&si->stat_lock);
+ si->sbi = NULL;
+ mutex_unlock(&si->stat_lock);
+ kfree(sbi->stat_info);
+}
+
+void destroy_root_stats(void)
+{
+ debugfs_remove_recursive(debugfs_root);
+ debugfs_root = NULL;
+}
--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-10-31 22:50:46

by Greg Kroah-Hartman

[permalink] [raw]

Subject: Re: [PATCH 16/17 v2] f2fs: move proc files to debugfs

On Thu, Nov 01, 2012 at 07:38:12AM +0900, Jaegeuk Kim wrote:
> From: Greg Kroah-Hartman <[email protected]>
>
> This moves all of the f2fs debugging files into debugfs. The files are
> located in /sys/kernel/debug/f2fs/
>
> Note, I think we are generating all of the same information in each of
> the files for every unique f2fs filesystem in the machine. This copies
> the functionality that was present in the proc files, but this should be
> fixed up in the future.
>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
> [[email protected]: merged 3 debugfs entries into a *status* entry]
> Signed-off-by: Jaegeuk Kim <[email protected]>

Thanks for the change.

As for the merge, that looks good to me.

greg k-h

2012-10-31 22:53:26

[permalink] [raw]

Subject: [PATCH 03/17 v3] f2fs: add superblock and major in-memory structure

f2fs: add superblock and major in-memory structure

This adds the following major in-memory structures in f2fs.

- f2fs_sb_info:
contains f2fs-specific information, two special inode pointers for node and
meta address spaces, and orphan inode management.

- f2fs_inode_info:
contains vfs_inode and other fs-specific information.

- f2fs_nm_info:
contains node manager information such as NAT entry cache, free nid list,
and NAT page management.

- f2fs_node_info:
represents a node as node id, inode number, block address, and its version.

- f2fs_sm_info:
contains segment manager information such as SIT entry cache, free segment
map, current active logs, dirty segment management, and segment utilization.
The specific structures are sit_info, free_segmap_info, dirty_seglist_info,
curseg_info.

Signed-off-by: Chul Lee <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
---
v3: Remove change log from the patch
v2: Resolve checkpatch.pl errors
---
fs/f2fs/f2fs.h | 1062 +++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/f2fs/node.h | 353 ++++++++++++++++++
fs/f2fs/segment.h | 615 +++++++++++++++++++++++++++++++
3 files changed, 2030 insertions(+)
create mode 100644 fs/f2fs/f2fs.h
create mode 100644 fs/f2fs/node.h
create mode 100644 fs/f2fs/segment.h

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
new file mode 100644
index 0000000..7aa70b5
--- /dev/null
+++ b/fs/f2fs/f2fs.h
@@ -0,0 +1,1062 @@
+/**
+ * fs/f2fs/f2fs.h
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef _LINUX_F2FS_H
+#define _LINUX_F2FS_H
+
+#include <linux/types.h>
+#include <linux/page-flags.h>
+#include <linux/buffer_head.h>
+#include <linux/version.h>
+#include <linux/slab.h>
+#include <linux/crc32.h>
+#include <linux/magic.h>
+
+/*
+ * For mount options
+ */
+#define F2FS_MOUNT_BG_GC 0x00000001
+#define F2FS_MOUNT_DISABLE_ROLL_FORWARD 0x00000002
+#define F2FS_MOUNT_DISCARD 0x00000004
+#define F2FS_MOUNT_NOHEAP 0x00000008
+#define F2FS_MOUNT_XATTR_USER 0x00000010
+#define F2FS_MOUNT_POSIX_ACL 0x00000020
+#define F2FS_MOUNT_DISABLE_EXT_IDENTIFY 0x00000040
+
+#define clear_opt(sbi, option) (sbi->mount_opt.opt &= ~F2FS_MOUNT_##option)
+#define set_opt(sbi, option) (sbi->mount_opt.opt |= F2FS_MOUNT_##option)
+#define test_opt(sbi, option) (sbi->mount_opt.opt & F2FS_MOUNT_##option)
+
+#define ver_after(a, b) (typecheck(unsigned long long, a) && \
+ typecheck(unsigned long long, b) && \
+ ((long long)((a) - (b)) > 0))
+
+typedef u64 block_t;
+typedef u32 nid_t;
+
+struct f2fs_mount_info {
+ unsigned int opt;
+};
+
+static inline __u32 f2fs_crc32(void *buff, size_t len)
+{
+ return crc32_le(F2FS_SUPER_MAGIC, buff, len);
+}
+
+static inline bool f2fs_crc_valid(__u32 blk_crc, void *buff, size_t buff_size)
+{
+ return f2fs_crc32(buff, buff_size) == blk_crc;
+}
+
+/*
+ * For checkpoint manager
+ */
+enum {
+ NAT_BITMAP,
+ SIT_BITMAP
+};
+
+/* for the list of orphan inodes */
+struct orphan_inode_entry {
+ struct list_head list; /* list head */
+ nid_t ino; /* inode number */
+};
+
+/* for the list of directory inodes */
+struct dir_inode_entry {
+ struct list_head list; /* list head */
+ struct inode *inode; /* vfs inode pointer */
+};
+
+/* for the list of fsync inodes, used only during recovery */
+struct fsync_inode_entry {
+ struct list_head list; /* list head */
+ struct inode *inode; /* vfs inode pointer */
+ block_t blkaddr; /* block address locating the last inode */
+};
+
+#define nats_in_cursum(sum) (le16_to_cpu(sum->n_nats))
+#define sits_in_cursum(sum) (le16_to_cpu(sum->n_sits))
+
+#define nat_in_journal(sum, i) (sum->nat_j.entries[i].ne)
+#define nid_in_journal(sum, i) (sum->nat_j.entries[i].nid)
+#define sit_in_journal(sum, i) (sum->sit_j.entries[i].se)
+#define segno_in_journal(sum, i) (sum->sit_j.entries[i].segno)
+
+static inline int update_nats_in_cursum(struct f2fs_summary_block *rs, int i)
+{
+ int before = nats_in_cursum(rs);
+ rs->n_nats = cpu_to_le16(before + i);
+ return before;
+}
+
+static inline int update_sits_in_cursum(struct f2fs_summary_block *rs, int i)
+{
+ int before = sits_in_cursum(rs);
+ rs->n_sits = cpu_to_le16(before + i);
+ return before;
+}
+
+/*
+ * For INODE and NODE manager
+ */
+#define XATTR_NODE_OFFSET (-1) /*
+ * store xattrs to one node block per
+ * file keeping -1 as its node offset to
+ * distinguish from index node blocks.
+ */
+#define RDONLY_NODE 1 /*
+ * specify a read-only mode when getting
+ * a node block. 0 is read-write mode.
+ * used by get_dnode_of_data().
+ */
+#define F2FS_LINK_MAX 32000 /* maximum link count per file */
+
+/* for in-memory extent cache entry */
+struct extent_info {
+ rwlock_t ext_lock; /* rwlock for consistency */
+ unsigned int fofs; /* start offset in a file */
+ u32 blk_addr; /* start block address of the extent */
+ unsigned int len; /* lenth of the extent */
+};
+
+/*
+ * i_advise uses FADVISE_XXX_BIT. We can add additional hints later.
+ */
+#define FADVISE_COLD_BIT 0x01
+
+struct f2fs_inode_info {
+ struct inode vfs_inode; /* serve a vfs inode */
+ unsigned long i_flags; /* keep an inode flags for ioctl */
+ unsigned char i_advise; /* use to give file attribute hints */
+ unsigned int i_current_depth; /* use only in directory structure */
+ umode_t i_acl_mode; /* keep file acl mode temporarily */
+
+ /* Use below internally in f2fs*/
+ unsigned long flags; /* use to pass per-file flags */
+ unsigned long long data_version;/* lastes version of data for fsync */
+ atomic_t dirty_dents; /* # of dirty dentry pages */
+ f2fs_hash_t chash; /* hash value of given file name */
+ unsigned int clevel; /* maximum level of given file name */
+ nid_t i_xattr_nid; /* node id that contains xattrs */
+ struct extent_info ext; /* in-memory extent cache entry */
+};
+
+static inline void get_extent_info(struct extent_info *ext,
+ struct f2fs_extent i_ext)
+{
+ write_lock(&ext->ext_lock);
+ ext->fofs = le32_to_cpu(i_ext.fofs);
+ ext->blk_addr = le32_to_cpu(i_ext.blk_addr);
+ ext->len = le32_to_cpu(i_ext.len);
+ write_unlock(&ext->ext_lock);
+}
+
+static inline void set_raw_extent(struct extent_info *ext,
+ struct f2fs_extent *i_ext)
+{
+ read_lock(&ext->ext_lock);
+ i_ext->fofs = cpu_to_le32(ext->fofs);
+ i_ext->blk_addr = cpu_to_le32(ext->blk_addr);
+ i_ext->len = cpu_to_le32(ext->len);
+ read_unlock(&ext->ext_lock);
+}
+
+struct f2fs_nm_info {
+ block_t nat_blkaddr; /* base disk address of NAT */
+ nid_t max_nid; /* maximum possible node ids */
+ nid_t init_scan_nid; /* the first nid to be scanned */
+ nid_t next_scan_nid; /* the next nid to be scanned */
+
+ /* NAT cache management */
+ struct radix_tree_root nat_root;/* root of the nat entry cache */
+ rwlock_t nat_tree_lock; /* protect nat_tree_lock */
+ unsigned int nat_cnt; /* the # of cached nat entries */
+ struct list_head nat_entries; /* cached nat entry list (clean) */
+ struct list_head dirty_nat_entries; /* cached nat entry list (dirty) */
+
+ /* free node ids management */
+ struct list_head free_nid_list; /* a list for free nids */
+ spinlock_t free_nid_list_lock; /* protect free nid list */
+ unsigned int fcnt; /* the number of free node id */
+ struct mutex build_lock; /* lock for build free nids */
+
+ /* for checkpoint */
+ char *nat_bitmap; /* NAT bitmap pointer */
+ int bitmap_size; /* bitmap size */
+};
+
+/*
+ * this structure is used as one of function parameters.
+ * all the information are dedicated to a given direct node block determined
+ * by the data offset in a file.
+ */
+struct dnode_of_data {
+ struct inode *inode; /* vfs inode pointer */
+ struct page *inode_page; /* its inode page, NULL is possible */
+ struct page *node_page; /* cached direct node page */
+ nid_t nid; /* node id of the direct node block */
+ unsigned int ofs_in_node; /* data offset in the node page */
+ bool inode_page_locked; /* inode page is locked or not */
+ block_t data_blkaddr; /* block address of the node block */
+};
+
+static inline void set_new_dnode(struct dnode_of_data *dn, struct inode *inode,
+ struct page *ipage, struct page *npage, nid_t nid)
+{
+ dn->inode = inode;
+ dn->inode_page = ipage;
+ dn->node_page = npage;
+ dn->nid = nid;
+ dn->inode_page_locked = 0;
+}
+
+/*
+ * For SIT manager
+ *
+ * By default, there are 6 active log areas across the whole main area.
+ * When considering hot and cold data separation to reduce cleaning overhead,
+ * we split 3 for data logs and 3 for node logs as hot, warm, and cold types,
+ * respectively.
+ * In the current design, you should not change the numbers intentionally.
+ * Instead, as a mount option such as active_logs=x, you can use 2, 4, and 6
+ * logs individually according to the underlying devices. (default: 6)
+ * Just in case, on-disk layout covers maximum 16 logs that consist of 8 for
+ * data and 8 for node logs.
+ */
+#define NR_CURSEG_DATA_TYPE (3)
+#define NR_CURSEG_NODE_TYPE (3)
+#define NR_CURSEG_TYPE (NR_CURSEG_DATA_TYPE + NR_CURSEG_NODE_TYPE)
+
+enum {
+ CURSEG_HOT_DATA = 0, /* directory entry blocks */
+ CURSEG_WARM_DATA, /* data blocks */
+ CURSEG_COLD_DATA, /* multimedia or GCed data blocks */
+ CURSEG_HOT_NODE, /* direct node blocks of directory files */
+ CURSEG_WARM_NODE, /* direct node blocks of normal files */
+ CURSEG_COLD_NODE, /* indirect node blocks */
+ NO_CHECK_TYPE
+};
+
+struct f2fs_sm_info {
+ struct sit_info *sit_info; /* whole segment information */
+ struct free_segmap_info *free_info; /* free segment information */
+ struct dirty_seglist_info *dirty_info; /* dirty segment information */
+ struct curseg_info *curseg_array; /* active segment information */
+
+ struct list_head wblist_head; /* list of under-writeback pages */
+ spinlock_t wblist_lock; /* lock for checkpoint */
+
+ block_t seg0_blkaddr; /* block address of 0'th segment */
+ block_t main_blkaddr; /* start block address of main area */
+ block_t ssa_blkaddr; /* start block address of SSA area */
+
+ unsigned int segment_count; /* total # of segments */
+ unsigned int main_segments; /* # of segments in main area */
+ unsigned int reserved_segments; /* # of reserved segments */
+ unsigned int ovp_segments; /* # of overprovision segments */
+};
+
+/*
+ * For directory operation
+ */
+#define NODE_DIR1_BLOCK (ADDRS_PER_INODE + 1)
+#define NODE_DIR2_BLOCK (ADDRS_PER_INODE + 2)
+#define NODE_IND1_BLOCK (ADDRS_PER_INODE + 3)
+#define NODE_IND2_BLOCK (ADDRS_PER_INODE + 4)
+#define NODE_DIND_BLOCK (ADDRS_PER_INODE + 5)
+
+/*
+ * For superblock
+ */
+/*
+ * COUNT_TYPE for monitoring
+ *
+ * f2fs monitors the number of several block types such as on-writeback,
+ * dirty dentry blocks, dirty node blocks, and dirty meta blocks.
+ */
+enum count_type {
+ F2FS_WRITEBACK,
+ F2FS_DIRTY_DENTS,
+ F2FS_DIRTY_NODES,
+ F2FS_DIRTY_META,
+ NR_COUNT_TYPE,
+};
+
+/*
+ * FS_LOCK nesting subclasses for the lock validator:
+ *
+ * The locking order between these classes is
+ * RENAME -> DENTRY_OPS -> DATA_WRITE -> DATA_NEW
+ * -> DATA_TRUNC -> NODE_WRITE -> NODE_NEW -> NODE_TRUNC
+ */
+enum lock_type {
+ RENAME, /* for renaming operations */
+ DENTRY_OPS, /* for directory operations */
+ DATA_WRITE, /* for data write */
+ DATA_NEW, /* for data allocation */
+ DATA_TRUNC, /* for data truncate */
+ NODE_NEW, /* for node allocation */
+ NODE_TRUNC, /* for node truncate */
+ NODE_WRITE, /* for node write */
+ NR_LOCK_TYPE,
+};
+
+/*
+ * The below are the page types of bios used in submti_bio().
+ * The available types are:
+ * DATA User data pages. It operates as async mode.
+ * NODE Node pages. It operates as async mode.
+ * META FS metadata pages such as SIT, NAT, CP.
+ * NR_PAGE_TYPE The number of page types.
+ * META_FLUSH Make sure the previous pages are written
+ * with waiting the bio's completion
+ * ... Only can be used with META.
+ */
+enum page_type {
+ DATA,
+ NODE,
+ META,
+ NR_PAGE_TYPE,
+ META_FLUSH,
+};
+
+struct f2fs_sb_info {
+ struct super_block *sb; /* pointer to VFS super block */
+ struct buffer_head *raw_super_buf; /* buffer head of raw sb */
+ struct f2fs_super_block *raw_super; /* raw super block pointer */
+ int s_dirty; /* dirty flag for checkpoint */
+
+ /* for node-related operations */
+ struct f2fs_nm_info *nm_info; /* node manager */
+ struct inode *node_inode; /* cache node blocks */
+
+ /* for segment-related operations */
+ struct f2fs_sm_info *sm_info; /* segment manager */
+ struct bio *bio[NR_PAGE_TYPE]; /* bios to merge */
+ sector_t last_block_in_bio[NR_PAGE_TYPE]; /* last block number */
+ struct rw_semaphore bio_sem; /* IO semaphore */
+
+ /* for checkpoint */
+ struct f2fs_checkpoint *ckpt; /* raw checkpoint pointer */
+ struct inode *meta_inode; /* cache meta blocks */
+ struct mutex cp_mutex; /* for checkpoint procedure */
+ struct mutex fs_lock[NR_LOCK_TYPE]; /* for blocking FS operations */
+ struct mutex write_inode; /* mutex for write inode */
+ struct mutex writepages; /* mutex for writepages() */
+ int por_doing; /* recovery is doing or not */
+
+ /* for orphan inode management */
+ struct list_head orphan_inode_list; /* orphan inode list */
+ struct mutex orphan_inode_mutex; /* for orphan inode list */
+ unsigned int n_orphans; /* # of orphan inodes */
+
+ /* for directory inode management */
+ struct list_head dir_inode_list; /* dir inode list */
+ spinlock_t dir_inode_lock; /* for dir inode list lock */
+ unsigned int n_dirty_dirs; /* # of dir inodes */
+
+ /* basic file system units */
+ unsigned int log_sectors_per_block; /* log2 sectors per block */
+ unsigned int log_blocksize; /* log2 block size */
+ unsigned int blocksize; /* block size */
+ unsigned int root_ino_num; /* root inode number*/
+ unsigned int node_ino_num; /* node inode number*/
+ unsigned int meta_ino_num; /* meta inode number*/
+ unsigned int log_blocks_per_seg; /* log2 blocks per segment */
+ unsigned int blocks_per_seg; /* blocks per segment */
+ unsigned int segs_per_sec; /* segments per section */
+ unsigned int secs_per_zone; /* sections per zone */
+ unsigned int total_sections; /* total section count */
+ unsigned int total_node_count; /* total node block count */
+ unsigned int total_valid_node_count; /* valid node block count */
+ unsigned int total_valid_inode_count; /* valid inode count */
+ int active_logs; /* # of active logs */
+
+ block_t user_block_count; /* # of user blocks */
+ block_t total_valid_block_count; /* # of valid blocks */
+ block_t alloc_valid_block_count; /* # of allocated blocks */
+ block_t last_valid_block_count; /* for recovery */
+ u32 s_next_generation; /* for NFS support */
+ atomic_t nr_pages[NR_COUNT_TYPE]; /* # of pages, see count_type */
+
+ struct f2fs_mount_info mount_opt; /* mount options */
+
+ /* for cleaning operations */
+ struct mutex gc_mutex; /* mutex for GC */
+ struct f2fs_gc_kthread *gc_thread; /* GC thread */
+
+ /*
+ * for stat information.
+ * one is for the LFS mode, and the other is for the SSR mode.
+ */
+ struct f2fs_stat_info *stat_info; /* FS status information */
+ unsigned int segment_count[2]; /* # of allocated segments */
+ unsigned int block_count[2]; /* # of allocated blocks */
+ unsigned int last_victim[2]; /* last victim segment # */
+ int total_hit_ext, read_hit_ext; /* extent cache hit ratio */
+ int bg_gc; /* background gc calls */
+ spinlock_t stat_lock; /* lock for stat operations */
+};
+
+/*
+ * Inline functions
+ */
+static inline struct f2fs_inode_info *F2FS_I(struct inode *inode)
+{
+ return container_of(inode, struct f2fs_inode_info, vfs_inode);
+}
+
+static inline struct f2fs_sb_info *F2FS_SB(struct super_block *sb)
+{
+ return sb->s_fs_info;
+}
+
+static inline struct f2fs_super_block *F2FS_RAW_SUPER(struct f2fs_sb_info *sbi)
+{
+ return (struct f2fs_super_block *)(sbi->raw_super);
+}
+
+static inline struct f2fs_checkpoint *F2FS_CKPT(struct f2fs_sb_info *sbi)
+{
+ return (struct f2fs_checkpoint *)(sbi->ckpt);
+}
+
+static inline struct f2fs_nm_info *NM_I(struct f2fs_sb_info *sbi)
+{
+ return (struct f2fs_nm_info *)(sbi->nm_info);
+}
+
+static inline struct f2fs_sm_info *SM_I(struct f2fs_sb_info *sbi)
+{
+ return (struct f2fs_sm_info *)(sbi->sm_info);
+}
+
+static inline struct sit_info *SIT_I(struct f2fs_sb_info *sbi)
+{
+ return (struct sit_info *)(SM_I(sbi)->sit_info);
+}
+
+static inline struct free_segmap_info *FREE_I(struct f2fs_sb_info *sbi)
+{
+ return (struct free_segmap_info *)(SM_I(sbi)->free_info);
+}
+
+static inline struct dirty_seglist_info *DIRTY_I(struct f2fs_sb_info *sbi)
+{
+ return (struct dirty_seglist_info *)(SM_I(sbi)->dirty_info);
+}
+
+static inline void F2FS_SET_SB_DIRT(struct f2fs_sb_info *sbi)
+{
+ sbi->s_dirty = 1;
+}
+
+static inline void F2FS_RESET_SB_DIRT(struct f2fs_sb_info *sbi)
+{
+ sbi->s_dirty = 0;
+}
+
+static inline void mutex_lock_op(struct f2fs_sb_info *sbi, enum lock_type t)
+{
+ mutex_lock_nested(&sbi->fs_lock[t], t);
+}
+
+static inline void mutex_unlock_op(struct f2fs_sb_info *sbi, enum lock_type t)
+{
+ mutex_unlock(&sbi->fs_lock[t]);
+}
+
+/*
+ * Check whether the given nid is within node id range.
+ */
+static inline void check_nid_range(struct f2fs_sb_info *sbi, nid_t nid)
+{
+ BUG_ON((nid >= NM_I(sbi)->max_nid));
+}
+
+#define F2FS_DEFAULT_ALLOCATED_BLOCKS 1
+
+/*
+ * Check whether the inode has blocks or not
+ */
+static inline int F2FS_HAS_BLOCKS(struct inode *inode)
+{
+ if (F2FS_I(inode)->i_xattr_nid)
+ return (inode->i_blocks > F2FS_DEFAULT_ALLOCATED_BLOCKS + 1);
+ else
+ return (inode->i_blocks > F2FS_DEFAULT_ALLOCATED_BLOCKS);
+}
+
+static inline bool inc_valid_block_count(struct f2fs_sb_info *sbi,
+ struct inode *inode, blkcnt_t count)
+{
+ block_t valid_block_count;
+
+ spin_lock(&sbi->stat_lock);
+ valid_block_count =
+ sbi->total_valid_block_count + (block_t)count;
+ if (valid_block_count > sbi->user_block_count) {
+ spin_unlock(&sbi->stat_lock);
+ return false;
+ }
+ inode->i_blocks += count;
+ sbi->total_valid_block_count = valid_block_count;
+ sbi->alloc_valid_block_count += (block_t)count;
+ spin_unlock(&sbi->stat_lock);
+ return true;
+}
+
+static inline int dec_valid_block_count(struct f2fs_sb_info *sbi,
+ struct inode *inode,
+ blkcnt_t count)
+{
+ spin_lock(&sbi->stat_lock);
+ BUG_ON(sbi->total_valid_block_count < (block_t) count);
+ BUG_ON(inode->i_blocks < count);
+ inode->i_blocks -= count;
+ sbi->total_valid_block_count -= (block_t)count;
+ spin_unlock(&sbi->stat_lock);
+ return 0;
+}
+
+static inline void inc_page_count(struct f2fs_sb_info *sbi, int count_type)
+{
+ atomic_inc(&sbi->nr_pages[count_type]);
+ F2FS_SET_SB_DIRT(sbi);
+}
+
+static inline void inode_inc_dirty_dents(struct inode *inode)
+{
+ atomic_inc(&F2FS_I(inode)->dirty_dents);
+}
+
+static inline void dec_page_count(struct f2fs_sb_info *sbi, int count_type)
+{
+ atomic_dec(&sbi->nr_pages[count_type]);
+}
+
+static inline void inode_dec_dirty_dents(struct inode *inode)
+{
+ atomic_dec(&F2FS_I(inode)->dirty_dents);
+}
+
+static inline int get_pages(struct f2fs_sb_info *sbi, int count_type)
+{
+ return atomic_read(&sbi->nr_pages[count_type]);
+}
+
+static inline block_t valid_user_blocks(struct f2fs_sb_info *sbi)
+{
+ block_t ret;
+ spin_lock(&sbi->stat_lock);
+ ret = sbi->total_valid_block_count;
+ spin_unlock(&sbi->stat_lock);
+ return ret;
+}
+
+static inline unsigned long __bitmap_size(struct f2fs_sb_info *sbi, int flag)
+{
+ struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
+
+ /* return NAT or SIT bitmap */
+ if (flag == NAT_BITMAP)
+ return le32_to_cpu(ckpt->nat_ver_bitmap_bytesize);
+ else if (flag == SIT_BITMAP)
+ return le32_to_cpu(ckpt->sit_ver_bitmap_bytesize);
+
+ return 0;
+}
+
+static inline void *__bitmap_ptr(struct f2fs_sb_info *sbi, int flag)
+{
+ struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
+ int offset = (flag == NAT_BITMAP) ? ckpt->sit_ver_bitmap_bytesize : 0;
+ return &ckpt->sit_nat_version_bitmap + offset;
+}
+
+static inline block_t __start_cp_addr(struct f2fs_sb_info *sbi)
+{
+ block_t start_addr;
+ struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
+ unsigned long long ckpt_version = le64_to_cpu(ckpt->checkpoint_ver);
+
+ start_addr = le64_to_cpu(F2FS_RAW_SUPER(sbi)->cp_blkaddr);
+
+ /*
+ * odd numbered checkpoint should at cp segment 0
+ * and even segent must be at cp segment 1
+ */
+ if (!(ckpt_version & 1))
+ start_addr += sbi->blocks_per_seg;
+
+ return start_addr;
+}
+
+static inline block_t __start_sum_addr(struct f2fs_sb_info *sbi)
+{
+ return le32_to_cpu(F2FS_CKPT(sbi)->cp_pack_start_sum);
+}
+
+static inline bool inc_valid_node_count(struct f2fs_sb_info *sbi,
+ struct inode *inode,
+ unsigned int count)
+{
+ block_t valid_block_count;
+ unsigned int valid_node_count;
+
+ spin_lock(&sbi->stat_lock);
+
+ valid_block_count = sbi->total_valid_block_count + (block_t)count;
+ sbi->alloc_valid_block_count += (block_t)count;
+ valid_node_count = sbi->total_valid_node_count + count;
+
+ if (valid_block_count > sbi->user_block_count) {
+ spin_unlock(&sbi->stat_lock);
+ return false;
+ }
+
+ if (valid_node_count > sbi->total_node_count) {
+ spin_unlock(&sbi->stat_lock);
+ return false;
+ }
+
+ if (inode)
+ inode->i_blocks += count;
+ sbi->total_valid_node_count = valid_node_count;
+ sbi->total_valid_block_count = valid_block_count;
+ spin_unlock(&sbi->stat_lock);
+
+ return true;
+}
+
+static inline void dec_valid_node_count(struct f2fs_sb_info *sbi,
+ struct inode *inode,
+ unsigned int count)
+{
+ spin_lock(&sbi->stat_lock);
+
+ BUG_ON(sbi->total_valid_block_count < count);
+ BUG_ON(sbi->total_valid_node_count < count);
+ BUG_ON(inode->i_blocks < count);
+
+ inode->i_blocks -= count;
+ sbi->total_valid_node_count -= count;
+ sbi->total_valid_block_count -= (block_t)count;
+
+ spin_unlock(&sbi->stat_lock);
+}
+
+static inline unsigned int valid_node_count(struct f2fs_sb_info *sbi)
+{
+ unsigned int ret;
+ spin_lock(&sbi->stat_lock);
+ ret = sbi->total_valid_node_count;
+ spin_unlock(&sbi->stat_lock);
+ return ret;
+}
+
+static inline void inc_valid_inode_count(struct f2fs_sb_info *sbi)
+{
+ spin_lock(&sbi->stat_lock);
+ BUG_ON(sbi->total_valid_inode_count == sbi->total_node_count);
+ sbi->total_valid_inode_count++;
+ spin_unlock(&sbi->stat_lock);
+}
+
+static inline int dec_valid_inode_count(struct f2fs_sb_info *sbi)
+{
+ spin_lock(&sbi->stat_lock);
+ BUG_ON(!sbi->total_valid_inode_count);
+ sbi->total_valid_inode_count--;
+ spin_unlock(&sbi->stat_lock);
+ return 0;
+}
+
+static inline unsigned int valid_inode_count(struct f2fs_sb_info *sbi)
+{
+ unsigned int ret;
+ spin_lock(&sbi->stat_lock);
+ ret = sbi->total_valid_inode_count;
+ spin_unlock(&sbi->stat_lock);
+ return ret;
+}
+
+static inline void f2fs_put_page(struct page *page, int unlock)
+{
+ if (!page || IS_ERR(page))
+ return;
+
+ if (unlock) {
+ BUG_ON(!PageLocked(page));
+ unlock_page(page);
+ }
+ page_cache_release(page);
+}
+
+static inline void f2fs_put_dnode(struct dnode_of_data *dn)
+{
+ if (dn->node_page)
+ f2fs_put_page(dn->node_page, 1);
+ if (dn->inode_page && dn->node_page != dn->inode_page)
+ f2fs_put_page(dn->inode_page, 0);
+ dn->node_page = NULL;
+ dn->inode_page = NULL;
+}
+
+static inline struct kmem_cache *f2fs_kmem_cache_create(const char *name,
+ size_t size, void (*ctor)(void *))
+{
+ return kmem_cache_create(name, size, 0, SLAB_RECLAIM_ACCOUNT, ctor);
+}
+
+#define RAW_IS_INODE(p) ((p)->footer.nid == (p)->footer.ino)
+
+static inline bool IS_INODE(struct page *page)
+{
+ struct f2fs_node *p = (struct f2fs_node *)page_address(page);
+ return RAW_IS_INODE(p);
+}
+
+static inline __le32 *blkaddr_in_node(struct f2fs_node *node)
+{
+ return RAW_IS_INODE(node) ? node->i.i_addr : node->dn.addr;
+}
+
+static inline block_t datablock_addr(struct page *node_page,
+ unsigned int offset)
+{
+ struct f2fs_node *raw_node;
+ __le32 *addr_array;
+ raw_node = (struct f2fs_node *)page_address(node_page);
+ addr_array = blkaddr_in_node(raw_node);
+ return le32_to_cpu(addr_array[offset]);
+}
+
+static inline int f2fs_test_bit(unsigned int nr, char *addr)
+{
+ int mask;
+
+ addr += (nr >> 3);
+ mask = 1 << (7 - (nr & 0x07));
+ return mask & *addr;
+}
+
+static inline int f2fs_set_bit(unsigned int nr, char *addr)
+{
+ int mask;
+ int ret;
+
+ addr += (nr >> 3);
+ mask = 1 << (7 - (nr & 0x07));
+ ret = mask & *addr;
+ *addr |= mask;
+ return ret;
+}
+
+static inline int f2fs_clear_bit(unsigned int nr, char *addr)
+{
+ int mask;
+ int ret;
+
+ addr += (nr >> 3);
+ mask = 1 << (7 - (nr & 0x07));
+ ret = mask & *addr;
+ *addr &= ~mask;
+ return ret;
+}
+
+/* used for f2fs_inode_info->flags */
+enum {
+ FI_NEW_INODE, /* indicate newly allocated inode */
+ FI_NEED_CP, /* need to do checkpoint during fsync */
+ FI_INC_LINK, /* need to increment i_nlink */
+ FI_ACL_MODE, /* indicate acl mode */
+ FI_NO_ALLOC, /* should not allocate any blocks */
+};
+
+static inline void set_inode_flag(struct f2fs_inode_info *fi, int flag)
+{
+ set_bit(flag, &fi->flags);
+}
+
+static inline int is_inode_flag_set(struct f2fs_inode_info *fi, int flag)
+{
+ return test_bit(flag, &fi->flags);
+}
+
+static inline void clear_inode_flag(struct f2fs_inode_info *fi, int flag)
+{
+ clear_bit(flag, &fi->flags);
+}
+
+static inline void set_acl_inode(struct f2fs_inode_info *fi, umode_t mode)
+{
+ fi->i_acl_mode = mode;
+ set_inode_flag(fi, FI_ACL_MODE);
+}
+
+static inline int cond_clear_inode_flag(struct f2fs_inode_info *fi, int flag)
+{
+ if (is_inode_flag_set(fi, FI_ACL_MODE)) {
+ clear_inode_flag(fi, FI_ACL_MODE);
+ return 1;
+ }
+ return 0;
+}
+
+/*
+ * file.c
+ */
+int f2fs_sync_file(struct file *, loff_t, loff_t, int);
+void truncate_data_blocks(struct dnode_of_data *);
+void f2fs_truncate(struct inode *);
+int f2fs_setattr(struct dentry *, struct iattr *);
+int truncate_hole(struct inode *, pgoff_t, pgoff_t);
+long f2fs_ioctl(struct file *, unsigned int, unsigned long);
+
+/*
+ * inode.c
+ */
+void f2fs_set_inode_flags(struct inode *);
+struct inode *f2fs_iget_nowait(struct super_block *, unsigned long);
+struct inode *f2fs_iget(struct super_block *, unsigned long);
+void update_inode(struct inode *, struct page *);
+int f2fs_write_inode(struct inode *, struct writeback_control *);
+void f2fs_evict_inode(struct inode *);
+
+/*
+ * namei.c
+ */
+struct dentry *f2fs_get_parent(struct dentry *child);
+
+/*
+ * dir.c
+ */
+struct f2fs_dir_entry *f2fs_find_entry(struct inode *, struct qstr *,
+ struct page **);
+struct f2fs_dir_entry *f2fs_parent_dir(struct inode *, struct page **);
+ino_t f2fs_inode_by_name(struct inode *, struct qstr *);
+void f2fs_set_link(struct inode *, struct f2fs_dir_entry *,
+ struct page *, struct inode *);
+void init_dent_inode(struct dentry *, struct page *);
+int f2fs_add_link(struct dentry *, struct inode *);
+void f2fs_delete_entry(struct f2fs_dir_entry *, struct page *, struct inode *);
+int f2fs_make_empty(struct inode *, struct inode *);
+bool f2fs_empty_dir(struct inode *);
+
+/*
+ * super.c
+ */
+int f2fs_sync_fs(struct super_block *, int);
+
+/*
+ * hash.c
+ */
+f2fs_hash_t f2fs_dentry_hash(const char *, int);
+
+/*
+ * node.c
+ */
+struct dnode_of_data;
+struct node_info;
+
+int is_checkpointed_node(struct f2fs_sb_info *, nid_t);
+void get_node_info(struct f2fs_sb_info *, nid_t, struct node_info *);
+int get_dnode_of_data(struct dnode_of_data *, pgoff_t, int);
+int truncate_inode_blocks(struct inode *, pgoff_t);
+int remove_inode_page(struct inode *);
+int new_inode_page(struct inode *, struct dentry *);
+struct page *new_node_page(struct dnode_of_data *, unsigned int);
+void ra_node_page(struct f2fs_sb_info *, nid_t);
+struct page *get_node_page(struct f2fs_sb_info *, pgoff_t);
+struct page *get_node_page_ra(struct page *, int);
+void sync_inode_page(struct dnode_of_data *);
+int sync_node_pages(struct f2fs_sb_info *, nid_t, struct writeback_control *);
+bool alloc_nid(struct f2fs_sb_info *, nid_t *);
+void alloc_nid_done(struct f2fs_sb_info *, nid_t);
+void alloc_nid_failed(struct f2fs_sb_info *, nid_t);
+void recover_node_page(struct f2fs_sb_info *, struct page *,
+ struct f2fs_summary *, struct node_info *, block_t);
+int recover_inode_page(struct f2fs_sb_info *, struct page *);
+int restore_node_summary(struct f2fs_sb_info *, unsigned int,
+ struct f2fs_summary_block *);
+void flush_nat_entries(struct f2fs_sb_info *);
+int build_node_manager(struct f2fs_sb_info *);
+void destroy_node_manager(struct f2fs_sb_info *);
+int create_node_manager_caches(void);
+void destroy_node_manager_caches(void);
+
+/*
+ * segment.c
+ */
+void f2fs_balance_fs(struct f2fs_sb_info *);
+void invalidate_blocks(struct f2fs_sb_info *, block_t);
+void locate_dirty_segment(struct f2fs_sb_info *, unsigned int);
+void clear_prefree_segments(struct f2fs_sb_info *);
+int npages_for_summary_flush(struct f2fs_sb_info *);
+void allocate_new_segments(struct f2fs_sb_info *);
+struct page *get_sum_page(struct f2fs_sb_info *, unsigned int);
+struct bio *f2fs_bio_alloc(struct block_device *, sector_t, int, gfp_t);
+void f2fs_submit_bio(struct f2fs_sb_info *, enum page_type, bool sync);
+int write_meta_page(struct f2fs_sb_info *, struct page *,
+ struct writeback_control *);
+void write_node_page(struct f2fs_sb_info *, struct page *, unsigned int,
+ block_t, block_t *);
+void write_data_page(struct inode *, struct page *, struct dnode_of_data*,
+ block_t, block_t *);
+void rewrite_data_page(struct f2fs_sb_info *, struct page *, block_t);
+void recover_data_page(struct f2fs_sb_info *, struct page *,
+ struct f2fs_summary *, block_t, block_t);
+void rewrite_node_page(struct f2fs_sb_info *, struct page *,
+ struct f2fs_summary *, block_t, block_t);
+void write_data_summaries(struct f2fs_sb_info *, block_t);
+void write_node_summaries(struct f2fs_sb_info *, block_t);
+int lookup_journal_in_cursum(struct f2fs_summary_block *,
+ int, unsigned int, int);
+void flush_sit_entries(struct f2fs_sb_info *);
+int build_segment_manager(struct f2fs_sb_info *);
+void reset_victim_segmap(struct f2fs_sb_info *);
+void destroy_segment_manager(struct f2fs_sb_info *);
+
+/*
+ * checkpoint.c
+ */
+struct page *grab_meta_page(struct f2fs_sb_info *, pgoff_t);
+struct page *get_meta_page(struct f2fs_sb_info *, pgoff_t);
+long sync_meta_pages(struct f2fs_sb_info *, enum page_type, long);
+int check_orphan_space(struct f2fs_sb_info *);
+void add_orphan_inode(struct f2fs_sb_info *, nid_t);
+void remove_orphan_inode(struct f2fs_sb_info *, nid_t);
+int recover_orphan_inodes(struct f2fs_sb_info *);
+int get_valid_checkpoint(struct f2fs_sb_info *);
+void set_dirty_dir_page(struct inode *, struct page *);
+void remove_dirty_dir_inode(struct inode *);
+void sync_dirty_dir_inodes(struct f2fs_sb_info *);
+void block_operations(struct f2fs_sb_info *);
+void write_checkpoint(struct f2fs_sb_info *, bool, bool);
+void init_orphan_info(struct f2fs_sb_info *);
+int create_checkpoint_caches(void);
+void destroy_checkpoint_caches(void);
+
+/*
+ * data.c
+ */
+int reserve_new_block(struct dnode_of_data *);
+void update_extent_cache(block_t, struct dnode_of_data *);
+struct page *find_data_page(struct inode *, pgoff_t);
+struct page *get_lock_data_page(struct inode *, pgoff_t);
+struct page *get_new_data_page(struct inode *, pgoff_t, bool);
+int f2fs_readpage(struct f2fs_sb_info *, struct page *, block_t, int);
+int do_write_data_page(struct page *);
+
+/*
+ * gc.c
+ */
+int start_gc_thread(struct f2fs_sb_info *);
+void stop_gc_thread(struct f2fs_sb_info *);
+block_t start_bidx_of_node(unsigned int);
+int f2fs_gc(struct f2fs_sb_info *, int);
+void build_gc_manager(struct f2fs_sb_info *);
+int create_gc_caches(void);
+void destroy_gc_caches(void);
+
+/*
+ * recovery.c
+ */
+void recover_fsync_data(struct f2fs_sb_info *);
+bool space_for_roll_forward(struct f2fs_sb_info *);
+
+/*
+ * debug.c
+ */
+#ifdef CONFIG_F2FS_STAT_FS
+struct f2fs_stat_info {
+ struct list_head stat_list;
+ struct f2fs_sb_info *sbi;
+ struct mutex stat_lock;
+ int all_area_segs, sit_area_segs, nat_area_segs, ssa_area_segs;
+ int main_area_segs, main_area_sections, main_area_zones;
+ int hit_ext, total_ext;
+ int ndirty_node, ndirty_dent, ndirty_dirs, ndirty_meta;
+ int nats, sits, fnids;
+ int total_count, utilization;
+ int bg_gc;
+ unsigned int valid_count, valid_node_count, valid_inode_count;
+ unsigned int bimodal, avg_vblocks;
+ int util_free, util_valid, util_invalid;
+ int rsvd_segs, overp_segs;
+ int dirty_count, node_pages, meta_pages;
+ int prefree_count, call_count;
+ int tot_segs, node_segs, data_segs, free_segs, free_secs;
+ int tot_blks, data_blks, node_blks;
+ int curseg[NR_CURSEG_TYPE];
+ int cursec[NR_CURSEG_TYPE];
+ int curzone[NR_CURSEG_TYPE];
+
+ unsigned int segment_count[2];
+ unsigned int block_count[2];
+ unsigned base_mem, cache_mem;
+};
+
+#define stat_inc_call_count(si) ((si)->call_count++)
+
+#define stat_inc_seg_count(sbi, type) \
+ do { \
+ struct f2fs_stat_info *si = sbi->stat_info; \
+ (si)->tot_segs++; \
+ if (type == SUM_TYPE_DATA) \
+ si->data_segs++; \
+ else \
+ si->node_segs++; \
+ } while (0)
+
+#define stat_inc_tot_blk_count(si, blks) \
+ (si->tot_blks += (blks))
+
+#define stat_inc_data_blk_count(sbi, blks) \
+ do { \
+ struct f2fs_stat_info *si = sbi->stat_info; \
+ stat_inc_tot_blk_count(si, blks); \
+ si->data_blks += (blks); \
+ } while (0)
+
+#define stat_inc_node_blk_count(sbi, blks) \
+ do { \
+ struct f2fs_stat_info *si = sbi->stat_info; \
+ stat_inc_tot_blk_count(si, blks); \
+ si->node_blks += (blks); \
+ } while (0)
+
+int f2fs_build_stats(struct f2fs_sb_info *);
+void f2fs_destroy_stats(struct f2fs_sb_info *);
+void destroy_root_stats(void);
+#else
+#define stat_inc_call_count(si)
+#define stat_inc_seg_count(si, type)
+#define stat_inc_tot_blk_count(si, blks)
+#define stat_inc_data_blk_count(si, blks)
+#define stat_inc_node_blk_count(sbi, blks)
+
+static inline int f2fs_build_stats(struct f2fs_sb_info *sbi) { return 0; }
+static inline void f2fs_destroy_stats(struct f2fs_sb_info *sbi) { }
+static inline void destroy_root_stats(void) { }
+#endif
+
+extern const struct file_operations f2fs_dir_operations;
+extern const struct file_operations f2fs_file_operations;
+extern const struct inode_operations f2fs_file_inode_operations;
+extern const struct address_space_operations f2fs_dblock_aops;
+extern const struct address_space_operations f2fs_node_aops;
+extern const struct address_space_operations f2fs_meta_aops;
+extern const struct inode_operations f2fs_dir_inode_operations;
+extern const struct inode_operations f2fs_symlink_inode_operations;
+extern const struct inode_operations f2fs_special_inode_operations;
+#endif
diff --git a/fs/f2fs/node.h b/fs/f2fs/node.h
new file mode 100644
index 0000000..5d525ed
--- /dev/null
+++ b/fs/f2fs/node.h
@@ -0,0 +1,353 @@
+/**
+ * fs/f2fs/node.h
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+/* start node id of a node block dedicated to the given node id */
+#define START_NID(nid) ((nid / NAT_ENTRY_PER_BLOCK) * NAT_ENTRY_PER_BLOCK)
+
+/* node block offset on the NAT area dedicated to the given start node id */
+#define NAT_BLOCK_OFFSET(start_nid) (start_nid / NAT_ENTRY_PER_BLOCK)
+
+/* # of pages to perform readahead before building free nids */
+#define FREE_NID_PAGES 4
+
+/* maximum # of free node ids to produce during build_free_nids */
+#define MAX_FREE_NIDS (NAT_ENTRY_PER_BLOCK * FREE_NID_PAGES)
+
+/* maximum readahead size for node during getting data blocks */
+#define MAX_RA_NODE 128
+
+/* maximum cached nat entries to manage memory footprint */
+#define NM_WOUT_THRESHOLD (64 * NAT_ENTRY_PER_BLOCK)
+
+/* vector size for gang look-up from nat cache that consists of radix tree */
+#define NATVEC_SIZE 64
+
+/*
+ * For node information
+ */
+struct node_info {
+ nid_t nid; /* node id */
+ nid_t ino; /* inode number of the node's owner */
+ block_t blk_addr; /* block address of the node */
+ unsigned char version; /* version of the node */
+};
+
+struct nat_entry {
+ struct list_head list; /* for clean or dirty nat list */
+ bool checkpointed; /* whether it is checkpointed or not */
+ struct node_info ni; /* in-memory node information */
+};
+
+#define nat_get_nid(nat) (nat->ni.nid)
+#define nat_set_nid(nat, n) (nat->ni.nid = n)
+#define nat_get_blkaddr(nat) (nat->ni.blk_addr)
+#define nat_set_blkaddr(nat, b) (nat->ni.blk_addr = b)
+#define nat_get_ino(nat) (nat->ni.ino)
+#define nat_set_ino(nat, i) (nat->ni.ino = i)
+#define nat_get_version(nat) (nat->ni.version)
+#define nat_set_version(nat, v) (nat->ni.version = v)
+
+#define __set_nat_cache_dirty(nm_i, ne) \
+ list_move_tail(&ne->list, &nm_i->dirty_nat_entries);
+#define __clear_nat_cache_dirty(nm_i, ne) \
+ list_move_tail(&ne->list, &nm_i->nat_entries);
+#define inc_node_version(version) (++version)
+
+static inline void node_info_from_raw_nat(struct node_info *ni,
+ struct f2fs_nat_entry *raw_ne)
+{
+ ni->ino = le32_to_cpu(raw_ne->ino);
+ ni->blk_addr = le32_to_cpu(raw_ne->block_addr);
+ ni->version = raw_ne->version;
+}
+
+/*
+ * For free nid mangement
+ */
+enum nid_state {
+ NID_NEW, /* newly added to free nid list */
+ NID_ALLOC /* it is allocated */
+};
+
+struct free_nid {
+ struct list_head list; /* for free node id list */
+ nid_t nid; /* node id */
+ int state; /* in use or not: NID_NEW or NID_ALLOC */
+};
+
+static inline int next_free_nid(struct f2fs_sb_info *sbi, nid_t *nid)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+ struct free_nid *fnid;
+
+ if (nm_i->fcnt <= 0)
+ return -1;
+ spin_lock(&nm_i->free_nid_list_lock);
+ fnid = list_entry(nm_i->free_nid_list.next, struct free_nid, list);
+ *nid = fnid->nid;
+ spin_unlock(&nm_i->free_nid_list_lock);
+ return 0;
+}
+
+/*
+ * inline functions
+ */
+static inline void get_nat_bitmap(struct f2fs_sb_info *sbi, void *addr)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+ memcpy(addr, nm_i->nat_bitmap, nm_i->bitmap_size);
+}
+
+static inline pgoff_t current_nat_addr(struct f2fs_sb_info *sbi, nid_t start)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+ pgoff_t block_off;
+ pgoff_t block_addr;
+ int seg_off;
+
+ block_off = NAT_BLOCK_OFFSET(start);
+ seg_off = block_off >> sbi->log_blocks_per_seg;
+
+ block_addr = (pgoff_t)(nm_i->nat_blkaddr +
+ (seg_off << sbi->log_blocks_per_seg << 1) +
+ (block_off & ((1 << sbi->log_blocks_per_seg) - 1)));
+
+ if (f2fs_test_bit(block_off, nm_i->nat_bitmap))
+ block_addr += sbi->blocks_per_seg;
+
+ return block_addr;
+}
+
+static inline pgoff_t next_nat_addr(struct f2fs_sb_info *sbi,
+ pgoff_t block_addr)
+{
+ struct f2fs_nm_info *nm_i = NM_I(sbi);
+
+ block_addr -= nm_i->nat_blkaddr;
+ if ((block_addr >> sbi->log_blocks_per_seg) % 2)
+ block_addr -= sbi->blocks_per_seg;
+ else
+ block_addr += sbi->blocks_per_seg;
+
+ return block_addr + nm_i->nat_blkaddr;
+}
+
+static inline void set_to_next_nat(struct f2fs_nm_info *nm_i, nid_t start_nid)
+{
+ unsigned int block_off = NAT_BLOCK_OFFSET(start_nid);
+
+ if (f2fs_test_bit(block_off, nm_i->nat_bitmap))
+ f2fs_clear_bit(block_off, nm_i->nat_bitmap);
+ else
+ f2fs_set_bit(block_off, nm_i->nat_bitmap);
+}
+
+static inline void fill_node_footer(struct page *page, nid_t nid,
+ nid_t ino, unsigned int ofs, bool reset)
+{
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ if (reset)
+ memset(rn, 0, sizeof(*rn));
+ rn->footer.nid = cpu_to_le32(nid);
+ rn->footer.ino = cpu_to_le32(ino);
+ rn->footer.flag = cpu_to_le32(ofs << OFFSET_BIT_SHIFT);
+}
+
+static inline void copy_node_footer(struct page *dst, struct page *src)
+{
+ void *src_addr = page_address(src);
+ void *dst_addr = page_address(dst);
+ struct f2fs_node *src_rn = (struct f2fs_node *)src_addr;
+ struct f2fs_node *dst_rn = (struct f2fs_node *)dst_addr;
+ memcpy(&dst_rn->footer, &src_rn->footer, sizeof(struct node_footer));
+}
+
+static inline void fill_node_footer_blkaddr(struct page *page, block_t blkaddr)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(page->mapping->host->i_sb);
+ struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ rn->footer.cp_ver = ckpt->checkpoint_ver;
+ rn->footer.next_blkaddr = blkaddr;
+}
+
+static inline nid_t ino_of_node(struct page *node_page)
+{
+ void *kaddr = page_address(node_page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ return le32_to_cpu(rn->footer.ino);
+}
+
+static inline nid_t nid_of_node(struct page *node_page)
+{
+ void *kaddr = page_address(node_page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ return le32_to_cpu(rn->footer.nid);
+}
+
+static inline unsigned int ofs_of_node(struct page *node_page)
+{
+ void *kaddr = page_address(node_page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ unsigned flag = le32_to_cpu(rn->footer.flag);
+ return flag >> OFFSET_BIT_SHIFT;
+}
+
+static inline unsigned long long cpver_of_node(struct page *node_page)
+{
+ void *kaddr = page_address(node_page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ return le64_to_cpu(rn->footer.cp_ver);
+}
+
+static inline block_t next_blkaddr_of_node(struct page *node_page)
+{
+ void *kaddr = page_address(node_page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ return le32_to_cpu(rn->footer.next_blkaddr);
+}
+
+/*
+ * f2fs assigns the following node offsets described as (num).
+ * N = NIDS_PER_BLOCK
+ *
+ * Inode block (0)
+ * |- direct node (1)
+ * |- direct node (2)
+ * |- indirect node (3)
+ * | `- direct node (4 => 4 + N - 1)
+ * |- indirect node (4 + N)
+ * | `- direct node (5 + N => 5 + 2N - 1)
+ * `- double indirect node (5 + 2N)
+ * `- indirect node (6 + 2N)
+ * `- direct node (x(N + 1))
+ */
+static inline bool IS_DNODE(struct page *node_page)
+{
+ unsigned int ofs = ofs_of_node(node_page);
+ if (ofs == 3 || ofs == 4 + NIDS_PER_BLOCK ||
+ ofs == 5 + 2 * NIDS_PER_BLOCK)
+ return false;
+ if (ofs >= 6 + 2 * NIDS_PER_BLOCK) {
+ ofs -= 6 + 2 * NIDS_PER_BLOCK;
+ if ((long int)ofs % (NIDS_PER_BLOCK + 1))
+ return false;
+ }
+ return true;
+}
+
+static inline void set_nid(struct page *p, int off, nid_t nid, bool i)
+{
+ struct f2fs_node *rn = (struct f2fs_node *)page_address(p);
+
+ wait_on_page_writeback(p);
+
+ if (i)
+ rn->i.i_nid[off - NODE_DIR1_BLOCK] = cpu_to_le32(nid);
+ else
+ rn->in.nid[off] = cpu_to_le32(nid);
+ set_page_dirty(p);
+}
+
+static inline nid_t get_nid(struct page *p, int off, bool i)
+{
+ struct f2fs_node *rn = (struct f2fs_node *)page_address(p);
+ if (i)
+ return le32_to_cpu(rn->i.i_nid[off - NODE_DIR1_BLOCK]);
+ return le32_to_cpu(rn->in.nid[off]);
+}
+
+/*
+ * Coldness identification:
+ * - Mark cold files in f2fs_inode_info
+ * - Mark cold node blocks in their node footer
+ * - Mark cold data pages in page cache
+ */
+static inline int is_cold_file(struct inode *inode)
+{
+ return F2FS_I(inode)->i_advise & FADVISE_COLD_BIT;
+}
+
+static inline int is_cold_data(struct page *page)
+{
+ return PageChecked(page);
+}
+
+static inline void set_cold_data(struct page *page)
+{
+ SetPageChecked(page);
+}
+
+static inline void clear_cold_data(struct page *page)
+{
+ ClearPageChecked(page);
+}
+
+static inline int is_cold_node(struct page *page)
+{
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ unsigned int flag = le32_to_cpu(rn->footer.flag);
+ return flag & (0x1 << COLD_BIT_SHIFT);
+}
+
+static inline unsigned char is_fsync_dnode(struct page *page)
+{
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ unsigned int flag = le32_to_cpu(rn->footer.flag);
+ return flag & (0x1 << FSYNC_BIT_SHIFT);
+}
+
+static inline unsigned char is_dent_dnode(struct page *page)
+{
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ unsigned int flag = le32_to_cpu(rn->footer.flag);
+ return flag & (0x1 << DENT_BIT_SHIFT);
+}
+
+static inline void set_cold_node(struct inode *inode, struct page *page)
+{
+ struct f2fs_node *rn = (struct f2fs_node *)page_address(page);
+ unsigned int flag = le32_to_cpu(rn->footer.flag);
+
+ if (S_ISDIR(inode->i_mode))
+ flag &= ~(0x1 << COLD_BIT_SHIFT);
+ else
+ flag |= (0x1 << COLD_BIT_SHIFT);
+ rn->footer.flag = cpu_to_le32(flag);
+}
+
+static inline void set_fsync_mark(struct page *page, int mark)
+{
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ unsigned int flag = le32_to_cpu(rn->footer.flag);
+ if (mark)
+ flag |= (0x1 << FSYNC_BIT_SHIFT);
+ else
+ flag &= ~(0x1 << FSYNC_BIT_SHIFT);
+ rn->footer.flag = cpu_to_le32(flag);
+}
+
+static inline void set_dentry_mark(struct page *page, int mark)
+{
+ void *kaddr = page_address(page);
+ struct f2fs_node *rn = (struct f2fs_node *)kaddr;
+ unsigned int flag = le32_to_cpu(rn->footer.flag);
+ if (mark)
+ flag |= (0x1 << DENT_BIT_SHIFT);
+ else
+ flag &= ~(0x1 << DENT_BIT_SHIFT);
+ rn->footer.flag = cpu_to_le32(flag);
+}
diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h
new file mode 100644
index 0000000..e380a8e
--- /dev/null
+++ b/fs/f2fs/segment.h
@@ -0,0 +1,615 @@
+/**
+ * fs/f2fs/segment.h
+ *
+ * Copyright (c) 2012 Samsung Electronics Co., Ltd.
+ * http://www.samsung.com/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+/* constant macro */
+#define NULL_SEGNO ((unsigned int)(~0))
+
+/* V: Logical segment # in volume, R: Relative segment # in main area */
+#define GET_L2R_SEGNO(free_i, segno) (segno - free_i->start_segno)
+#define GET_R2L_SEGNO(free_i, segno) (segno + free_i->start_segno)
+
+#define IS_DATASEG(t) \
+ ((t == CURSEG_HOT_DATA) || (t == CURSEG_COLD_DATA) || \
+ (t == CURSEG_WARM_DATA))
+
+#define IS_NODESEG(t) \
+ ((t == CURSEG_HOT_NODE) || (t == CURSEG_COLD_NODE) || \
+ (t == CURSEG_WARM_NODE))
+
+#define IS_CURSEG(sbi, segno) \
+ ((segno == CURSEG_I(sbi, CURSEG_HOT_DATA)->segno) || \
+ (segno == CURSEG_I(sbi, CURSEG_WARM_DATA)->segno) || \
+ (segno == CURSEG_I(sbi, CURSEG_COLD_DATA)->segno) || \
+ (segno == CURSEG_I(sbi, CURSEG_HOT_NODE)->segno) || \
+ (segno == CURSEG_I(sbi, CURSEG_WARM_NODE)->segno) || \
+ (segno == CURSEG_I(sbi, CURSEG_COLD_NODE)->segno))
+
+#define IS_CURSEC(sbi, secno) \
+ ((secno == CURSEG_I(sbi, CURSEG_HOT_DATA)->segno / \
+ sbi->segs_per_sec) || \
+ (secno == CURSEG_I(sbi, CURSEG_WARM_DATA)->segno / \
+ sbi->segs_per_sec) || \
+ (secno == CURSEG_I(sbi, CURSEG_COLD_DATA)->segno / \
+ sbi->segs_per_sec) || \
+ (secno == CURSEG_I(sbi, CURSEG_HOT_NODE)->segno / \
+ sbi->segs_per_sec) || \
+ (secno == CURSEG_I(sbi, CURSEG_WARM_NODE)->segno / \
+ sbi->segs_per_sec) || \
+ (secno == CURSEG_I(sbi, CURSEG_COLD_NODE)->segno / \
+ sbi->segs_per_sec)) \
+
+#define START_BLOCK(sbi, segno) \
+ (SM_I(sbi)->seg0_blkaddr + \
+ (GET_R2L_SEGNO(FREE_I(sbi), segno) << sbi->log_blocks_per_seg))
+#define NEXT_FREE_BLKADDR(sbi, curseg) \
+ (START_BLOCK(sbi, curseg->segno) + curseg->next_blkoff)
+
+#define MAIN_BASE_BLOCK(sbi) (SM_I(sbi)->main_blkaddr)
+
+#define GET_SEGOFF_FROM_SEG0(sbi, blk_addr) \
+ ((blk_addr) - SM_I(sbi)->seg0_blkaddr)
+#define GET_SEGNO_FROM_SEG0(sbi, blk_addr) \
+ (GET_SEGOFF_FROM_SEG0(sbi, blk_addr) >> sbi->log_blocks_per_seg)
+#define GET_SEGNO(sbi, blk_addr) \
+ (((blk_addr == NULL_ADDR) || (blk_addr == NEW_ADDR)) ? \
+ NULL_SEGNO : GET_L2R_SEGNO(FREE_I(sbi), \
+ GET_SEGNO_FROM_SEG0(sbi, blk_addr)))
+#define GET_SECNO(sbi, segno) \
+ ((segno) / sbi->segs_per_sec)
+#define GET_ZONENO_FROM_SEGNO(sbi, segno) \
+ ((segno / sbi->segs_per_sec) / sbi->secs_per_zone)
+
+#define GET_SUM_BLOCK(sbi, segno) \
+ ((sbi->sm_info->ssa_blkaddr) + segno)
+
+#define GET_SUM_TYPE(footer) ((footer)->entry_type)
+#define SET_SUM_TYPE(footer, type) ((footer)->entry_type = type)
+
+#define SIT_ENTRY_OFFSET(sit_i, segno) \
+ (segno % sit_i->sents_per_block)
+#define SIT_BLOCK_OFFSET(sit_i, segno) \
+ (segno / SIT_ENTRY_PER_BLOCK)
+#define START_SEGNO(sit_i, segno) \
+ (SIT_BLOCK_OFFSET(sit_i, segno) * SIT_ENTRY_PER_BLOCK)
+#define f2fs_bitmap_size(nr) \
+ (BITS_TO_LONGS(nr) * sizeof(unsigned long))
+#define TOTAL_SEGS(sbi) (SM_I(sbi)->main_segments)
+
+/* during checkpoint, bio_private is used to synchronize the last bio */
+struct bio_private {
+ struct f2fs_sb_info *sbi;
+ bool is_sync;
+ void *wait;
+};
+
+/*
+ * indicate a block allocation direction: RIGHT and LEFT.
+ * RIGHT means allocating new sections towards the end of volume.
+ * LEFT means the opposite direction.
+ */
+enum {
+ ALLOC_RIGHT = 0,
+ ALLOC_LEFT
+};
+
+/*
+ * In the victim_sel_policy->alloc_mode, there are two block allocation modes.
+ * LFS writes data sequentially with cleaning operations.
+ * SSR (Slack Space Recycle) reuses obsolete space without cleaning operations.
+ */
+enum {
+ LFS = 0,
+ SSR
+};
+
+/*
+ * In the victim_sel_policy->gc_mode, there are two gc, aka cleaning, modes.
+ * GC_CB is based on cost-benefit algorithm.
+ * GC_GREEDY is based on greedy algorithm.
+ */
+enum {
+ GC_CB = 0,
+ GC_GREEDY
+};
+
+/*
+ * BG_GC means the background cleaning job.
+ * FG_GC means the on-demand cleaning job.
+ */
+enum {
+ BG_GC = 0,
+ FG_GC
+};
+
+/* for a function parameter to select a victim segment */
+struct victim_sel_policy {
+ int alloc_mode; /* LFS or SSR */
+ int gc_mode; /* GC_CB or GC_GREEDY */
+ unsigned long *dirty_segmap; /* dirty segment bitmap */
+ unsigned int offset; /* last scanned bitmap offset */
+ unsigned int ofs_unit; /* bitmap search unit */
+ unsigned int min_cost; /* minimum cost */
+ unsigned int min_segno; /* segment # having min. cost */
+};
+
+struct seg_entry {
+ unsigned short valid_blocks; /* # of valid blocks */
+ unsigned char *cur_valid_map; /* validity bitmap of blocks */
+ /*
+ * # of valid blocks and the validity bitmap stored in the the last
+ * checkpoint pack. This information is used by the SSR mode.
+ */
+ unsigned short ckpt_valid_blocks;
+ unsigned char *ckpt_valid_map;
+ unsigned char type; /* segment type like CURSEG_XXX_TYPE */
+ unsigned long long mtime; /* modification time of the segment */
+};
+
+struct sec_entry {
+ unsigned int valid_blocks; /* # of valid blocks in a section */
+};
+
+struct segment_allocation {
+ void (*allocate_segment)(struct f2fs_sb_info *, int, bool);
+};
+
+struct sit_info {
+ const struct segment_allocation *s_ops;
+
+ block_t sit_base_addr; /* start block address of SIT area */
+ block_t sit_blocks; /* # of blocks used by SIT area */
+ block_t written_valid_blocks; /* # of valid blocks in main area */
+ char *sit_bitmap; /* SIT bitmap pointer */
+ unsigned int bitmap_size; /* SIT bitmap size */
+
+ unsigned long *dirty_sentries_bitmap; /* bitmap for dirty sentries */
+ unsigned int dirty_sentries; /* # of dirty sentries */
+ unsigned int sents_per_block; /* # of SIT entries per block */
+ struct mutex sentry_lock; /* to protect SIT cache */
+ struct seg_entry *sentries; /* SIT segment-level cache */
+ struct sec_entry *sec_entries; /* SIT section-level cache */
+
+ /* for cost-benefit algorithm in cleaning procedure */
+ unsigned long long elapsed_time; /* elapsed time after mount */
+ unsigned long long mounted_time; /* mount time */
+ unsigned long long min_mtime; /* min. modification time */
+ unsigned long long max_mtime; /* max. modification time */
+};
+
+struct free_segmap_info {
+ unsigned int start_segno; /* start segment number logically */
+ unsigned int free_segments; /* # of free segments */
+ unsigned int free_sections; /* # of free sections */
+ rwlock_t segmap_lock; /* free segmap lock */
+ unsigned long *free_segmap; /* free segment bitmap */
+ unsigned long *free_secmap; /* free section bitmap */
+};
+
+/* Notice: The order of dirty type is same with CURSEG_XXX in f2fs.h */
+enum dirty_type {
+ DIRTY_HOT_DATA, /* dirty segments assigned as hot data logs */
+ DIRTY_WARM_DATA, /* dirty segments assigned as warm data logs */
+ DIRTY_COLD_DATA, /* dirty segments assigned as cold data logs */
+ DIRTY_HOT_NODE, /* dirty segments assigned as hot node logs */
+ DIRTY_WARM_NODE, /* dirty segments assigned as warm node logs */
+ DIRTY_COLD_NODE, /* dirty segments assigned as cold node logs */
+ DIRTY, /* to count # of dirty segments */
+ PRE, /* to count # of entirely obsolete segments */
+ NR_DIRTY_TYPE
+};
+
+struct dirty_seglist_info {
+ const struct victim_selection *v_ops; /* victim selction operation */
+ unsigned long *dirty_segmap[NR_DIRTY_TYPE];
+ struct mutex seglist_lock; /* lock for segment bitmaps */
+ int nr_dirty[NR_DIRTY_TYPE]; /* # of dirty segments */
+ unsigned long *victim_segmap[2]; /* BG_GC, FG_GC */
+};
+
+/* victim selection function for cleaning and SSR */
+struct victim_selection {
+ int (*get_victim)(struct f2fs_sb_info *, unsigned int *,
+ int, int, char);
+};
+
+/* for active log information */
+struct curseg_info {
+ struct mutex curseg_mutex; /* lock for consistency */
+ struct f2fs_summary_block *sum_blk; /* cached summary block */
+ unsigned char alloc_type; /* current allocation type */
+ unsigned int segno; /* current segment number */
+ unsigned short next_blkoff; /* next block offset to write */
+ unsigned int zone; /* current zone number */
+ unsigned int next_segno; /* preallocated segment */
+};
+
+/*
+ * inline functions
+ */
+static inline struct curseg_info *CURSEG_I(struct f2fs_sb_info *sbi, int type)
+{
+ return (struct curseg_info *)(SM_I(sbi)->curseg_array + type);
+}
+
+static inline struct seg_entry *get_seg_entry(struct f2fs_sb_info *sbi,
+ unsigned int segno)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ return &sit_i->sentries[segno];
+}
+
+static inline struct sec_entry *get_sec_entry(struct f2fs_sb_info *sbi,
+ unsigned int segno)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ return &sit_i->sec_entries[GET_SECNO(sbi, segno)];
+}
+
+static inline unsigned int get_valid_blocks(struct f2fs_sb_info *sbi,
+ unsigned int segno, int section)
+{
+ /*
+ * In order to get # of valid blocks in a section instantly from many
+ * segments, f2fs manages two counting structures separately.
+ */
+ if (section > 1)
+ return get_sec_entry(sbi, segno)->valid_blocks;
+ else
+ return get_seg_entry(sbi, segno)->valid_blocks;
+}
+
+static inline void seg_info_from_raw_sit(struct seg_entry *se,
+ struct f2fs_sit_entry *rs)
+{
+ se->valid_blocks = GET_SIT_VBLOCKS(rs);
+ se->ckpt_valid_blocks = GET_SIT_VBLOCKS(rs);
+ memcpy(se->cur_valid_map, rs->valid_map, SIT_VBLOCK_MAP_SIZE);
+ memcpy(se->ckpt_valid_map, rs->valid_map, SIT_VBLOCK_MAP_SIZE);
+ se->type = GET_SIT_TYPE(rs);
+ se->mtime = le64_to_cpu(rs->mtime);
+}
+
+static inline void seg_info_to_raw_sit(struct seg_entry *se,
+ struct f2fs_sit_entry *rs)
+{
+ unsigned short raw_vblocks = (se->type << SIT_VBLOCKS_SHIFT) |
+ se->valid_blocks;
+ rs->vblocks = cpu_to_le16(raw_vblocks);
+ memcpy(rs->valid_map, se->cur_valid_map, SIT_VBLOCK_MAP_SIZE);
+ memcpy(se->ckpt_valid_map, rs->valid_map, SIT_VBLOCK_MAP_SIZE);
+ se->ckpt_valid_blocks = se->valid_blocks;
+ rs->mtime = cpu_to_le64(se->mtime);
+}
+
+static inline unsigned int find_next_inuse(struct free_segmap_info *free_i,
+ unsigned int max, unsigned int segno)
+{
+ unsigned int ret;
+ read_lock(&free_i->segmap_lock);
+ ret = find_next_bit(free_i->free_segmap, max, segno);
+ read_unlock(&free_i->segmap_lock);
+ return ret;
+}
+
+static inline void __set_free(struct f2fs_sb_info *sbi, unsigned int segno)
+{
+ struct free_segmap_info *free_i = FREE_I(sbi);
+ unsigned int secno = segno / sbi->segs_per_sec;
+ unsigned int start_segno = secno * sbi->segs_per_sec;
+ unsigned int next;
+
+ write_lock(&free_i->segmap_lock);
+ clear_bit(segno, free_i->free_segmap);
+ free_i->free_segments++;
+
+ next = find_next_bit(free_i->free_segmap, TOTAL_SEGS(sbi), start_segno);
+ if (next >= start_segno + sbi->segs_per_sec) {
+ clear_bit(secno, free_i->free_secmap);
+ free_i->free_sections++;
+ }
+ write_unlock(&free_i->segmap_lock);
+}
+
+static inline void __set_inuse(struct f2fs_sb_info *sbi,
+ unsigned int segno)
+{
+ struct free_segmap_info *free_i = FREE_I(sbi);
+ unsigned int secno = segno / sbi->segs_per_sec;
+ set_bit(segno, free_i->free_segmap);
+ free_i->free_segments--;
+ if (!test_and_set_bit(secno, free_i->free_secmap))
+ free_i->free_sections--;
+}
+
+static inline void __set_test_and_free(struct f2fs_sb_info *sbi,
+ unsigned int segno)
+{
+ struct free_segmap_info *free_i = FREE_I(sbi);
+ unsigned int secno = segno / sbi->segs_per_sec;
+ unsigned int start_segno = secno * sbi->segs_per_sec;
+ unsigned int next;
+
+ write_lock(&free_i->segmap_lock);
+ if (test_and_clear_bit(segno, free_i->free_segmap)) {
+ free_i->free_segments++;
+
+ next = find_next_bit(free_i->free_segmap, TOTAL_SEGS(sbi),
+ start_segno);
+ if (next >= start_segno + sbi->segs_per_sec) {
+ if (test_and_clear_bit(secno, free_i->free_secmap))
+ free_i->free_sections++;
+ }
+ }
+ write_unlock(&free_i->segmap_lock);
+}
+
+static inline void __set_test_and_inuse(struct f2fs_sb_info *sbi,
+ unsigned int segno)
+{
+ struct free_segmap_info *free_i = FREE_I(sbi);
+ unsigned int secno = segno / sbi->segs_per_sec;
+ write_lock(&free_i->segmap_lock);
+ if (!test_and_set_bit(segno, free_i->free_segmap)) {
+ free_i->free_segments--;
+ if (!test_and_set_bit(secno, free_i->free_secmap))
+ free_i->free_sections--;
+ }
+ write_unlock(&free_i->segmap_lock);
+}
+
+static inline void get_sit_bitmap(struct f2fs_sb_info *sbi,
+ void *dst_addr)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ memcpy(dst_addr, sit_i->sit_bitmap, sit_i->bitmap_size);
+}
+
+static inline block_t written_block_count(struct f2fs_sb_info *sbi)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ block_t vblocks;
+
+ mutex_lock(&sit_i->sentry_lock);
+ vblocks = sit_i->written_valid_blocks;
+ mutex_unlock(&sit_i->sentry_lock);
+
+ return vblocks;
+}
+
+static inline unsigned int free_segments(struct f2fs_sb_info *sbi)
+{
+ struct free_segmap_info *free_i = FREE_I(sbi);
+ unsigned int free_segs;
+
+ read_lock(&free_i->segmap_lock);
+ free_segs = free_i->free_segments;
+ read_unlock(&free_i->segmap_lock);
+
+ return free_segs;
+}
+
+static inline int reserved_segments(struct f2fs_sb_info *sbi)
+{
+ return SM_I(sbi)->reserved_segments;
+}
+
+static inline unsigned int free_sections(struct f2fs_sb_info *sbi)
+{
+ struct free_segmap_info *free_i = FREE_I(sbi);
+ unsigned int free_secs;
+
+ read_lock(&free_i->segmap_lock);
+ free_secs = free_i->free_sections;
+ read_unlock(&free_i->segmap_lock);
+
+ return free_secs;
+}
+
+static inline unsigned int prefree_segments(struct f2fs_sb_info *sbi)
+{
+ return DIRTY_I(sbi)->nr_dirty[PRE];
+}
+
+static inline unsigned int dirty_segments(struct f2fs_sb_info *sbi)
+{
+ return DIRTY_I(sbi)->nr_dirty[DIRTY_HOT_DATA] +
+ DIRTY_I(sbi)->nr_dirty[DIRTY_WARM_DATA] +
+ DIRTY_I(sbi)->nr_dirty[DIRTY_COLD_DATA] +
+ DIRTY_I(sbi)->nr_dirty[DIRTY_HOT_NODE] +
+ DIRTY_I(sbi)->nr_dirty[DIRTY_WARM_NODE] +
+ DIRTY_I(sbi)->nr_dirty[DIRTY_COLD_NODE];
+}
+
+static inline int overprovision_segments(struct f2fs_sb_info *sbi)
+{
+ return SM_I(sbi)->ovp_segments;
+}
+
+static inline int overprovision_sections(struct f2fs_sb_info *sbi)
+{
+ return ((unsigned int) overprovision_segments(sbi)) / sbi->segs_per_sec;
+}
+
+static inline int reserved_sections(struct f2fs_sb_info *sbi)
+{
+ return ((unsigned int) reserved_segments(sbi)) / sbi->segs_per_sec;
+}
+
+static inline bool need_SSR(struct f2fs_sb_info *sbi)
+{
+ return (free_sections(sbi) < overprovision_sections(sbi));
+}
+
+static inline int get_ssr_segment(struct f2fs_sb_info *sbi, int type)
+{
+ struct curseg_info *curseg = CURSEG_I(sbi, type);
+ return DIRTY_I(sbi)->v_ops->get_victim(sbi,
+ &(curseg)->next_segno, BG_GC, type, SSR);
+}
+
+static inline bool has_not_enough_free_secs(struct f2fs_sb_info *sbi)
+{
+ return free_sections(sbi) <= reserved_sections(sbi);
+}
+
+static inline int utilization(struct f2fs_sb_info *sbi)
+{
+ return (long int)valid_user_blocks(sbi) * 100 /
+ (long int)sbi->user_block_count;
+}
+
+/*
+ * Sometimes f2fs may be better to drop out-of-place update policy.
+ * So, if fs utilization is over MIN_IPU_UTIL, then f2fs tries to write
+ * data in the original place likewise other traditional file systems.
+ * But, currently set 100 in percentage, which means it is disabled.
+ * See below need_inplace_update().
+ */
+#define MIN_IPU_UTIL 100
+static inline bool need_inplace_update(struct inode *inode)
+{
+ struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
+ if (S_ISDIR(inode->i_mode))
+ return false;
+ if (need_SSR(sbi) && utilization(sbi) > MIN_IPU_UTIL)
+ return true;
+ return false;
+}
+
+static inline unsigned int curseg_segno(struct f2fs_sb_info *sbi,
+ int type)
+{
+ struct curseg_info *curseg = CURSEG_I(sbi, type);
+ return curseg->segno;
+}
+
+static inline unsigned char curseg_alloc_type(struct f2fs_sb_info *sbi,
+ int type)
+{
+ struct curseg_info *curseg = CURSEG_I(sbi, type);
+ return curseg->alloc_type;
+}
+
+static inline unsigned short curseg_blkoff(struct f2fs_sb_info *sbi, int type)
+{
+ struct curseg_info *curseg = CURSEG_I(sbi, type);
+ return curseg->next_blkoff;
+}
+
+static inline void check_seg_range(struct f2fs_sb_info *sbi, unsigned int segno)
+{
+ unsigned int end_segno = SM_I(sbi)->segment_count - 1;
+ BUG_ON(segno > end_segno);
+}
+
+/*
+ * This function is used for only debugging.
+ * NOTE: In future, we have to remove this function.
+ */
+static inline void verify_block_addr(struct f2fs_sb_info *sbi, block_t blk_addr)
+{
+ struct f2fs_sm_info *sm_info = SM_I(sbi);
+ block_t total_blks = sm_info->segment_count << sbi->log_blocks_per_seg;
+ block_t start_addr = sm_info->seg0_blkaddr;
+ block_t end_addr = start_addr + total_blks - 1;
+ BUG_ON(blk_addr < start_addr);
+ BUG_ON(blk_addr > end_addr);
+}
+
+/*
+ * Summary block is always treated as invalid block
+ */
+static inline void check_block_count(struct f2fs_sb_info *sbi,
+ int segno, struct f2fs_sit_entry *raw_sit)
+{
+ struct f2fs_sm_info *sm_info = SM_I(sbi);
+ unsigned int end_segno = sm_info->segment_count - 1;
+ int valid_blocks = 0;
+ int i;
+
+ /* check segment usage */
+ BUG_ON(GET_SIT_VBLOCKS(raw_sit) > sbi->blocks_per_seg);
+
+ /* check boundary of a given segment number */
+ BUG_ON(segno > end_segno);
+
+ /* check bitmap with valid block count */
+ for (i = 0; i < sbi->blocks_per_seg; i++)
+ if (f2fs_test_bit(i, raw_sit->valid_map))
+ valid_blocks++;
+ BUG_ON(GET_SIT_VBLOCKS(raw_sit) != valid_blocks);
+}
+
+static inline pgoff_t current_sit_addr(struct f2fs_sb_info *sbi,
+ unsigned int start)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ unsigned int offset = SIT_BLOCK_OFFSET(sit_i, start);
+ block_t blk_addr = sit_i->sit_base_addr + offset;
+
+ check_seg_range(sbi, start);
+
+ /* calculate sit block address */
+ if (f2fs_test_bit(offset, sit_i->sit_bitmap))
+ blk_addr += sit_i->sit_blocks;
+
+ return blk_addr;
+}
+
+static inline pgoff_t next_sit_addr(struct f2fs_sb_info *sbi,
+ pgoff_t block_addr)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ block_addr -= sit_i->sit_base_addr;
+ if (block_addr < sit_i->sit_blocks)
+ block_addr += sit_i->sit_blocks;
+ else
+ block_addr -= sit_i->sit_blocks;
+
+ return block_addr + sit_i->sit_base_addr;
+}
+
+static inline void set_to_next_sit(struct sit_info *sit_i, unsigned int start)
+{
+ unsigned int block_off = SIT_BLOCK_OFFSET(sit_i, start);
+
+ if (f2fs_test_bit(block_off, sit_i->sit_bitmap))
+ f2fs_clear_bit(block_off, sit_i->sit_bitmap);
+ else
+ f2fs_set_bit(block_off, sit_i->sit_bitmap);
+}
+
+static inline unsigned long long get_mtime(struct f2fs_sb_info *sbi)
+{
+ struct sit_info *sit_i = SIT_I(sbi);
+ return sit_i->elapsed_time + CURRENT_TIME_SEC.tv_sec -
+ sit_i->mounted_time;
+}
+
+static inline void set_summary(struct f2fs_summary *sum, nid_t nid,
+ unsigned int ofs_in_node, unsigned char version)
+{
+ sum->nid = cpu_to_le32(nid);
+ sum->ofs_in_node = cpu_to_le16(ofs_in_node);
+ sum->version = version;
+}
+
+static inline block_t start_sum_block(struct f2fs_sb_info *sbi)
+{
+ return __start_cp_addr(sbi) +
+ le32_to_cpu(F2FS_CKPT(sbi)->cp_pack_start_sum);
+}
+
+static inline block_t sum_blk_addr(struct f2fs_sb_info *sbi, int base, int type)
+{
+ return __start_cp_addr(sbi) +
+ le32_to_cpu(F2FS_CKPT(sbi)->cp_pack_total_block_count)
+ - (base + 1) + type;
+}
--
1.7.9.5

---
Jaegeuk Kim
Samsung

2012-11-02 13:39:12

by Martin Steigerwald

[permalink] [raw]

Subject: Re: [PATCH 00/16 v3] f2fs: introduce flash-friendly file system

Am Mittwoch, 31. Oktober 2012 schrieb Jaegeuk Kim:
> Change log from v2:
>
> o Fix compilation error for arm [Max]
> o Move proc entries to debugfs [Greg]
> o Add i_atime, i_generation, etc [Neil]
> o Support NFS export [Changman]
> o Move the f2fs magic number [Marco]
> o Add s_time_gran [Marco]
> o Fix f2fs_truncate [Marco]
> o Enhance f2fs document [Vyacheslav]
> o Support uuid and add additional comments [Vyacheslav]
> o Change superblock offset [Vyacheslav]
> o Fix some bugs and return values [Vyacheslav]
> o Improve initial mount time [Jaegeuk]
> o Fix f2fs-tools environment [Mike]
>
> I've set up a git tree in sf.net for f2fs-tools.
> Please download f2fs-tools by tag: for_patch_set_v3.

I like to try this out.

Do you have kernel code available in some git repo? I didn´t find anything
on git.kernel.org? Otherwise I take the patches from your mails.

Tools cloned nicely.

Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7

2012-11-02 22:49:15

[permalink] [raw]

Subject: Re: [PATCH 00/16 v3] f2fs: introduce flash-friendly file system

2012/11/2 Martin Steigerwald <[email protected]>
>
> Am Mittwoch, 31. Oktober 2012 schrieb Jaegeuk Kim:
> > Change log from v2:
> >
> > o Fix compilation error for arm [Max]
> > o Move proc entries to debugfs [Greg]
> > o Add i_atime, i_generation, etc [Neil]
> > o Support NFS export [Changman]
> > o Move the f2fs magic number [Marco]
> > o Add s_time_gran [Marco]
> > o Fix f2fs_truncate [Marco]
> > o Enhance f2fs document [Vyacheslav]
> > o Support uuid and add additional comments [Vyacheslav]
> > o Change superblock offset [Vyacheslav]
> > o Fix some bugs and return values [Vyacheslav]
> > o Improve initial mount time [Jaegeuk]
> > o Fix f2fs-tools environment [Mike]
> >
> > I've set up a git tree in sf.net for f2fs-tools.
> > Please download f2fs-tools by tag: for_patch_set_v3.
>
> I like to try this out.
>
> Do you have kernel code available in some git repo? I didn?t find anything
> on git.kernel.org? Otherwise I take the patches from your mails.
>

At this moment, there is no f2fs-repo yet in public.
If f2fs is merged, I'd like to make a dev tree somewhere.
I really want to make the dev tree in kernel.org later, but it totally depends
on the maintainers' decision.
Before then, I apologize to retrieve f2fs sources from LKML.
Thanks,

--
Jaegeuk Kim
Samsung

> Tools cloned nicely.
>
> Thanks,
> --
> Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
> GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2012-11-10 18:33:46

by Martin Steigerwald

[permalink] [raw]

Subject: Re: [PATCH 00/16 v3] f2fs: introduce flash-friendly file system

Am Freitag, 2. November 2012 schrieb Kim Jaegeuk:
> 2012/11/2 Martin Steigerwald <[email protected]>
>
> > Am Mittwoch, 31. Oktober 2012 schrieb Jaegeuk Kim:
> > > Change log from v2:
> > > o Fix compilation error for arm [Max]
> > > o Move proc entries to debugfs [Greg]
> > > o Add i_atime, i_generation, etc [Neil]
> > > o Support NFS export [Changman]
> > > o Move the f2fs magic number [Marco]
> > > o Add s_time_gran [Marco]
> > > o Fix f2fs_truncate [Marco]
> > > o Enhance f2fs document [Vyacheslav]
> > > o Support uuid and add additional comments [Vyacheslav]
> > > o Change superblock offset [Vyacheslav]
> > > o Fix some bugs and return values [Vyacheslav]
> > > o Improve initial mount time [Jaegeuk]
> > > o Fix f2fs-tools environment [Mike]
> > >
> > > I've set up a git tree in sf.net for f2fs-tools.
> > > Please download f2fs-tools by tag: for_patch_set_v3.
> >
> > I like to try this out.
> >
> > Do you have kernel code available in some git repo? I didn´t find
> > anything on git.kernel.org? Otherwise I take the patches from your
> > mails.
>
> At this moment, there is no f2fs-repo yet in public.
> If f2fs is merged, I'd like to make a dev tree somewhere.
> I really want to make the dev tree in kernel.org later, but it totally
> depends on the maintainers' decision.
> Before then, I apologize to retrieve f2fs sources from LKML.

After getting behind an USB blocks a CPU on my laptop problem I tried f2fs.

Some first tests on a 2GB extrememory USB 2.0 stick:

[ 239.731162] usb 1-1.1: new high-speed USB device number 8 using ehci_hcd
[ 239.818325] usb 1-1.1: New USB device found, idVendor=1307, idProduct=0163
[ 239.818339] usb 1-1.1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[ 239.818346] usb 1-1.1: Product: USB Mass Storage Device
[ 239.818352] usb 1-1.1: Manufacturer: USBest Technology
[ 239.818357] usb 1-1.1: SerialNumber: […]
[ 239.820595] scsi9 : usb-storage 1-1.1:1.0
[ 240.821487] scsi 9:0:0:0: Direct-Access TinyDisk 2007-05-12 0.00 PQ: 0 ANSI: 2
[ 240.825217] sd 9:0:0:0: Attached scsi generic sg2 type 0
[ 240.826318] sd 9:0:0:0: [sdb] 4095999 512-byte logical blocks: (2.09 GB/1.95 GiB)
[ 240.827126] sd 9:0:0:0: [sdb] Write Protect is off
[ 240.827143] sd 9:0:0:0: [sdb] Mode Sense: 00 00 00 00
[ 240.828161] sd 9:0:0:0: [sdb] Asking for cache data failed
[ 240.828174] sd 9:0:0:0: [sdb] Assuming drive cache: write through
[ 240.833414] sd 9:0:0:0: [sdb] Asking for cache data failed
[ 240.833428] sd 9:0:0:0: [sdb] Assuming drive cache: write through
[ 240.894728] sdb: unknown partition table
[ 240.899104] sd 9:0:0:0: [sdb] Asking for cache data failed
[ 240.899109] sd 9:0:0:0: [sdb] Assuming drive cache: write through
[ 240.899113] sd 9:0:0:0: [sdb] Attached SCSI removable disk
merkaba:~> fdisk /dev/sdb
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0x7babe038.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.

Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): n
Partition type:
p primary (0 primary, 0 extended, 4 free)
e extended
Select (default p): p
Partition number (1-4, default 1): 1
First sector (2048-4095998, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-4095998, default 4095998):
Using default value 4095998

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
merkaba:~> dmesg | tail -1
[ 276.342638] sdb: sdb1

merkaba:~> mkfs.f2fs /dev/sdb1
Info: sector size = 512
Info: total sectors = 4093951 (in 512bytes)
Info: zone aligned segment0 blkaddr: 256
Info: This device doesn't support TRIM
Info: format successful
merkaba:~> mount /dev/sdb1 /mnt/zeit
mount: you must specify the filesystem type
merkaba:~#32> mount -t f2fs /dev/sdb1 /mnt/zeit
merkaba:~> df -hT /mnt/zeit
Dateisystem Typ Größe Benutzt Verf. Verw% Eingehängt auf
/dev/sdb1 f2fs 2,0G 147M 1,8G 8% /mnt/zeit

merkaba:~> dd if=/dev/zero of=/mnt/zeit/zeros bs=1M count=200 conv=fsync
200+0 Datensätze ein
200+0 Datensätze aus
209715200 Bytes (210 MB) kopiert, 14,4896 s, 14,5 MB/s

merkaba:~> echo 3 > /proc/sys/vm/drop_caches; dd if=/mnt/zeit/zeros of=/dev/null bs=1M count=200
200+0 Datensätze ein
200+0 Datensätze aus
209715200 Bytes (210 MB) kopiert, 8,7358 s, 24,0 MB/s
merkaba:~> umount /mnt/zeit

merkaba:~> mkfs.ext4 /dev/sdb1
mke2fs 1.42.5 (29-Jul-2012)
Dateisystem-Label=
OS-Typ: Linux
Blockgröße=4096 (log=2)
Fragmentgröße=4096 (log=2)
Stride=0 Blöcke, Stripebreite=0 Blöcke
128000 Inodes, 511743 Blöcke
25587 Blöcke (5.00%) reserviert für den Superuser
Erster Datenblock=0
Maximale Dateisystem-Blöcke=524288000
16 Blockgruppen
32768 Blöcke pro Gruppe, 32768 Fragmente pro Gruppe
8000 Inodes pro Gruppe
Superblock-Sicherungskopien gespeichert in den Blöcken:
32768, 98304, 163840, 229376, 294912

Platz für Gruppentabellen wird angefordert: erledigt
Inode-Tabellen werden geschrieben: erledigt
Erstelle Journal (8192 Blöcke): erledigt
Schreibe Superblöcke und Dateisystem-Accountinginformationen: erledigt

merkaba:~> mount /dev/sdb1 /mnt/zeit
merkaba:~> df -hT /mnt/zeit
Dateisystem Typ Größe Benutzt Verf. Verw% Eingehängt auf
/dev/sdb1 ext4 2,0G 35M 1,8G 2% /mnt/zeit
merkaba:~> dd if=/dev/zero of=/mnt/zeit/zeros bs=1M count=200 conv=fsync
200+0 Datensätze ein
200+0 Datensätze aus
209715200 Bytes (210 MB) kopiert, 17,8875 s, 11,7 MB/s
merkaba:~> echo 3 > /proc/sys/vm/drop_caches; dd if=/mnt/zeit/zeros of=/dev/null bs=1M count=200
200+0 Datensätze ein
200+0 Datensätze aus
209715200 Bytes (210 MB) kopiert, 8,76721 s, 23,9 MB/s

merkaba:~> umount /mnt/zeit

So writing appears to be faster with large block sizes. But this was just
one iteration, so that would need to be verified by several subsequent
runs.

merkaba:~> mkfs.f2fs /dev/sdb1
Info: sector size = 512
Info: total sectors = 4093951 (in 512bytes)
Info: zone aligned segment0 blkaddr: 256
Info: This device doesn't support TRIM
Info: format successful
merkaba:~> mount -t f2fs /dev/sdb1 /mnt/zeit
merkaba:~> fio /tmp/usb-stick.job
seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=4
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=4
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=4
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=4
2.0.8
Starting 4 processes
seq-read: Laying out IO file(s) (1 file(s) / 512MB)
Jobs: 1 (f=1): [___w] [57.2% done] [0K/703K /s] [0 /175 iops] [eta 03m:00s]
seq-read: (groupid=0, jobs=1): err= 0: pid=7819
read : io=470600KB, bw=7843.8KB/s, iops=1960 , runt= 60002msec
slat (usec): min=2 , max=176 , avg=16.70, stdev= 9.41
clat (usec): min=632 , max=37073 , avg=2018.40, stdev=237.35
lat (usec): min=678 , max=37106 , avg=2035.70, stdev=237.23
clat percentiles (usec):
| 1.00th=[ 1848], 5.00th=[ 1928], 10.00th=[ 1960], 20.00th=[ 1960],
| 30.00th=[ 1976], 40.00th=[ 1976], 50.00th=[ 1992], 60.00th=[ 1992],
| 70.00th=[ 2008], 80.00th=[ 2096], 90.00th=[ 2128], 95.00th=[ 2224],
| 99.00th=[ 2384], 99.50th=[ 2480], 99.90th=[ 2768], 99.95th=[ 3088],
| 99.99th=[ 6880]
bw (KB/s) : min= 7080, max= 7992, per=100.00%, avg=7849.21, stdev=89.82
lat (usec) : 750=0.01%, 1000=0.01%
lat (msec) : 2=64.51%, 4=35.48%, 10=0.01%, 50=0.01%
cpu : usr=1.65%, sys=5.03%, ctx=89829, majf=0, minf=26
IO depths : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=117650/w=0/d=0, short=r=0/w=0/d=0
rand-read: (groupid=1, jobs=1): err= 0: pid=7828
read : io=344332KB, bw=5738.6KB/s, iops=1434 , runt= 60003msec
slat (usec): min=3 , max=298 , avg=18.09, stdev= 9.59
clat (usec): min=887 , max=6981 , avg=2764.00, stdev=694.65
lat (usec): min=995 , max=6996 , avg=2782.73, stdev=694.59
clat percentiles (usec):
| 1.00th=[ 1848], 5.00th=[ 1960], 10.00th=[ 1976], 20.00th=[ 2096],
| 30.00th=[ 2128], 40.00th=[ 2608], 50.00th=[ 2736], 60.00th=[ 2832],
| 70.00th=[ 2928], 80.00th=[ 3376], 90.00th=[ 3600], 95.00th=[ 4128],
| 99.00th=[ 4832], 99.50th=[ 5024], 99.90th=[ 5728], 99.95th=[ 6112],
| 99.99th=[ 6688]
bw (KB/s) : min= 5632, max= 5800, per=100.00%, avg=5742.82, stdev=27.33
lat (usec) : 1000=0.01%
lat (msec) : 2=13.36%, 4=80.18%, 10=6.46%
cpu : usr=1.42%, sys=4.42%, ctx=87501, majf=0, minf=25
IO depths : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=86083/w=0/d=0, short=r=0/w=0/d=0
seq-write: (groupid=2, jobs=1): err= 0: pid=7832
write: io=40480KB, bw=690824 B/s, iops=168 , runt= 60003msec
slat (msec): min=1 , max=71 , avg= 5.91, stdev=11.88
clat (usec): min=11 , max=77132 , avg=17788.60, stdev=18686.66
lat (msec): min=2 , max=111 , avg=23.70, stdev=21.99
clat percentiles (usec):
| 1.00th=[ 6368], 5.00th=[ 6688], 10.00th=[ 6816], 20.00th=[ 6944],
| 30.00th=[ 7008], 40.00th=[ 7072], 50.00th=[ 7136], 60.00th=[ 7200],
| 70.00th=[ 7456], 80.00th=[39168], 90.00th=[39680], 95.00th=[39680],
| 99.00th=[76288], 99.50th=[76288], 99.90th=[77312], 99.95th=[77312],
| 99.99th=[77312]
bw (KB/s) : min= 597, max= 726, per=100.00%, avg=675.17, stdev=47.46
lat (usec) : 20=0.01%
lat (msec) : 4=0.01%, 10=71.88%, 50=23.42%, 100=4.68%
cpu : usr=0.49%, sys=2.05%, ctx=10731, majf=0, minf=22
IO depths : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=10120/d=0, short=r=0/w=0/d=0
rand-write: (groupid=3, jobs=1): err= 0: pid=7875
write: io=40212KB, bw=686021 B/s, iops=167 , runt= 60023msec
slat (msec): min=2 , max=71 , avg= 5.95, stdev=11.94
clat (usec): min=10 , max=77297 , avg=17916.27, stdev=18776.67
lat (msec): min=8 , max=111 , avg=23.87, stdev=22.09
clat percentiles (usec):
| 1.00th=[ 6560], 5.00th=[ 6816], 10.00th=[ 6944], 20.00th=[ 7008],
| 30.00th=[ 7072], 40.00th=[ 7136], 50.00th=[ 7200], 60.00th=[ 7264],
| 70.00th=[ 7520], 80.00th=[39168], 90.00th=[39680], 95.00th=[39680],
| 99.00th=[76288], 99.50th=[77312], 99.90th=[77312], 99.95th=[77312],
| 99.99th=[77312]
bw (KB/s) : min= 582, max= 718, per=100.00%, avg=670.32, stdev=44.67
lat (usec) : 20=0.01%
lat (msec) : 10=71.85%, 50=23.46%, 100=4.69%
cpu : usr=0.46%, sys=2.14%, ctx=10674, majf=0, minf=21
IO depths : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=10053/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
READ: io=470600KB, aggrb=7843KB/s, minb=7843KB/s, maxb=7843KB/s, mint=60002msec, maxt=60002msec

Run status group 1 (all jobs):
READ: io=344332KB, aggrb=5738KB/s, minb=5738KB/s, maxb=5738KB/s, mint=60003msec, maxt=60003msec

Run status group 2 (all jobs):
WRITE: io=40480KB, aggrb=674KB/s, minb=674KB/s, maxb=674KB/s, mint=60003msec, maxt=60003msec

Run status group 3 (all jobs):
WRITE: io=40212KB, aggrb=669KB/s, minb=669KB/s, maxb=669KB/s, mint=60023msec, maxt=60023msec

Disk stats (read/write):
sdb: ios=174042/20172, merge=29691/0, ticks=412897/117376, in_queue=529790, util=98.65%
merkaba:~> umount /mnt/zeit

merkaba:~> mkfs.ext4 /dev/sdb1
mke2fs 1.42.5 (29-Jul-2012)
Dateisystem-Label=
OS-Typ: Linux
Blockgröße=4096 (log=2)
[…]
merkaba:~> mount /dev/sdb1 /mnt/zeit
merkaba:~> fio /tmp/usb-stick.job
seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=4
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=4
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=4
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=4
2.0.8
Starting 4 processes
seq-read: Laying out IO file(s) (1 file(s) / 512MB)
Jobs: 1 (f=1): [___w] [57.2% done] [0K/131K /s] [0 /32 iops] [eta 03m:00s]
seq-read: (groupid=0, jobs=1): err= 0: pid=8271
read : io=467860KB, bw=7797.5KB/s, iops=1949 , runt= 60002msec
slat (usec): min=2 , max=933 , avg=18.24, stdev=10.59
clat (usec): min=728 , max=112945 , avg=2028.78, stdev=911.01
lat (usec): min=775 , max=112959 , avg=2047.63, stdev=910.94
clat percentiles (usec):
| 1.00th=[ 1864], 5.00th=[ 1944], 10.00th=[ 1960], 20.00th=[ 1960],
| 30.00th=[ 1976], 40.00th=[ 1976], 50.00th=[ 1976], 60.00th=[ 1992],
| 70.00th=[ 2008], 80.00th=[ 2096], 90.00th=[ 2128], 95.00th=[ 2224],
| 99.00th=[ 2384], 99.50th=[ 2480], 99.90th=[ 2864], 99.95th=[ 3184],
| 99.99th=[55040]
bw (KB/s) : min= 5538, max= 7984, per=100.00%, avg=7805.33, stdev=262.86
lat (usec) : 750=0.01%, 1000=0.01%
lat (msec) : 2=64.92%, 4=35.05%, 10=0.01%, 50=0.01%, 100=0.01%
lat (msec) : 250=0.01%
cpu : usr=1.58%, sys=5.43%, ctx=89023, majf=0, minf=26
IO depths : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=116965/w=0/d=0, short=r=0/w=0/d=0
rand-read: (groupid=1, jobs=1): err= 0: pid=8282
read : io=334860KB, bw=5580.8KB/s, iops=1395 , runt= 60003msec
slat (usec): min=4 , max=346 , avg=23.79, stdev=11.60
clat (usec): min=819 , max=7300 , avg=2835.92, stdev=710.54
lat (usec): min=945 , max=7318 , avg=2861.00, stdev=710.61
clat percentiles (usec):
| 1.00th=[ 1912], 5.00th=[ 1976], 10.00th=[ 2064], 20.00th=[ 2096],
| 30.00th=[ 2224], 40.00th=[ 2704], 50.00th=[ 2832], 60.00th=[ 2864],
| 70.00th=[ 2992], 80.00th=[ 3472], 90.00th=[ 3696], 95.00th=[ 4192],
| 99.00th=[ 4960], 99.50th=[ 5216], 99.90th=[ 5856], 99.95th=[ 6240],
| 99.99th=[ 6816]
bw (KB/s) : min= 5504, max= 5640, per=100.00%, avg=5584.81, stdev=22.35
lat (usec) : 1000=0.01%
lat (msec) : 2=7.52%, 4=85.09%, 10=7.39%
cpu : usr=1.77%, sys=5.29%, ctx=86411, majf=0, minf=25
IO depths : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=83715/w=0/d=0, short=r=0/w=0/d=0
seq-write: (groupid=2, jobs=1): err= 0: pid=8301
write: io=53900KB, bw=919295 B/s, iops=224 , runt= 60039msec
slat (usec): min=3 , max=87697 , avg=91.25, stdev=1634.30
clat (msec): min=2 , max=126 , avg=17.72, stdev=20.48
lat (msec): min=2 , max=135 , avg=17.81, stdev=20.61
clat percentiles (msec):
| 1.00th=[ 7], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 7],
| 30.00th=[ 7], 40.00th=[ 8], 50.00th=[ 8], 60.00th=[ 8],
| 70.00th=[ 8], 80.00th=[ 39], 90.00th=[ 40], 95.00th=[ 80],
| 99.00th=[ 81], 99.50th=[ 81], 99.90th=[ 85], 99.95th=[ 90],
| 99.99th=[ 125]
bw (KB/s) : min= 780, max= 958, per=100.00%, avg=897.22, stdev=22.03
lat (msec) : 4=0.08%, 10=74.52%, 20=0.03%, 50=19.01%, 100=6.32%
lat (msec) : 250=0.04%
cpu : usr=0.60%, sys=1.69%, ctx=10666, majf=0, minf=22
IO depths : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=13475/d=0, short=r=0/w=0/d=0
rand-write: (groupid=3, jobs=1): err= 0: pid=8672
write: io=7520.0KB, bw=128085 B/s, iops=31 , runt= 60120msec
slat (usec): min=12 , max=132582 , avg=662.60, stdev=8065.44
clat (msec): min=6 , max=314 , avg=127.16, stdev=41.47
lat (msec): min=6 , max=314 , avg=127.82, stdev=41.47
clat percentiles (msec):
| 1.00th=[ 38], 5.00th=[ 64], 10.00th=[ 77], 20.00th=[ 94],
| 30.00th=[ 99], 40.00th=[ 120], 50.00th=[ 127], 60.00th=[ 131],
| 70.00th=[ 149], 80.00th=[ 159], 90.00th=[ 182], 95.00th=[ 202],
| 99.00th=[ 247], 99.50th=[ 251], 99.90th=[ 306], 99.95th=[ 314],
| 99.99th=[ 314]
bw (KB/s) : min= 98, max= 143, per=99.86%, avg=124.83, stdev= 7.49
lat (msec) : 10=0.05%, 20=0.05%, 50=2.61%, 100=29.41%, 250=67.13%
lat (msec) : 500=0.74%
cpu : usr=0.16%, sys=0.34%, ctx=1904, majf=0, minf=20
IO depths : 1=0.1%, 2=0.1%, 4=99.8%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=1880/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
READ: io=467860KB, aggrb=7797KB/s, minb=7797KB/s, maxb=7797KB/s, mint=60002msec, maxt=60002msec

Run status group 1 (all jobs):
READ: io=334860KB, aggrb=5580KB/s, minb=5580KB/s, maxb=5580KB/s, mint=60003msec, maxt=60003msec

Run status group 2 (all jobs):
WRITE: io=53900KB, aggrb=897KB/s, minb=897KB/s, maxb=897KB/s, mint=60039msec, maxt=60039msec

Run status group 3 (all jobs):
WRITE: io=7520KB, aggrb=125KB/s, minb=125KB/s, maxb=125KB/s, mint=60120msec, maxt=60120msec

Disk stats (read/write):
sdb: ios=171292/11991, merge=29388/3441, ticks=411887/417030, in_queue=828667, util=99.78%

merkaba:~> umount /mnt/zeit

On read F2FS and Ext4 have similar results.

Ext4 is a bit faster on sequential writes here, which contradicts dd
from above.

But F2FS is a lot, a huge lot faster on random writes!

I wonder how that is related to the iodepth I used. (Copied over from
ssd-test fio example file.). According to fio it did use iodepth 4. I wonder
about reasonable value for testing.

This has just been some initial testing. I wonder about good tests that mimic
typical use cases. What are the usecases?

- SD cards: photos, music files for portable players. smartphones, pads
with Android?

- USB-Sticks: random files users puts there, live distributions like GRML
and distro installers

Unless lots of small files stored or Linux being booted, its more larger
block size I/O it seems to me. Android likely deals with small files as
well. (I do not use a smartphone yet.)

While I use an USB stick to boot up Debian on my ASUS WL-500gP DSL router
that may not be a wide-spread usecase.

Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7

2012-11-10 18:40:29

by Martin Steigerwald

[permalink] [raw]

Subject: Re: [PATCH 00/16 v3] f2fs: introduce flash-friendly file system

Am Samstag, 10. November 2012 schrieb Martin Steigerwald:
> merkaba:~> mkfs.f2fs /dev/sdb1
> Info: sector size = 512
> Info: total sectors = 4093951 (in 512bytes)
> Info: zone aligned segment0 blkaddr: 256
> Info: This device doesn't support TRIM
> Info: format successful
> merkaba:~> mount -t f2fs /dev/sdb1 /mnt/zeit
> merkaba:~> fio /tmp/usb-stick.job
> seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=4
> rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio,
> iodepth=4 seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio,
> iodepth=4 rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K,
> ioengine=libaio, iodepth=4 2.0.8
> Starting 4 processes
> seq-read: Laying out IO file(s) (1 file(s) / 512MB)
> Jobs: 1 (f=1): [___w] [57.2% done] [0K/703K /s] [0 /175 iops] [eta
> 03m:00s] seq-read: (groupid=0, jobs=1): err= 0: pid=7819
> read : io=470600KB, bw=7843.8KB/s, iops=1960 , runt= 60002msec

I forgot the job file I used:

merkaba:~> cat /tmp/usb-stick.job
[global]
bs=4k
ioengine=libaio
size=512m
direct=1
iodepth=4
runtime=60
directory=/mnt/zeit
filename=test.file

[seq-read]
rw=read
stonewall

[rand-read]
rw=randread
stonewall

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall

Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7

2012-11-10 20:55:24

by Viacheslav Dubeyko

[permalink] [raw]

Subject: Re: [PATCH 00/16 v3] f2fs: introduce flash-friendly file system

Hi,

On Nov 10, 2012, at 9:33 PM, Martin Steigerwald wrote:

[snip]
>
> merkaba:~> mkfs.f2fs /dev/sdb1
> Info: sector size = 512
> Info: total sectors = 4093951 (in 512bytes)
> Info: zone aligned segment0 blkaddr: 256
> Info: This device doesn't support TRIM
> Info: format successful
> merkaba:~> mount /dev/sdb1 /mnt/zeit
> mount: you must specify the filesystem type
> merkaba:~#32> mount -t f2fs /dev/sdb1 /mnt/zeit
> merkaba:~> df -hT /mnt/zeit
> Dateisystem Typ Gr??e Benutzt Verf. Verw% Eingeh?ngt auf
> /dev/sdb1 f2fs 2,0G 147M 1,8G 8% /mnt/zeit
>

Do you really have trouble with f2fs mount without definition of filesystem type? If so, it is a bug.

Moreover, I think that it is really inconvenient that mkfs.f2fs doesn't output its version. I suggest to output version of f2fs utilities.

With the best regards,
Vyacheslav Dubeyko.

2012-11-10 21:50:01

by Arnd Bergmann

[permalink] [raw]

Subject: Re: [PATCH 00/16 v3] f2fs: introduce flash-friendly file system

On Saturday 10 November 2012, Martin Steigerwald wrote:
> Command (m for help): n
> Partition type:
> p primary (0 primary, 0 extended, 4 free)
> e extended
> Select (default p): p
> Partition number (1-4, default 1): 1
> First sector (2048-4095998, default 2048):
> Using default value 2048
> Last sector, +sectors or +size{K,M,G} (2048-4095998, default 4095998):
> Using default value 4095998

This is almost certainly not the right setting for f2fs, which only works
at its design point if the segments are aligned to erase blocks. All modern
flash devices have erase blocks larger than 1 MB, so starting the partition
at a 1 MB offset will cause it to be misaligned. Also, some USB sticks
have an area optimized for random writes in the beginning of the drive
where both FAT32 and f2fs store their metadata. It may be worth testing
again without a partition table, using just the raw device.

I would also recommend using flashbench to find out the optimum parameters
for your device. You can download it from
git://git.linaro.org/people/arnd/flashbench.git
In the long run, we should automate those tests and make them part of
mkfs.f2fs, but for now, try to find out the erase block size and the number
of concurrently used erase blocks on your device using a timing attack
in flashbench. The README file in there explains how to interpret the
results from "./flashbench -a /dev/sdb --blocksize=1024" to guess
the erase block size, although that sometimes doesn't work.

With the correct guess, compare the performance you get using

$ ERASESIZE=$[2*1024*1024] # replace with guess from flashbench -a
$ ./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=4096 --erasesize=${ERASESIZE}
$ ./flashbench /dev/sdb --open-au --open-au-nr=3 --blocksize=4096 --erasesize=${ERASESIZE}
$ ./flashbench /dev/sdb --open-au --open-au-nr=5 --blocksize=4096 --erasesize=${ERASESIZE}
$ ./flashbench /dev/sdb --open-au --open-au-nr=7 --blocksize=4096 --erasesize=${ERASESIZE}
$ ./flashbench /dev/sdb --open-au --open-au-nr=13 --blocksize=4096 --erasesize=${ERASESIZE}

The first one of those should always be the fastest, hopefully followed by
some that are equally fast and then some much slower ones (especially for the
smaller block sizes). The "active_logs=N" mount option should be one less
than the highest number above that is still "fast", and only "2", "4" and "6"
are valid at the moment. If you are lucky, your device is still fast with
"--open-au-nr=7" and slow only for higher numbers, then the default of "6"
is ok.

If the erase size is larger than 2 MB, then you have to "-s" option in
mkfs.f2fs to configure how many 2 MB segments there are in one erase block.
For a 2 GB USB stick, I would guess that the erase block size is 1, 2 or
4 MB. Newer (larger) sticks will have larger erase blocks that may also
be a multiple of 3 MB (3, 6, 12, or 24).

Arnd

2012-11-11 11:43:01

[permalink] [raw]

Subject: Re: [PATCH 00/16 v3] f2fs: introduce flash-friendly file system

2012-11-11 (일), 00:55 +0300, Vyacheslav Dubeyko:
> Hi,
>
> On Nov 10, 2012, at 9:33 PM, Martin Steigerwald wrote:
>
> [snip]
> >
> > merkaba:~> mkfs.f2fs /dev/sdb1
> > Info: sector size = 512
> > Info: total sectors = 4093951 (in 512bytes)
> > Info: zone aligned segment0 blkaddr: 256
> > Info: This device doesn't support TRIM
> > Info: format successful
> > merkaba:~> mount /dev/sdb1 /mnt/zeit
> > mount: you must specify the filesystem type
> > merkaba:~#32> mount -t f2fs /dev/sdb1 /mnt/zeit
> > merkaba:~> df -hT /mnt/zeit
> > Dateisystem Typ Größe Benutzt Verf. Verw% Eingehängt auf
> > /dev/sdb1 f2fs 2,0G 147M 1,8G 8% /mnt/zeit
> >
>
> Do you really have trouble with f2fs mount without definition of filesystem type? If so, it is a bug.

Could you explain this bug in more detail?
Or, can anyone comment this?

Actually, I've looking at the *mount* in linux-utils.
I suspect something does not support f2fs in linux-utils.

>
> Moreover, I think that it is really inconvenient that mkfs.f2fs doesn't output its version. I suggest to output version of f2fs utilities.

Agreed.
But, I'd like to note that f2fs is not merged yet.
IMHO, it'd better give an initial version after f2fs is merged.
Thanks,

>
> With the best regards,
> Vyacheslav Dubeyko.
>

--
Jaegeuk Kim
Samsung

2012-11-12 06:04:16

by Viacheslav Dubeyko

[permalink] [raw]

Subject: Re: [PATCH 00/16 v3] f2fs: introduce flash-friendly file system

On Sun, 2012-11-11 at 20:42 +0900, Jaegeuk Kim wrote:
> 2012-11-11 (일), 00:55 +0300, Vyacheslav Dubeyko:
> > Hi,
> >
> > On Nov 10, 2012, at 9:33 PM, Martin Steigerwald wrote:
> >
> > [snip]
> > >
> > > merkaba:~> mkfs.f2fs /dev/sdb1
> > > Info: sector size = 512
> > > Info: total sectors = 4093951 (in 512bytes)
> > > Info: zone aligned segment0 blkaddr: 256
> > > Info: This device doesn't support TRIM
> > > Info: format successful
> > > merkaba:~> mount /dev/sdb1 /mnt/zeit
> > > mount: you must specify the filesystem type
> > > merkaba:~#32> mount -t f2fs /dev/sdb1 /mnt/zeit
> > > merkaba:~> df -hT /mnt/zeit
> > > Dateisystem Typ Größe Benutzt Verf. Verw% Eingehängt auf
> > > /dev/sdb1 f2fs 2,0G 147M 1,8G 8% /mnt/zeit
> > >
> >
> > Do you really have trouble with f2fs mount without definition of filesystem type? If so, it is a bug.
>
> Could you explain this bug in more detail?
> Or, can anyone comment this?
>
> Actually, I've looking at the *mount* in linux-utils.
> I suspect something does not support f2fs in linux-utils.
>

Does this issue reproducible or not on last version of f2fs utilities?
There are some probability that it can be an issue of previous version.

As I can understand, if Linux kernel has support of concrete file system
then it should be mounted as with file system type definition as without
it. From my point of understanding, the linux-utils has some
universality and support of all possible file system than kernel can
support. Or I misunderstand something? I think that it can be an issue
on the kernel side. It needs to analyze system log in the case of issue
reproducibility, I think.

With the best regards,
Vyacheslav Dubeyko.

> >
> > Moreover, I think that it is really inconvenient that mkfs.f2fs doesn't output its version. I suggest to output version of f2fs utilities.
>
> Agreed.
> But, I'd like to note that f2fs is not merged yet.
> IMHO, it'd better give an initial version after f2fs is merged.
> Thanks,
>
> >
> > With the best regards,
> > Vyacheslav Dubeyko.
> >
>

2012-11-12 15:16:27

by Martin Steigerwald

[permalink] [raw]

Subject: Re: [PATCH 00/16 v3] f2fs: introduce flash-friendly file system

Am Samstag, 10. November 2012 schrieb Arnd Bergmann:
> On Saturday 10 November 2012, Martin Steigerwald wrote:
> > Command (m for help): n
> > Partition type:
> > p primary (0 primary, 0 extended, 4 free)
> > e extended
> > Select (default p): p
> > Partition number (1-4, default 1): 1
> > First sector (2048-4095998, default 2048):
> > Using default value 2048
> > Last sector, +sectors or +size{K,M,G} (2048-4095998, default 4095998):
> > Using default value 4095998
>
> This is almost certainly not the right setting for f2fs, which only works
> at its design point if the segments are aligned to erase blocks. All modern
> flash devices have erase blocks larger than 1 MB, so starting the partition
> at a 1 MB offset will cause it to be misaligned. Also, some USB sticks
> have an area optimized for random writes in the beginning of the drive
> where both FAT32 and f2fs store their metadata. It may be worth testing
> again without a partition table, using just the raw device.

Thank you for your hints, Arnd, much appreciated.

I already suspected as such after having read some of the fine documents on
the linaro website.

As I want to write some article to give Linux users some insight about
Linux on "cheap" flash, I am willing to learn more.

> I would also recommend using flashbench to find out the optimum parameters
> for your device. You can download it from
> git://git.linaro.org/people/arnd/flashbench.git
> In the long run, we should automate those tests and make them part of
> mkfs.f2fs, but for now, try to find out the erase block size and the number
> of concurrently used erase blocks on your device using a timing attack
> in flashbench. The README file in there explains how to interpret the
> results from "./flashbench -a /dev/sdb --blocksize=1024" to guess
> the erase block size, although that sometimes doesn't work.

Why do I use a blocksize of 1024 if the kernel reports me 512 byte blocks?

[ 3112.144086] scsi9 : usb-storage 1-1.1:1.0
[ 3113.145968] scsi 9:0:0:0: Direct-Access TinyDisk 2007-05-12 0.00 PQ: 0 ANSI: 2
[ 3113.146476] sd 9:0:0:0: Attached scsi generic sg2 type 0
[ 3113.147935] sd 9:0:0:0: [sdb] 4095999 512-byte logical blocks: (2.09 GB/1.95 GiB)
[ 3113.148935] sd 9:0:0:0: [sdb] Write Protect is off

And how do reads give information about erase block size? Wouldn´t writes me
more conclusive for that? (Having to erase one versus two erase blocks?)

Hmmm, I get very varying results here with said USB stick:

merkaba:~> /tmp/flashbench -a /dev/sdb
align 536870912 pre 1.1ms on 1.1ms post 1.08ms diff 13µs
align 268435456 pre 1.2ms on 1.19ms post 1.16ms diff 11.6µs
align 134217728 pre 1.12ms on 1.14ms post 1.15ms diff 9.51µs
align 67108864 pre 1.12ms on 1.15ms post 1.12ms diff 29.9µs
align 33554432 pre 1.11ms on 1.17ms post 1.13ms diff 49µs
align 16777216 pre 1.14ms on 1.16ms post 1.15ms diff 22.4µs
align 8388608 pre 1.12ms on 1.09ms post 1.06ms diff -2053ns
align 4194304 pre 1.13ms on 1.16ms post 1.14ms diff 21.7µs
align 2097152 pre 1.11ms on 1.08ms post 1.1ms diff -18488n
align 1048576 pre 1.11ms on 1.11ms post 1.11ms diff -2461ns
align 524288 pre 1.15ms on 1.17ms post 1.1ms diff 45.4µs
align 262144 pre 1.11ms on 1.13ms post 1.13ms diff 12µs
align 131072 pre 1.1ms on 1.09ms post 1.16ms diff -38025n
align 65536 pre 1.09ms on 1.08ms post 1.11ms diff -21353n
align 32768 pre 1.1ms on 1.08ms post 1.11ms diff -23854n
merkaba:~> /tmp/flashbench -a /dev/sdb
align 536870912 pre 1.11ms on 1.13ms post 1.13ms diff 10.6µs
align 268435456 pre 1.12ms on 1.2ms post 1.17ms diff 61.4µs
align 134217728 pre 1.14ms on 1.19ms post 1.15ms diff 46.8µs
align 67108864 pre 1.08ms on 1.15ms post 1.08ms diff 63.8µs
align 33554432 pre 1.09ms on 1.08ms post 1.09ms diff -4761ns
align 16777216 pre 1.12ms on 1.14ms post 1.07ms diff 41.4µs
align 8388608 pre 1.1ms on 1.1ms post 1.09ms diff 7.48µs
align 4194304 pre 1.08ms on 1.1ms post 1.1ms diff 10.1µs
align 2097152 pre 1.1ms on 1.11ms post 1.1ms diff 16µs
align 1048576 pre 1.09ms on 1.1ms post 1.07ms diff 15.5µs
align 524288 pre 1.12ms on 1.12ms post 1.1ms diff 11µs
align 262144 pre 1.13ms on 1.13ms post 1.1ms diff 21.6µs
align 131072 pre 1.11ms on 1.13ms post 1.12ms diff 17.9µs
align 65536 pre 1.07ms on 1.1ms post 1.1ms diff 11.6µs
align 32768 pre 1.09ms on 1.11ms post 1.13ms diff -5131ns
merkaba:~> /tmp/flashbench -a /dev/sdb
align 536870912 pre 1.2ms on 1.18ms post 1.21ms diff -27496n
align 268435456 pre 1.22ms on 1.21ms post 1.24ms diff -18972n
align 134217728 pre 1.15ms on 1.19ms post 1.14ms diff 42.5µs
align 67108864 pre 1.08ms on 1.09ms post 1.08ms diff 5.29µs
align 33554432 pre 1.18ms on 1.19ms post 1.18ms diff 9.25µs
align 16777216 pre 1.18ms on 1.22ms post 1.17ms diff 48.6µs
align 8388608 pre 1.14ms on 1.17ms post 1.19ms diff 4.36µs
align 4194304 pre 1.16ms on 1.2ms post 1.11ms diff 65.8µs
align 2097152 pre 1.13ms on 1.09ms post 1.12ms diff -37718n
align 1048576 pre 1.15ms on 1.2ms post 1.18ms diff 34.9µs
align 524288 pre 1.14ms on 1.19ms post 1.16ms diff 41.5µs
align 262144 pre 1.19ms on 1.12ms post 1.15ms diff -52725n
align 131072 pre 1.21ms on 1.11ms post 1.14ms diff -68522n
align 65536 pre 1.21ms on 1.13ms post 1.18ms diff -64248n
align 32768 pre 1.14ms on 1.25ms post 1.12ms diff 116µs

Even when I apply the explaination of the README I do not seem to get a
clear picture of the stick erase block size.

The values above seem to indicate to me: I don´t care about alignment at all.

With another flash, likely slower Intenso 4GB stick I get:

[ 3672.512143] scsi 10:0:0:0: Direct-Access Ut165 USB2FlashStorage 0.00 PQ: 0 ANSI: 2
[ 3672.514469] sd 10:0:0:0: Attached scsi generic sg2 type 0
[ 3672.514991] sd 10:0:0:0: [sdb] 7897088 512-byte logical blocks: (4.04 GB/3.76 GiB)
[…]
merkaba:~> /tmp/flashbench -a /dev/sdb
align 1073741824 pre 1.06ms on 1.03ms post 951µs diff 26.1µs
align 536870912 pre 1.06ms on 1ms post 941µs diff 1.17µs
align 268435456 pre 995µs on 957µs post 887µs diff 15.7µs
align 134217728 pre 994µs on 951µs post 883µs diff 12.4µs
align 67108864 pre 994µs on 989µs post 1.02ms diff -15104n
align 33554432 pre 934µs on 974µs post 1ms diff 4.16µs
align 16777216 pre 946µs on 916µs post 900µs diff -6588ns
align 8388608 pre 883µs on 881µs post 880µs diff -1176ns
align 4194304 pre 884µs on 884µs post 885µs diff -159ns

here?

align 2097152 pre 880µs on 879µs post 783µs diff 47.6µs
align 1048576 pre 877µs on 881µs post 878µs diff 3.92µs
align 524288 pre 869µs on 870µs post 875µs diff -2101ns
align 262144 pre 871µs on 875µs post 885µs diff -2539ns
align 131072 pre 878µs on 893µs post 900µs diff 3.6µs
align 65536 pre 851µs on 881µs post 884µs diff 13.7µs
align 32768 pre 836µs on 833µs post 880µs diff -25556n
merkaba:~> /tmp/flashbench -a /dev/sdb
align 1073741824 pre 1.07ms on 1e+03µ post 962µs diff -14615n
align 536870912 pre 1.06ms on 1.01ms post 940µs diff 12.2µs
align 268435456 pre 1ms on 943µs post 885µs diff -1132ns
align 134217728 pre 995µs on 982µs post 909µs diff 30µs
align 67108864 pre 999µs on 995µs post 1.01ms diff -9707ns
align 33554432 pre 960µs on 1.01ms post 1.03ms diff 15.2µs
align 16777216 pre 954µs on 928µs post 878µs diff 12.1µs
align 8388608 pre 872µs on 900µs post 895µs diff 16.5µs
align 4194304 pre 895µs on 862µs post 890µs diff -30439n
align 2097152 pre 889µs on 901µs post 876µs diff 18.7µs
align 1048576 pre 900µs on 898µs post 897µs diff -708ns

here?

align 524288 pre 885µs on 874µs post 881µs diff -8470ns
align 262144 pre 817µs on 873µs post 878µs diff 25.6µs
align 131072 pre 882µs on 854µs post 881µs diff -27423n
align 65536 pre 866µs on 890µs post 885µs diff 14.3µs
align 32768 pre 900µs on 881µs post 893µs diff -15412n
merkaba:~> /tmp/flashbench -a /dev/sdb
align 1073741824 pre 1.12ms on 1.02ms post 949µs diff -12574n
align 536870912 pre 1.07ms on 1.03ms post 948µs diff 16.5µs
align 268435456 pre 1.01ms on 958µs post 883µs diff 12.1µs
align 134217728 pre 994µs on 946µs post 879µs diff 9.2µs
align 67108864 pre 1ms on 1.05ms post 1.03ms diff 37.9µs
align 33554432 pre 942µs on 1.01ms post 1.03ms diff 20.6µs
align 16777216 pre 939µs on 903µs post 880µs diff -5972ns
align 8388608 pre 900µs on 914µs post 923µs diff 2.42µs
align 4194304 pre 894µs on 886µs post 882µs diff -1563ns

here?

align 2097152 pre 829µs on 890µs post 874µs diff 37.8µs
align 1048576 pre 899µs on 882µs post 843µs diff 11.1µs
align 524288 pre 890µs on 887µs post 902µs diff -9005ns
align 262144 pre 887µs on 887µs post 898µs diff -5474ns
align 131072 pre 928µs on 895µs post 914µs diff -26028n
align 65536 pre 898µs on 898µs post 894µs diff 2.59µs
align 32768 pre 884µs on 891µs post 901µs diff -1284ns

Similar picture. The diffs seem to be mostly quite small with only some
micro seconds. Or am I misreading something?

Then with a quite fast one 16 GB Transcend.

[ 4055.393399] sd 11:0:0:0: Attached scsi generic sg2 type 0
[ 4055.394729] sd 11:0:0:0: [sdb] 31375360 512-byte logical blocks: (16.0 GB/14.9 GiB)
[ 4055.395262] sd 11:0:0:0: [sdb] Write Protect is off

merkaba:~> /tmp/flashbench -a /dev/sdb
align 4294967296 pre 1.28ms on 1.48ms post 1.33ms diff 179µs
align 2147483648 pre 1.32ms on 1.51ms post 1.33ms diff 181µs
align 1073741824 pre 1.31ms on 1.46ms post 1.35ms diff 132µs
align 536870912 pre 1.27ms on 1.52ms post 1.33ms diff 228µs
align 268435456 pre 1.28ms on 1.46ms post 1.31ms diff 161µs
align 134217728 pre 1.28ms on 1.44ms post 1.37ms diff 120µs
align 67108864 pre 1.27ms on 1.44ms post 1.34ms diff 133µs
align 33554432 pre 1.24ms on 1.42ms post 1.31ms diff 150µs
align 16777216 pre 1.23ms on 1.46ms post 1.26ms diff 218µs
align 8388608 pre 1.31ms on 1.5ms post 1.33ms diff 180µs
align 4194304 pre 1.27ms on 1.45ms post 1.36ms diff 135µs
align 2097152 pre 1.29ms on 1.37ms post 1.39ms diff 33.7µs

here?

align 1048576 pre 1.31ms on 1.44ms post 1.35ms diff 115µs
align 524288 pre 1.33ms on 1.39ms post 1.48ms diff -12297n
align 262144 pre 1.36ms on 1.42ms post 1.4ms diff 45.6µs
align 131072 pre 1.37ms on 1.44ms post 1.4ms diff 57.7µs
align 65536 pre 1.36ms on 1.35ms post 1.33ms diff 4.67µs
align 32768 pre 1.32ms on 1.38ms post 1.34ms diff 44.1µs
merkaba:~> /tmp/flashbench -a /dev/sdb
align 4294967296 pre 1.36ms on 1.49ms post 1.34ms diff 139µs
align 2147483648 pre 1.26ms on 1.48ms post 1.27ms diff 213µs
align 1073741824 pre 1.26ms on 1.45ms post 1.33ms diff 164µs
align 536870912 pre 1.22ms on 1.46ms post 1.35ms diff 173µs
align 268435456 pre 1.34ms on 1.5ms post 1.31ms diff 172µs
align 134217728 pre 1.34ms on 1.48ms post 1.31ms diff 157µs
align 67108864 pre 1.29ms on 1.46ms post 1.34ms diff 142µs
align 33554432 pre 1.28ms on 1.47ms post 1.31ms diff 173µs
align 16777216 pre 1.26ms on 1.48ms post 1.37ms diff 168µs
align 8388608 pre 1.31ms on 1.47ms post 1.36ms diff 139µs
align 4194304 pre 1.26ms on 1.53ms post 1.33ms diff 237µs
align 2097152 pre 1.34ms on 1.4ms post 1.36ms diff 56.4µs
align 1048576 pre 1.32ms on 1.35ms post 1.37ms diff 638ns

here?

align 524288 pre 1.29ms on 1.47ms post 1.45ms diff 98.1µs
align 262144 pre 1.35ms on 1.38ms post 1.42ms diff -11916n
align 131072 pre 1.32ms on 1.46ms post 1.4ms diff 100µs
align 65536 pre 1.35ms on 1.42ms post 1.43ms diff 30.8µs
align 32768 pre 1.31ms on 1.37ms post 1.33ms diff 51µs
merkaba:~> /tmp/flashbench -a /dev/sdb
align 4294967296 pre 1.26ms on 1.49ms post 1.27ms diff 222µs
align 2147483648 pre 1.25ms on 1.41ms post 1.37ms diff 97.3µs
align 1073741824 pre 1.26ms on 1.47ms post 1.31ms diff 186µs
align 536870912 pre 1.25ms on 1.42ms post 1.32ms diff 132µs
align 268435456 pre 1.2ms on 1.44ms post 1.29ms diff 195µs
align 134217728 pre 1.27ms on 1.43ms post 1.34ms diff 118µs
align 67108864 pre 1.25ms on 1.45ms post 1.31ms diff 165µs
align 33554432 pre 1.22ms on 1.36ms post 1.25ms diff 124µs
align 16777216 pre 1.24ms on 1.44ms post 1.26ms diff 191µs
align 8388608 pre 1.22ms on 1.39ms post 1.23ms diff 164µs
align 4194304 pre 1.23ms on 1.43ms post 1.3ms diff 171µs
align 2097152 pre 1.26ms on 1.3ms post 1.32ms diff 16.7µs
align 1048576 pre 1.26ms on 1.27ms post 1.26ms diff 7.91µs

here?

align 524288 pre 1.24ms on 1.3ms post 1.3ms diff 29.2µs
align 262144 pre 1.25ms on 1.3ms post 1.28ms diff 28.2µs
align 131072 pre 1.25ms on 1.29ms post 1.28ms diff 24.8µs
align 65536 pre 1.15ms on 1.24ms post 1.26ms diff 34.5µs
align 32768 pre 1.17ms on 1.3ms post 1.26ms diff 82.6µs

Thing is that me here is not always at the same place :)

> With the correct guess, compare the performance you get using
>
> $ ERASESIZE=$[2*1024*1024] # replace with guess from flashbench -a
> $ ./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=4096 --erasesize=${ERASESIZE}
> $ ./flashbench /dev/sdb --open-au --open-au-nr=3 --blocksize=4096 --erasesize=${ERASESIZE}
> $ ./flashbench /dev/sdb --open-au --open-au-nr=5 --blocksize=4096 --erasesize=${ERASESIZE}
> $ ./flashbench /dev/sdb --open-au --open-au-nr=7 --blocksize=4096 --erasesize=${ERASESIZE}
> $ ./flashbench /dev/sdb --open-au --open-au-nr=13 --blocksize=4096 --erasesize=${ERASESIZE}

I omit this for now, cause I am not yet sure about the correct guess.

> The first one of those should always be the fastest, hopefully followed by
> some that are equally fast and then some much slower ones (especially for the
> smaller block sizes). The "active_logs=N" mount option should be one less
> than the highest number above that is still "fast", and only "2", "4" and "6"
> are valid at the moment. If you are lucky, your device is still fast with
> "--open-au-nr=7" and slow only for higher numbers, then the default of "6"
> is ok.
>
> If the erase size is larger than 2 MB, then you have to "-s" option in
> mkfs.f2fs to configure how many 2 MB segments there are in one erase block.
> For a 2 GB USB stick, I would guess that the erase block size is 1, 2 or
> 4 MB. Newer (larger) sticks will have larger erase blocks that may also
> be a multiple of 3 MB (3, 6, 12, or 24).

Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7

2012-11-12 16:57:15

by Arnd Bergmann

[permalink] [raw]

Subject: Re: [PATCH 00/16 v3] f2fs: introduce flash-friendly file system

On Monday 12 November 2012, Martin Steigerwald wrote:
> Am Samstag, 10. November 2012 schrieb Arnd Bergmann:

> > I would also recommend using flashbench to find out the optimum parameters
> > for your device. You can download it from
> > git://git.linaro.org/people/arnd/flashbench.git
> > In the long run, we should automate those tests and make them part of
> > mkfs.f2fs, but for now, try to find out the erase block size and the number
> > of concurrently used erase blocks on your device using a timing attack
> > in flashbench. The README file in there explains how to interpret the
> > results from "./flashbench -a /dev/sdb --blocksize=1024" to guess
> > the erase block size, although that sometimes doesn't work.
>
> Why do I use a blocksize of 1024 if the kernel reports me 512 byte blocks?

The blocksize you pass here is the size of writes that flashbench sends to the
kernel. Because of the algorithm used by flashbench, two hardware blocks
is the smallest size you can use here, and larger block tend to be less reliable
for this test case. I should probably change the default.

> [ 3112.144086] scsi9 : usb-storage 1-1.1:1.0
> [ 3113.145968] scsi 9:0:0:0: Direct-Access TinyDisk 2007-05-12 0.00 PQ: 0 ANSI: 2
> [ 3113.146476] sd 9:0:0:0: Attached scsi generic sg2 type 0
> [ 3113.147935] sd 9:0:0:0: [sdb] 4095999 512-byte logical blocks: (2.09 GB/1.95 GiB)
> [ 3113.148935] sd 9:0:0:0: [sdb] Write Protect is off
>
>
> And how do reads give information about erase block size? Wouldn´t writes me
> more conclusive for that? (Having to erase one versus two erase blocks?)

The --open-au tests can be more reliable, but also take more time and are
harder to understand. Using this test is faster and often gives an easy
answer even without destroying data on the device.

> Hmmm, I get very varying results here with said USB stick:
>
> merkaba:~> /tmp/flashbench -a /dev/sdb
> align 536870912 pre 1.1ms on 1.1ms post 1.08ms diff 13µs
> align 268435456 pre 1.2ms on 1.19ms post 1.16ms diff 11.6µs
> align 134217728 pre 1.12ms on 1.14ms post 1.15ms diff 9.51µs
> align 67108864 pre 1.12ms on 1.15ms post 1.12ms diff 29.9µs
> align 33554432 pre 1.11ms on 1.17ms post 1.13ms diff 49µs
> align 16777216 pre 1.14ms on 1.16ms post 1.15ms diff 22.4µs
> align 8388608 pre 1.12ms on 1.09ms post 1.06ms diff -2053ns
> align 4194304 pre 1.13ms on 1.16ms post 1.14ms diff 21.7µs
> align 2097152 pre 1.11ms on 1.08ms post 1.1ms diff -18488n
> align 1048576 pre 1.11ms on 1.11ms post 1.11ms diff -2461ns
> align 524288 pre 1.15ms on 1.17ms post 1.1ms diff 45.4µs
> align 262144 pre 1.11ms on 1.13ms post 1.13ms diff 12µs
> align 131072 pre 1.1ms on 1.09ms post 1.16ms diff -38025n
> align 65536 pre 1.09ms on 1.08ms post 1.11ms diff -21353n
> align 32768 pre 1.1ms on 1.08ms post 1.11ms diff -23854n
> merkaba:~> /tmp/flashbench -a /dev/sdb
> align 536870912 pre 1.11ms on 1.13ms post 1.13ms diff 10.6µs
> align 268435456 pre 1.12ms on 1.2ms post 1.17ms diff 61.4µs
> align 134217728 pre 1.14ms on 1.19ms post 1.15ms diff 46.8µs
> align 67108864 pre 1.08ms on 1.15ms post 1.08ms diff 63.8µs
> align 33554432 pre 1.09ms on 1.08ms post 1.09ms diff -4761ns
> align 16777216 pre 1.12ms on 1.14ms post 1.07ms diff 41.4µs
> align 8388608 pre 1.1ms on 1.1ms post 1.09ms diff 7.48µs
> align 4194304 pre 1.08ms on 1.1ms post 1.1ms diff 10.1µs
> align 2097152 pre 1.1ms on 1.11ms post 1.1ms diff 16µs
> align 1048576 pre 1.09ms on 1.1ms post 1.07ms diff 15.5µs
> align 524288 pre 1.12ms on 1.12ms post 1.1ms diff 11µs
> align 262144 pre 1.13ms on 1.13ms post 1.1ms diff 21.6µs
> align 131072 pre 1.11ms on 1.13ms post 1.12ms diff 17.9µs
> align 65536 pre 1.07ms on 1.1ms post 1.1ms diff 11.6µs
> align 32768 pre 1.09ms on 1.11ms post 1.13ms diff -5131ns
> merkaba:~> /tmp/flashbench -a /dev/sdb
> align 536870912 pre 1.2ms on 1.18ms post 1.21ms diff -27496n
> align 268435456 pre 1.22ms on 1.21ms post 1.24ms diff -18972n
> align 134217728 pre 1.15ms on 1.19ms post 1.14ms diff 42.5µs
> align 67108864 pre 1.08ms on 1.09ms post 1.08ms diff 5.29µs
> align 33554432 pre 1.18ms on 1.19ms post 1.18ms diff 9.25µs
> align 16777216 pre 1.18ms on 1.22ms post 1.17ms diff 48.6µs
> align 8388608 pre 1.14ms on 1.17ms post 1.19ms diff 4.36µs
> align 4194304 pre 1.16ms on 1.2ms post 1.11ms diff 65.8µs
> align 2097152 pre 1.13ms on 1.09ms post 1.12ms diff -37718n
> align 1048576 pre 1.15ms on 1.2ms post 1.18ms diff 34.9µs
> align 524288 pre 1.14ms on 1.19ms post 1.16ms diff 41.5µs
> align 262144 pre 1.19ms on 1.12ms post 1.15ms diff -52725n
> align 131072 pre 1.21ms on 1.11ms post 1.14ms diff -68522n
> align 65536 pre 1.21ms on 1.13ms post 1.18ms diff -64248n
> align 32768 pre 1.14ms on 1.25ms post 1.12ms diff 116µs
>
> Even when I apply the explaination of the README I do not seem to get a
> clear picture of the stick erase block size.
>
> The values above seem to indicate to me: I don´t care about alignment at all.

I think it's more a case of a device where reading does not easily reveal
the erase block boundaries, because the variance between multiple reads
is much higher than between different positions. You can try again using
"--blocksize=1024 --count=100", which will increase the accuracy of the
test.

On the other hand, the device size of "4095999 512-byte logical blocks"
is quite suspicious, because it's not an even number, where it should
be a multiple of erase blocks. It is one less sector than 1000 2MB blocks
(or 500 4MB blocks, for that matter), but it's not clear if that one
block is missing at the start or at the end of the drive.

> With another flash, likely slower Intenso 4GB stick I get:
>
> [ 3672.512143] scsi 10:0:0:0: Direct-Access Ut165 USB2FlashStorage 0.00 PQ: 0 ANSI: 2
> [ 3672.514469] sd 10:0:0:0: Attached scsi generic sg2 type 0
> [ 3672.514991] sd 10:0:0:0: [sdb] 7897088 512-byte logical blocks: (4.04 GB/3.76 GiB)
> […]

$ factor 7897088
7897088: 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 241

Slightly more helpful, this one has 241 32MB-blocks, so at least we know that the
erase block size is not larger than 32MB (which would be very unlikely anyway)
and not a multiple of 3.

> align 16777216 pre 939µs on 903µs post 880µs diff -5972ns
> align 8388608 pre 900µs on 914µs post 923µs diff 2.42µs
> align 4194304 pre 894µs on 886µs post 882µs diff -1563ns
>
> here?
>
> align 2097152 pre 829µs on 890µs post 874µs diff 37.8µs
> align 1048576 pre 899µs on 882µs post 843µs diff 11.1µs
> align 524288 pre 890µs on 887µs post 902µs diff -9005ns
> align 262144 pre 887µs on 887µs post 898µs diff -5474ns
> align 131072 pre 928µs on 895µs post 914µs diff -26028n
> align 65536 pre 898µs on 898µs post 894µs diff 2.59µs
> align 32768 pre 884µs on 891µs post 901µs diff -1284ns
>
>
> Similar picture. The diffs seem to be mostly quite small with only some
> micro seconds. Or am I misreading something?

Same thing, try again with the options I listed above.

> Then with a quite fast one 16 GB Transcend.
>
> [ 4055.393399] sd 11:0:0:0: Attached scsi generic sg2 type 0
> [ 4055.394729] sd 11:0:0:0: [sdb] 31375360 512-byte logical blocks: (16.0 GB/14.9 GiB)
> [ 4055.395262] sd 11:0:0:0: [sdb] Write Protect is off

$ factor 31375360
31375360: 2 2 2 2 2 2 2 2 2 2 2 2 2 2 5 383

That would be 5*383*16MB, so the erase block size will be a fraction of 16MB.

> merkaba:~> /tmp/flashbench -a /dev/sdb
> align 4294967296 pre 1.28ms on 1.48ms post 1.33ms diff 179µs
> align 2147483648 pre 1.32ms on 1.51ms post 1.33ms diff 181µs
> align 1073741824 pre 1.31ms on 1.46ms post 1.35ms diff 132µs
> align 536870912 pre 1.27ms on 1.52ms post 1.33ms diff 228µs
> align 268435456 pre 1.28ms on 1.46ms post 1.31ms diff 161µs
> align 134217728 pre 1.28ms on 1.44ms post 1.37ms diff 120µs
> align 67108864 pre 1.27ms on 1.44ms post 1.34ms diff 133µs
> align 33554432 pre 1.24ms on 1.42ms post 1.31ms diff 150µs
> align 16777216 pre 1.23ms on 1.46ms post 1.26ms diff 218µs
> align 8388608 pre 1.31ms on 1.5ms post 1.33ms diff 180µs
> align 4194304 pre 1.27ms on 1.45ms post 1.36ms diff 135µs
> align 2097152 pre 1.29ms on 1.37ms post 1.39ms diff 33.7µs
>
> here?
>
> align 1048576 pre 1.31ms on 1.44ms post 1.35ms diff 115µs
> align 524288 pre 1.33ms on 1.39ms post 1.48ms diff -12297n
> align 262144 pre 1.36ms on 1.42ms post 1.4ms diff 45.6µs
> align 131072 pre 1.37ms on 1.44ms post 1.4ms diff 57.7µs
> align 65536 pre 1.36ms on 1.35ms post 1.33ms diff 4.67µs
> align 32768 pre 1.32ms on 1.38ms post 1.34ms diff 44.1µs
> merkaba:~> /tmp/flashbench -a /dev/sdb
> align 4294967296 pre 1.36ms on 1.49ms post 1.34ms diff 139µs
> align 2147483648 pre 1.26ms on 1.48ms post 1.27ms diff 213µs
> align 1073741824 pre 1.26ms on 1.45ms post 1.33ms diff 164µs
> align 536870912 pre 1.22ms on 1.46ms post 1.35ms diff 173µs
> align 268435456 pre 1.34ms on 1.5ms post 1.31ms diff 172µs
> align 134217728 pre 1.34ms on 1.48ms post 1.31ms diff 157µs
> align 67108864 pre 1.29ms on 1.46ms post 1.34ms diff 142µs
> align 33554432 pre 1.28ms on 1.47ms post 1.31ms diff 173µs
> align 16777216 pre 1.26ms on 1.48ms post 1.37ms diff 168µs
> align 8388608 pre 1.31ms on 1.47ms post 1.36ms diff 139µs
> align 4194304 pre 1.26ms on 1.53ms post 1.33ms diff 237µs
> align 2097152 pre 1.34ms on 1.4ms post 1.36ms diff 56.4µs
> align 1048576 pre 1.32ms on 1.35ms post 1.37ms diff 638ns
>
> here?
>
> align 524288 pre 1.29ms on 1.47ms post 1.45ms diff 98.1µs
> align 262144 pre 1.35ms on 1.38ms post 1.42ms diff -11916n
> align 131072 pre 1.32ms on 1.46ms post 1.4ms diff 100µs
> align 65536 pre 1.35ms on 1.42ms post 1.43ms diff 30.8µs
> align 32768 pre 1.31ms on 1.37ms post 1.33ms diff 51µs
> merkaba:~> /tmp/flashbench -a /dev/sdb
> align 4294967296 pre 1.26ms on 1.49ms post 1.27ms diff 222µs
> align 2147483648 pre 1.25ms on 1.41ms post 1.37ms diff 97.3µs
> align 1073741824 pre 1.26ms on 1.47ms post 1.31ms diff 186µs
> align 536870912 pre 1.25ms on 1.42ms post 1.32ms diff 132µs
> align 268435456 pre 1.2ms on 1.44ms post 1.29ms diff 195µs
> align 134217728 pre 1.27ms on 1.43ms post 1.34ms diff 118µs
> align 67108864 pre 1.25ms on 1.45ms post 1.31ms diff 165µs
> align 33554432 pre 1.22ms on 1.36ms post 1.25ms diff 124µs
> align 16777216 pre 1.24ms on 1.44ms post 1.26ms diff 191µs
> align 8388608 pre 1.22ms on 1.39ms post 1.23ms diff 164µs
> align 4194304 pre 1.23ms on 1.43ms post 1.3ms diff 171µs
> align 2097152 pre 1.26ms on 1.3ms post 1.32ms diff 16.7µs
> align 1048576 pre 1.26ms on 1.27ms post 1.26ms diff 7.91µs
>
> here?
>
> align 524288 pre 1.24ms on 1.3ms post 1.3ms diff 29.2µs
> align 262144 pre 1.25ms on 1.3ms post 1.28ms diff 28.2µs
> align 131072 pre 1.25ms on 1.29ms post 1.28ms diff 24.8µs
> align 65536 pre 1.15ms on 1.24ms post 1.26ms diff 34.5µs
> align 32768 pre 1.17ms on 1.3ms post 1.26ms diff 82.6µs

This one is fairly deterministic, and I would assume it's 4MB, which always
has a much higher number in the last column than the 2MB one.
For a fast 16 GB stick, I also wouldn't expect smaller than 4 MB erase blocks.

> Thing is that me here is not always at the same place :)

If you add a '--count=N' argument, you can have flashbench run the test more
often and average between the runs. The default is 8.

> > With the correct guess, compare the performance you get using
> >
> > $ ERASESIZE=$[2*1024*1024] # replace with guess from flashbench -a
> > $ ./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=4096 --erasesize=${ERASESIZE}
> > $ ./flashbench /dev/sdb --open-au --open-au-nr=3 --blocksize=4096 --erasesize=${ERASESIZE}
> > $ ./flashbench /dev/sdb --open-au --open-au-nr=5 --blocksize=4096 --erasesize=${ERASESIZE}
> > $ ./flashbench /dev/sdb --open-au --open-au-nr=7 --blocksize=4096 --erasesize=${ERASESIZE}
> > $ ./flashbench /dev/sdb --open-au --open-au-nr=13 --blocksize=4096 --erasesize=${ERASESIZE}
>
> I omit this for now, cause I am not yet sure about the correct guess.

You can also try this test to find out the erase block size if the -a test fails.
Start with the largest possible value you'd expect (16 MB for a modern and fast
USB stick, less if it's older or smaller), and use --open-au-nr=1 to get a baseline:

./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=4096 --erasesize=$[16*1024*1024]

Every device should be able to handle this nicely with maximum throughput. The default is
to start the test at 16 MB into the device to get out of the way of a potential FAT
optimized area. You can change that offset to find where an erase block boundary is.
Adding '--offset=[24*1024*1024]' will still be fast if the erase block size is 8 MB,
but get slower and have more jitter if the size is actually 16 MB, because now we write
a 16 MB section of the drive with an 8 MB misalignment. The next ones to try after that
would be 20, 18, 17, 16.5, etc MB, to which will be slow for an 8,4, 2, an 1 MB erase
block size, respectively. You can also reduce the --erasesize argument there and do

./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=[16*1024*1024 --offset=[24*1024*1024]
./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=[8*1024*1024 --offset=[20*1024*1024]
./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=[4*1024*1024 --offset=[18*1024*1024]
./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=[2*1024*1024 --offset=[17*1024*1024]
./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=[1*1024*1024 --offset=[33*512*1024]

If you have the result from the other test to figure out the maximum value for
'--open-au-nr=N', using that number here will make this test more reliable as well.

Arnd

2012-11-14 15:57:45

by Martin Steigerwald

[permalink] [raw]

Subject: Re: [PATCH 00/16 v3] f2fs: introduce flash-friendly file system

Am Montag, 12. November 2012 schrieb Arnd Bergmann:
> On Monday 12 November 2012, Martin Steigerwald wrote:
> > Am Samstag, 10. November 2012 schrieb Arnd Bergmann:
>
> > > I would also recommend using flashbench to find out the optimum parameters
> > > for your device. You can download it from
> > > git://git.linaro.org/people/arnd/flashbench.git
> > > In the long run, we should automate those tests and make them part of
> > > mkfs.f2fs, but for now, try to find out the erase block size and the number
> > > of concurrently used erase blocks on your device using a timing attack
> > > in flashbench. The README file in there explains how to interpret the
> > > results from "./flashbench -a /dev/sdb --blocksize=1024" to guess
> > > the erase block size, although that sometimes doesn't work.
> >
> > Why do I use a blocksize of 1024 if the kernel reports me 512 byte blocks?
>
> The blocksize you pass here is the size of writes that flashbench sends to the
> kernel. Because of the algorithm used by flashbench, two hardware blocks
> is the smallest size you can use here, and larger block tend to be less reliable
> for this test case. I should probably change the default.
>
> > [ 3112.144086] scsi9 : usb-storage 1-1.1:1.0
> > [ 3113.145968] scsi 9:0:0:0: Direct-Access TinyDisk 2007-05-12 0.00 PQ: 0 ANSI: 2
> > [ 3113.146476] sd 9:0:0:0: Attached scsi generic sg2 type 0
> > [ 3113.147935] sd 9:0:0:0: [sdb] 4095999 512-byte logical blocks: (2.09 GB/1.95 GiB)
> > [ 3113.148935] sd 9:0:0:0: [sdb] Write Protect is off
> >
> >
> > And how do reads give information about erase block size? Wouldn´t writes me
> > more conclusive for that? (Having to erase one versus two erase blocks?)
>
> The --open-au tests can be more reliable, but also take more time and are
> harder to understand. Using this test is faster and often gives an easy
> answer even without destroying data on the device.
>
>
> > Hmmm, I get very varying results here with said USB stick:
> >
> > merkaba:~> /tmp/flashbench -a /dev/sdb
> > align 536870912 pre 1.1ms on 1.1ms post 1.08ms diff 13µs
> > align 268435456 pre 1.2ms on 1.19ms post 1.16ms diff 11.6µs
> > align 134217728 pre 1.12ms on 1.14ms post 1.15ms diff 9.51µs
> > align 67108864 pre 1.12ms on 1.15ms post 1.12ms diff 29.9µs
> > align 33554432 pre 1.11ms on 1.17ms post 1.13ms diff 49µs
> > align 16777216 pre 1.14ms on 1.16ms post 1.15ms diff 22.4µs
> > align 8388608 pre 1.12ms on 1.09ms post 1.06ms diff -2053ns
> > align 4194304 pre 1.13ms on 1.16ms post 1.14ms diff 21.7µs
> > align 2097152 pre 1.11ms on 1.08ms post 1.1ms diff -18488n
> > align 1048576 pre 1.11ms on 1.11ms post 1.11ms diff -2461ns
> > align 524288 pre 1.15ms on 1.17ms post 1.1ms diff 45.4µs
> > align 262144 pre 1.11ms on 1.13ms post 1.13ms diff 12µs
> > align 131072 pre 1.1ms on 1.09ms post 1.16ms diff -38025n
> > align 65536 pre 1.09ms on 1.08ms post 1.11ms diff -21353n
> > align 32768 pre 1.1ms on 1.08ms post 1.11ms diff -23854n
> > merkaba:~> /tmp/flashbench -a /dev/sdb
> > align 536870912 pre 1.11ms on 1.13ms post 1.13ms diff 10.6µs
> > align 268435456 pre 1.12ms on 1.2ms post 1.17ms diff 61.4µs
> > align 134217728 pre 1.14ms on 1.19ms post 1.15ms diff 46.8µs
> > align 67108864 pre 1.08ms on 1.15ms post 1.08ms diff 63.8µs
> > align 33554432 pre 1.09ms on 1.08ms post 1.09ms diff -4761ns
> > align 16777216 pre 1.12ms on 1.14ms post 1.07ms diff 41.4µs
> > align 8388608 pre 1.1ms on 1.1ms post 1.09ms diff 7.48µs
> > align 4194304 pre 1.08ms on 1.1ms post 1.1ms diff 10.1µs
> > align 2097152 pre 1.1ms on 1.11ms post 1.1ms diff 16µs
> > align 1048576 pre 1.09ms on 1.1ms post 1.07ms diff 15.5µs
> > align 524288 pre 1.12ms on 1.12ms post 1.1ms diff 11µs
> > align 262144 pre 1.13ms on 1.13ms post 1.1ms diff 21.6µs
> > align 131072 pre 1.11ms on 1.13ms post 1.12ms diff 17.9µs
> > align 65536 pre 1.07ms on 1.1ms post 1.1ms diff 11.6µs
> > align 32768 pre 1.09ms on 1.11ms post 1.13ms diff -5131ns
> > merkaba:~> /tmp/flashbench -a /dev/sdb
> > align 536870912 pre 1.2ms on 1.18ms post 1.21ms diff -27496n
> > align 268435456 pre 1.22ms on 1.21ms post 1.24ms diff -18972n
> > align 134217728 pre 1.15ms on 1.19ms post 1.14ms diff 42.5µs
> > align 67108864 pre 1.08ms on 1.09ms post 1.08ms diff 5.29µs
> > align 33554432 pre 1.18ms on 1.19ms post 1.18ms diff 9.25µs
> > align 16777216 pre 1.18ms on 1.22ms post 1.17ms diff 48.6µs
> > align 8388608 pre 1.14ms on 1.17ms post 1.19ms diff 4.36µs
> > align 4194304 pre 1.16ms on 1.2ms post 1.11ms diff 65.8µs
> > align 2097152 pre 1.13ms on 1.09ms post 1.12ms diff -37718n
> > align 1048576 pre 1.15ms on 1.2ms post 1.18ms diff 34.9µs
> > align 524288 pre 1.14ms on 1.19ms post 1.16ms diff 41.5µs
> > align 262144 pre 1.19ms on 1.12ms post 1.15ms diff -52725n
> > align 131072 pre 1.21ms on 1.11ms post 1.14ms diff -68522n
> > align 65536 pre 1.21ms on 1.13ms post 1.18ms diff -64248n
> > align 32768 pre 1.14ms on 1.25ms post 1.12ms diff 116µs
> >
> > Even when I apply the explaination of the README I do not seem to get a
> > clear picture of the stick erase block size.
> >
> > The values above seem to indicate to me: I don´t care about alignment at all.
>
> I think it's more a case of a device where reading does not easily reveal
> the erase block boundaries, because the variance between multiple reads
> is much higher than between different positions. You can try again using
> "--blocksize=1024 --count=100", which will increase the accuracy of the
> test.
>
> On the other hand, the device size of "4095999 512-byte logical blocks"
> is quite suspicious, because it's not an even number, where it should
> be a multiple of erase blocks. It is one less sector than 1000 2MB blocks
> (or 500 4MB blocks, for that matter), but it's not clear if that one
> block is missing at the start or at the end of the drive.

Just for this first flash drive, I think the erase block size if 4 MiB. The
-a count=100/100 tests did not show any obvious results, but the
--open-au ones did, I think. I would use two open allocation units (AUs).

Maybe also 1 AU, cause 64 KiB sized accesses are faster that way?

Well I tend to use one AU. So that device would be more suitable for FAT
than for BTRFS. Or more suitable for F2FS that is.

What do you think?

Only thing that seems to contradict this is the test with different
alignments below.

merkaba:~#254> /tmp/flashbench -a /dev/sdb --count=100
align 536870912 pre 1.06ms on 1.07ms post 1.04ms diff 14.6µs
align 268435456 pre 1.09ms on 1.1ms post 1.09ms diff 11.3µs
align 134217728 pre 1.09ms on 1.09ms post 1.1ms diff -87ns
align 67108864 pre 1.05ms on 1.06ms post 1.03ms diff 15.9µs
align 33554432 pre 1.06ms on 1.06ms post 1.03ms diff 18.7µs
align 16777216 pre 1.05ms on 1.05ms post 1.03ms diff 13.3µs
align 8388608 pre 1.05ms on 1.06ms post 1.04ms diff 9.03µs
align 4194304 pre 1.06ms on 1.06ms post 1.04ms diff 8.56µs
align 2097152 pre 1.06ms on 1.05ms post 1.05ms diff 2.02µs
align 1048576 pre 1.05ms on 1.04ms post 1.06ms diff -11524n
align 524288 pre 1.05ms on 1.05ms post 1.04ms diff 642ns
align 262144 pre 1.04ms on 1.04ms post 1.04ms diff -604ns
align 131072 pre 1.03ms on 1.04ms post 1.04ms diff 2.79µs
align 65536 pre 1.04ms on 1.05ms post 1.05ms diff 7.2µs
align 32768 pre 1.05ms on 1.05ms post 1.05ms diff -4475ns
merkaba:~> /tmp/flashbench -a /dev/sdb --count=1000
align 536870912 pre 1.03ms on 1.05ms post 1.02ms diff 20.3µs
align 268435456 pre 1.06ms on 1.05ms post 1.04ms diff 3.14µs
align 134217728 pre 1.07ms on 1.08ms post 1.05ms diff 16.1µs
align 67108864 pre 1.03ms on 1.03ms post 1.02ms diff 11µs
align 33554432 pre 1.02ms on 1.03ms post 1.01ms diff 10.3µs
align 16777216 pre 1.03ms on 1.04ms post 1.02ms diff 9.68µs
align 8388608 pre 1.04ms on 1.03ms post 1.02ms diff 6.45µs
align 4194304 pre 1.03ms on 1.04ms post 1.02ms diff 9.12µs
align 2097152 pre 1.04ms on 1.04ms post 1.02ms diff 15.4µs
align 1048576 pre 1.03ms on 1.03ms post 1.03ms diff -1590ns
align 524288 pre 1.03ms on 1.03ms post 1.03ms diff -835ns
align 262144 pre 1.04ms on 1.04ms post 1.03ms diff 1.25µs
align 131072 pre 1.03ms on 1.03ms post 1.03ms diff -3477ns
align 65536 pre 1.03ms on 1.03ms post 1.03ms diff 191ns
align 32768 pre 1.03ms on 1.04ms post 1.03ms diff 4.06µs
merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=4096 --erasesize=$[16*1024*1024]
16MiB 15M/s
8MiB 3.44M/s
4MiB 13.9M/s
2MiB 13M/s
1MiB 15M/s
512KiB 3.3M/s
256KiB 6.55M/s
128KiB 4.17M/s
64KiB 13.5M/s
32KiB 2.15M/s
16KiB 1.83M/s
8KiB 1.25M/s
4KiB 731K/s
merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=4096 --erasesize=$[8*1024*1024]
8MiB 14.6M/s
4MiB 8.11M/s
2MiB 12.5M/s
1MiB 15.1M/s
512KiB 3.29M/s
256KiB 6.54M/s
128KiB 4.16M/s
64KiB 13.4M/s
32KiB 2.14M/s
16KiB 1.81M/s
8KiB 1.23M/s
4KiB 722K/s
merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=4096 --erasesize=$[4*1024*1024]
4MiB 14M/s
2MiB 13M/s
1MiB 15M/s
512KiB 3.26M/s
256KiB 6.57M/s
128KiB 4.2M/s
64KiB 13.3M/s
32KiB 2.13M/s
16KiB 1.82M/s
8KiB 1.24M/s
4KiB 724K/s
merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=4096 --erasesize=$[2*1024*1024]
2MiB 13.1M/s
1MiB 15.2M/s
512KiB 3.22M/s
256KiB 6.57M/s
128KiB 4.2M/s
64KiB 13.3M/s
32KiB 2.11M/s
16KiB 1.82M/s
8KiB 1.24M/s
4KiB 725K/s
merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=4096 --erasesize=$[2*1024*1024]
2MiB 13.1M/s
1MiB 14.9M/s
512KiB 3.21M/s
256KiB 6.61M/s
128KiB 4.19M/s
64KiB 13.3M/s
32KiB 2.11M/s
16KiB 1.82M/s
8KiB 1.23M/s
4KiB 726K/s
merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=4096 --erasesize=$[1*1024*1024]
1MiB 14.9M/s
512KiB 3.12M/s
256KiB 6.64M/s
128KiB 4.2M/s
64KiB 13.4M/s
32KiB 2.07M/s
16KiB 1.82M/s
8KiB 1.24M/s
4KiB 725K/s
merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=2 --blocksize=4096 --erasesize=$[4*1024*1024]
4MiB 14.2M/s
2MiB 13.1M/s
1MiB 5.58M/s
512KiB 3.43M/s
256KiB 6.58M/s
128KiB 4.18M/s
64KiB 5.06M/s
32KiB 2.14M/s
16KiB 1.82M/s
8KiB 1.24M/s
4KiB 724K/s
merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=2 --blocksize=4096 --erasesize=$[16*1024*1024]
16MiB 5.68M/s
8MiB 4.3M/s
4MiB 14.2M/s
2MiB 13.1M/s
1MiB 5.6M/s
512KiB 3.35M/s
256KiB 6.61M/s
128KiB 4.19M/s
64KiB 5.07M/s
32KiB 2.16M/s
16KiB 1.82M/s
8KiB 1.24M/s
4KiB 726K/s
merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=3 --blocksize=4096 --erasesize=$[16*1024*1024]
16MiB 7.18M/s
8MiB 14.6M/s
4MiB 14.1M/s
2MiB 13M/s
1MiB 6.39M/s
512KiB 8.77M/s
256KiB 6.13M/s
128KiB 3.81M/s
64KiB 2.37M/s
32KiB 1.15M/s
16KiB 648K/s
8KiB 344K/s
4KiB 180K/s
merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=4096 --erasesize=$[16*1024*1024]
16MiB 15.3M/s
8MiB 3.48M/s
4MiB 14.3M/s
2MiB 13.2M/s
1MiB 15.2M/s
512KiB 3.33M/s
256KiB 6.64M/s
128KiB 4.2M/s
64KiB 13.6M/s
32KiB 2.15M/s
16KiB 1.83M/s
8KiB 1.24M/s
4KiB 727K/s
merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=2 --blocksize=4096 --erasesize=$[16*1024*1024]
16MiB 5.72M/s
8MiB 4.33M/s
4MiB 14.3M/s
2MiB 13.3M/s
1MiB 5.66M/s
512KiB 3.38M/s
256KiB 6.68M/s
128KiB 4.24M/s
64KiB 5.12M/s
32KiB 2.17M/s
16KiB 1.83M/s
8KiB 1.24M/s
4KiB 729K/s
merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=4096 --erasesize=$[4*1024*1024]
4MiB 14M/s
2MiB 13.3M/s
1MiB 15.3M/s
512KiB 3.29M/s
256KiB 6.68M/s
128KiB 4.22M/s
64KiB 13.7M/s
32KiB 2.15M/s
16KiB 1.83M/s
8KiB 1.24M/s
4KiB 729K/s
merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=2 --blocksize=4096 --erasesize=$[4*1024*1024]
4MiB 14.1M/s
2MiB 13.1M/s
1MiB 5.62M/s
512KiB 3.43M/s
256KiB 6.63M/s
128KiB 4.2M/s
64KiB 5.11M/s
32KiB 2.15M/s
16KiB 1.82M/s
8KiB 1.24M/s
4KiB 727K/s

merkaba:~#130> /tmp/flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=$[16*1024*1024]
16MiB 15.2M/s
8MiB 3.46M/s
4MiB 14.2M/s
2MiB 13.1M/s
1MiB 15.1M/s
512KiB 3.31M/s
256KiB 6.59M/s
128KiB 4.19M/s
64KiB 13.5M/s
merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=2 --blocksize=65536 --erasesize=$[16*1024*1024]
16MiB 5.68M/s
8MiB 4.31M/s
4MiB 14.2M/s
2MiB 13.2M/s
1MiB 5.62M/s
512KiB 3.36M/s
256KiB 6.63M/s
128KiB 4.21M/s
64KiB 5.09M/s

But then I tried with offset and get:

> > > With the correct guess, compare the performance you get using
> > >
> > > $ ERASESIZE=$[2*1024*1024] # replace with guess from flashbench -a
> > > $ ./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=4096 --erasesize=${ERASESIZE}
> > > $ ./flashbench /dev/sdb --open-au --open-au-nr=3 --blocksize=4096 --erasesize=${ERASESIZE}
> > > $ ./flashbench /dev/sdb --open-au --open-au-nr=5 --blocksize=4096 --erasesize=${ERASESIZE}
> > > $ ./flashbench /dev/sdb --open-au --open-au-nr=7 --blocksize=4096 --erasesize=${ERASESIZE}
> > > $ ./flashbench /dev/sdb --open-au --open-au-nr=13 --blocksize=4096 --erasesize=${ERASESIZE}
> >
> > I omit this for now, cause I am not yet sure about the correct guess.
>
> You can also try this test to find out the erase block size if the -a test fails.
> Start with the largest possible value you'd expect (16 MB for a modern and fast
> USB stick, less if it's older or smaller), and use --open-au-nr=1 to get a baseline:
>
> ./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=4096 --erasesize=$[16*1024*1024]
>
> Every device should be able to handle this nicely with maximum throughput. The default is
> to start the test at 16 MB into the device to get out of the way of a potential FAT
> optimized area. You can change that offset to find where an erase block boundary is.
> Adding '--offset=[24*1024*1024]' will still be fast if the erase block size is 8 MB,
> but get slower and have more jitter if the size is actually 16 MB, because now we write
> a 16 MB section of the drive with an 8 MB misalignment. The next ones to try after that
> would be 20, 18, 17, 16.5, etc MB, to which will be slow for an 8,4, 2, an 1 MB erase
> block size, respectively. You can also reduce the --erasesize argument there and do
>
> ./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=[16*1024*1024 --offset=[24*1024*1024]
> ./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=[8*1024*1024 --offset=[20*1024*1024]
> ./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=[4*1024*1024 --offset=[18*1024*1024]
> ./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=[2*1024*1024 --offset=[17*1024*1024]
> ./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=[1*1024*1024 --offset=[33*512*1024]
>
> If you have the result from the other test to figure out the maximum value for
> '--open-au-nr=N', using that number here will make this test more reliable as well.

merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=1 --offset $[8*1024*1024] --erasesize=$[16*1024*1024]
16MiB 15.1M/s
8MiB 3.45M/s
4MiB 14M/s
2MiB 13.1M/s
1MiB 15.2M/s
512KiB 3.31M/s
256KiB 6.55M/s
128KiB 4.18M/s
64KiB 13.4M/s
32KiB 2.14M/s
16KiB 1.81M/s
merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=1 --offset $[1*1024*1024] --erasesize=$[4*1024*1024]
4MiB 14.1M/s
2MiB 13M/s
1MiB 14.9M/s
512KiB 3.25M/s
256KiB 6.56M/s
128KiB 4.16M/s
64KiB 13.4M/s
32KiB 2.13M/s
16KiB 1.81M/s
merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=1 --offset $[2*1024*1024] --erasesize=$[4*1024*1024]
4MiB 14M/s
2MiB 13M/s
1MiB 15.1M/s
512KiB 3.25M/s
256KiB 6.58M/s
128KiB 4.18M/s
64KiB 13.5M/s
32KiB 2.13M/s
16KiB 1.82M/s

So this does seem to me that the device quite likes 4 MiB sized, but doesn´t
care too much about their alignment?

merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=1 --offset $[78*1024] --erasesize=$[4*1024*1024]
4MiB 14.2M/s
2MiB 13.3M/s
1MiB 15.1M/s
512KiB 3.42M/s
256KiB 6.6M/s
128KiB 4.22M/s
64KiB 13.5M/s
32KiB 2.17M/s
16KiB 1.84M/s

Its seem thats a kinda special USB stick.

Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7

2012-11-16 21:26:26

by Arnd Bergmann

[permalink] [raw]

Subject: Re: [PATCH 00/16 v3] f2fs: introduce flash-friendly file system

On Wednesday 14 November 2012, Martin Steigerwald wrote:
> Am Montag, 12. November 2012 schrieb Arnd Bergmann:
> > On Monday 12 November 2012, Martin Steigerwald wrote:
> > > Am Samstag, 10. November 2012 schrieb Arnd Bergmann:

> > > Even when I apply the explaination of the README I do not seem to get a
> > > clear picture of the stick erase block size.
> > >
> > > The values above seem to indicate to me: I don´t care about alignment at all.
> >
> > I think it's more a case of a device where reading does not easily reveal
> > the erase block boundaries, because the variance between multiple reads
> > is much higher than between different positions. You can try again using
> > "--blocksize=1024 --count=100", which will increase the accuracy of the
> > test.
> >
> > On the other hand, the device size of "4095999 512-byte logical blocks"
> > is quite suspicious, because it's not an even number, where it should
> > be a multiple of erase blocks. It is one less sector than 1000 2MB blocks
> > (or 500 4MB blocks, for that matter), but it's not clear if that one
> > block is missing at the start or at the end of the drive.
>
> Just for this first flash drive, I think the erase block size if 4 MiB. The
> -a count=100/100 tests did not show any obvious results, but the
> --open-au ones did, I think. I would use two open allocation units (AUs).
>
> Maybe also 1 AU, cause 64 KiB sized accesses are faster that way?
>
> Well I tend to use one AU. So that device would be more suitable for FAT
> than for BTRFS. Or more suitable for F2FS that is.
>
> What do you think?
>
> Only thing that seems to contradict this is the test with different
> alignments below.
>
>
> merkaba:~#254> /tmp/flashbench -a /dev/sdb --count=100

You should really pass "--blocksize=1024" here, which makes the results
much more accurate. Still, there are some devices where the -a test
doesn't give anything useful at all.

> align 536870912 pre 1.06ms on 1.07ms post 1.04ms diff 14.6µs
> align 268435456 pre 1.09ms on 1.1ms post 1.09ms diff 11.3µs
> align 134217728 pre 1.09ms on 1.09ms post 1.1ms diff -87ns
> align 67108864 pre 1.05ms on 1.06ms post 1.03ms diff 15.9µs
> align 33554432 pre 1.06ms on 1.06ms post 1.03ms diff 18.7µs
> align 16777216 pre 1.05ms on 1.05ms post 1.03ms diff 13.3µs
> align 8388608 pre 1.05ms on 1.06ms post 1.04ms diff 9.03µs
> align 4194304 pre 1.06ms on 1.06ms post 1.04ms diff 8.56µs
> align 2097152 pre 1.06ms on 1.05ms post 1.05ms diff 2.02µs
> align 1048576 pre 1.05ms on 1.04ms post 1.06ms diff -11524n
> align 524288 pre 1.05ms on 1.05ms post 1.04ms diff 642ns
> align 262144 pre 1.04ms on 1.04ms post 1.04ms diff -604ns
> align 131072 pre 1.03ms on 1.04ms post 1.04ms diff 2.79µs
> align 65536 pre 1.04ms on 1.05ms post 1.05ms diff 7.2µs
> align 32768 pre 1.05ms on 1.05ms post 1.05ms diff -4475ns

This looks like a 4 MB size.

> merkaba:~> /tmp/flashbench -a /dev/sdb --count=1000
> align 536870912 pre 1.03ms on 1.05ms post 1.02ms diff 20.3µs
> align 268435456 pre 1.06ms on 1.05ms post 1.04ms diff 3.14µs
> align 134217728 pre 1.07ms on 1.08ms post 1.05ms diff 16.1µs
> align 67108864 pre 1.03ms on 1.03ms post 1.02ms diff 11µs
> align 33554432 pre 1.02ms on 1.03ms post 1.01ms diff 10.3µs
> align 16777216 pre 1.03ms on 1.04ms post 1.02ms diff 9.68µs
> align 8388608 pre 1.04ms on 1.03ms post 1.02ms diff 6.45µs
> align 4194304 pre 1.03ms on 1.04ms post 1.02ms diff 9.12µs
> align 2097152 pre 1.04ms on 1.04ms post 1.02ms diff 15.4µs
> align 1048576 pre 1.03ms on 1.03ms post 1.03ms diff -1590ns
> align 524288 pre 1.03ms on 1.03ms post 1.03ms diff -835ns
> align 262144 pre 1.04ms on 1.04ms post 1.03ms diff 1.25µs
> align 131072 pre 1.03ms on 1.03ms post 1.03ms diff -3477ns
> align 65536 pre 1.03ms on 1.03ms post 1.03ms diff 191ns
> align 32768 pre 1.03ms on 1.04ms post 1.03ms diff 4.06µs

And this doesn't. I would guess 2 MB from the above.

> merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=2 --blocksize=4096 --erasesize=$[16*1024*1024]
> 16MiB 5.68M/s
> 8MiB 4.3M/s
> 4MiB 14.2M/s
> 2MiB 13.1M/s
> 1MiB 5.6M/s
> 512KiB 3.35M/s
> 256KiB 6.61M/s
> 128KiB 4.19M/s
> 64KiB 5.07M/s
> 32KiB 2.16M/s
> 16KiB 1.82M/s
> 8KiB 1.24M/s
> 4KiB 726K/s
> merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=3 --blocksize=4096 --erasesize=$[16*1024*1024]
> 16MiB 7.18M/s
> 8MiB 14.6M/s
> 4MiB 14.1M/s
> 2MiB 13M/s
> 1MiB 6.39M/s
> 512KiB 8.77M/s
> 256KiB 6.13M/s
> 128KiB 3.81M/s
> 64KiB 2.37M/s
> 32KiB 1.15M/s
> 16KiB 648K/s
> 8KiB 344K/s
> 4KiB 180K/s

This shows clearly how the device cannot handle more than 2 erase blocks, as you correctly
pointed out. I'm guessing that it does have a FAT optimized area in the front, so it should
work fine if you mount f2fs with just two active logs.

> But then I tried with offset and get:
>
> > > > With the correct guess, compare the performance you get using
> > > >
> > > > $ ERASESIZE=$[2*1024*1024] # replace with guess from flashbench -a
> > > > $ ./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=4096 --erasesize=${ERASESIZE}
> > > > $ ./flashbench /dev/sdb --open-au --open-au-nr=3 --blocksize=4096 --erasesize=${ERASESIZE}
> > > > $ ./flashbench /dev/sdb --open-au --open-au-nr=5 --blocksize=4096 --erasesize=${ERASESIZE}
> > > > $ ./flashbench /dev/sdb --open-au --open-au-nr=7 --blocksize=4096 --erasesize=${ERASESIZE}
> > > > $ ./flashbench /dev/sdb --open-au --open-au-nr=13 --blocksize=4096 --erasesize=${ERASESIZE}
> > >
> > > I omit this for now, cause I am not yet sure about the correct guess.
> >
> > You can also try this test to find out the erase block size if the -a test fails.
> > Start with the largest possible value you'd expect (16 MB for a modern and fast
> > USB stick, less if it's older or smaller), and use --open-au-nr=1 to get a baseline:
> >
> > ./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=4096 --erasesize=$[16*1024*1024]
> >
> > Every device should be able to handle this nicely with maximum throughput. The default is
> > to start the test at 16 MB into the device to get out of the way of a potential FAT
> > optimized area. You can change that offset to find where an erase block boundary is.
> > Adding '--offset=[24*1024*1024]' will still be fast if the erase block size is 8 MB,
> > but get slower and have more jitter if the size is actually 16 MB, because now we write
> > a 16 MB section of the drive with an 8 MB misalignment. The next ones to try after that
> > would be 20, 18, 17, 16.5, etc MB, to which will be slow for an 8,4, 2, an 1 MB erase
> > block size, respectively. You can also reduce the --erasesize argument there and do
> >
> > ./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=[16*1024*1024 --offset=[24*1024*1024]
> > ./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=[8*1024*1024 --offset=[20*1024*1024]
> > ./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=[4*1024*1024 --offset=[18*1024*1024]
> > ./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=[2*1024*1024 --offset=[17*1024*1024]
> > ./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=[1*1024*1024 --offset=[33*512*1024]
> >
> > If you have the result from the other test to figure out the maximum value for
> > '--open-au-nr=N', using that number here will make this test more reliable as well.
>
>
>
> merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=1 --offset $[8*1024*1024] --erasesize=$[16*1024*1024]
> 16MiB 15.1M/s
> 8MiB 3.45M/s
> 4MiB 14M/s
> 2MiB 13.1M/s
> 1MiB 15.2M/s
> 512KiB 3.31M/s
> 256KiB 6.55M/s
> 128KiB 4.18M/s
> 64KiB 13.4M/s
> 32KiB 2.14M/s
> 16KiB 1.81M/s
> merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=1 --offset $[1*1024*1024] --erasesize=$[4*1024*1024]
> 4MiB 14.1M/s
> 2MiB 13M/s
> 1MiB 14.9M/s
> 512KiB 3.25M/s
> 256KiB 6.56M/s
> 128KiB 4.16M/s
> 64KiB 13.4M/s
> 32KiB 2.13M/s
> 16KiB 1.81M/s

As I mentioned, the beginning of the drive is likely different from the rest, and deals differently with
random I/O to optimize for the FAT file system. That's why I suggested using 17MB offset rather than 1MB.

> merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=1 --offset $[2*1024*1024] --erasesize=$[4*1024*1024]
> 4MiB 14M/s
> 2MiB 13M/s
> 1MiB 15.1M/s
> 512KiB 3.25M/s
> 256KiB 6.58M/s
> 128KiB 4.18M/s
> 64KiB 13.5M/s
> 32KiB 2.13M/s
> 16KiB 1.82M/s
>
>
> So this does seem to me that the device quite likes 4 MiB sized, but doesn´t
> care too much about their alignment?

I think we can assume that any I/O over 1MB is fast within the first few MB of the device
based on these results.

> merkaba:~> /tmp/flashbench /dev/sdb --open-au --open-au-nr=1 --offset $[78*1024] --erasesize=$[4*1024*1024]
> 4MiB 14.2M/s
> 2MiB 13.3M/s
> 1MiB 15.1M/s
> 512KiB 3.42M/s
> 256KiB 6.6M/s
> 128KiB 4.22M/s
> 64KiB 13.5M/s
> 32KiB 2.17M/s
> 16KiB 1.84M/s
>
> Its seem thats a kinda special USB stick.

Having fast 64KB I/O is also quite common, but I agree that it's not obviously
showing the patterns that we expect for f2fs. Especially the 2 erase block limit
is problematic. If this is still the 2GB stick, it may be more helpful to play
with a different one that is larger. Many manufacturers have changed their
underlying technology between 2GB and 4GB (even more so in SD cards), and the
newer devices are more interesting because any small ones will soon be
gone from the market.

Arnd

2012-11-23 00:23:34

[permalink] [raw]

Subject: util-linux bug: was Re: [PATCH 00/16 v3] f2fs: introduce flash-friendly file system

On Sun, 11 Nov 2012 20:42:48 +0900 Jaegeuk Kim <[email protected]> wrote:

> 2012-11-11 (일), 00:55 +0300, Vyacheslav Dubeyko:
> > Hi,
> >
> > On Nov 10, 2012, at 9:33 PM, Martin Steigerwald wrote:
> >
> > [snip]
> > >
> > > merkaba:~> mkfs.f2fs /dev/sdb1
> > > Info: sector size = 512
> > > Info: total sectors = 4093951 (in 512bytes)
> > > Info: zone aligned segment0 blkaddr: 256
> > > Info: This device doesn't support TRIM
> > > Info: format successful
> > > merkaba:~> mount /dev/sdb1 /mnt/zeit
> > > mount: you must specify the filesystem type
> > > merkaba:~#32> mount -t f2fs /dev/sdb1 /mnt/zeit
> > > merkaba:~> df -hT /mnt/zeit
> > > Dateisystem Typ Größe Benutzt Verf. Verw% Eingehängt auf
> > > /dev/sdb1 f2fs 2,0G 147M 1,8G 8% /mnt/zeit
> > >
> >
> > Do you really have trouble with f2fs mount without definition of filesystem type? If so, it is a bug.
>
> Could you explain this bug in mor)e detail?
> Or, can anyone comment this?
>
> Actually, I've looking at the *mount* in linux-utils.
> I suspect something does not support f2fs in linux-utils.
>

This is a bug in util-linux.

If 'blkid' doesn't recognise the filesystem, mount reads /etc/filesystems
and tries everything listed there. If that file ends with a '*', it should
read /proc/filesystems and try everything else listed that (that doesn't
start with 'nodev').

However it currently ignores the '*' and just adds everything
from /proc/filesystems. So the list it uses will have a '*' in the
middle, which will be before 'f2fs' appears.

When it tries each filesystem type in turn, it will abort if it get any error
other than EINVAL.
When it tries with filesystem type '*' it gets 'ENODEV' and so gives up.

The following patch (against git://github.com/karelzak/util-linux.git) fixes
it for me, though I suspect it should possible ignore 'ENODEV' too -
otherwise if /etc/filesystems contains an invalid filesystem name, it will
silently ignore all others.

As a quick hack, just add 'f2fs' to /etc/filesystems

NeilBrown

Signed-off-by: NeilBrown <[email protected]>
diff --git a/libmount/src/utils.c b/libmount/src/utils.c
index 624633d..054f266 100644
--- a/libmount/src/utils.c
+++ b/libmount/src/utils.c
@@ -462,6 +462,10 @@ static int get_filesystems(const char *filename, char ***filesystems, const char
continue;
if (sscanf(line, " %128[^\n ]\n", name) != 1)
continue;
+ if (strcmp(name, "*") == 0) {
+ rc = 1;
+ break;
+ }
if (pattern && !mnt_match_fstype(name, pattern))
continue;
rc = add_filesystem(filesystems, name);
@@ -486,7 +490,7 @@ int mnt_get_filesystems(char ***filesystems, const char *pattern)
*filesystems = NULL;

rc = get_filesystems(_PATH_FILESYSTEMS, filesystems, pattern);
- if (rc)
+ if (rc != 1)
return rc;
return get_filesystems(_PATH_PROC_FILESYSTEMS, filesystems, pattern);
}

Attachments:

signature.asc (828.00 B)

2012-11-26 13:27:56

[permalink] [raw]

Subject: Re: util-linux bug: was Re: [PATCH 00/16 v3] f2fs: introduce flash-friendly file system

On Fri, Nov 23, 2012 at 11:23:09AM +1100, NeilBrown wrote:
> This is a bug in util-linux.
>
> If 'blkid' doesn't recognise the filesystem, mount reads /etc/filesystems
> and tries everything listed there. If that file ends with a '*', it should
> read /proc/filesystems and try everything else listed that (that doesn't
> start with 'nodev').
>
> However it currently ignores the '*' and just adds everything

Fixed. Thanks for the report!

> The following patch (against git://github.com/karelzak/util-linux.git) fixes
> it for me, though I suspect it should possible ignore 'ENODEV' too -
> otherwise if /etc/filesystems contains an invalid filesystem name, it will
> silently ignore all others.

Good idea, implemented.

... all will be fixed in stable v2.22.2 release.

Karel

--
Karel Zak <[email protected]>
http://karelzak.blogspot.com

2012-11-26 21:06:56

[permalink] [raw]

Subject: Re: util-linux bug: was Re: [PATCH 00/16 v3] f2fs: introduce flash-friendly file system

On Mon, 26 Nov 2012 14:27:00 +0100 Karel Zak <[email protected]> wrote:

> On Fri, Nov 23, 2012 at 11:23:09AM +1100, NeilBrown wrote:
> > This is a bug in util-linux.
> >
> > If 'blkid' doesn't recognise the filesystem, mount reads /etc/filesystems
> > and tries everything listed there. If that file ends with a '*', it should
> > read /proc/filesystems and try everything else listed that (that doesn't
> > start with 'nodev').
> >
> > However it currently ignores the '*' and just adds everything
>
> Fixed. Thanks for the report!
>
> > The following patch (against git://github.com/karelzak/util-linux.git) fixes
> > it for me, though I suspect it should possible ignore 'ENODEV' too -
> > otherwise if /etc/filesystems contains an invalid filesystem name, it will
> > silently ignore all others.
>
> Good idea, implemented.
>
> ... all will be fixed in stable v2.22.2 release.
>
> Karel
>

Looks good,
thanks,
NeilBrown

Attachments:

signature.asc (828.00 B)