2009-01-17 02:30:11

by Mike Waychison

Subject: [PATCH v1 0/8] Deferred dput() and iput() -- reducing lock contention

We've noticed that a system can easily begin to livelock on
dcache_lock/inode_lock (specifically in atomic_dec_and_lock()) when a lot of
dentries are getting finalized at the same time (massive deletes and large
fdtable destructions are two paths I've seen cause problems).

This patchset is an attempt to reduce the locking overhead associated with
final dput() and final iput(). This is done by batching dentries and inodes
into per-CPU queues and processing them in 'parallel' to consolidate some of
the locking.
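
Roughly, the scheme looks like the sketch below (a simplified illustration,
not the actual patch code; NR_BATCH and the two helper functions are
placeholders used only for this sketch):

/* Sketch: defer dputs that would drop the count to zero into a per-CPU
 * batch, then finalize the whole batch under one dcache_lock acquisition. */
struct postponed_dentries {
	unsigned nr;
	struct dentry *dentries[NR_BATCH];
};
static DEFINE_PER_CPU(struct postponed_dentries, postponed_dentries);

void dput(struct dentry *dentry)
{
	/* Fast path: drop a reference without touching dcache_lock. */
	if (atomic_add_unless(&dentry->d_count, -1, 1))
		return;
	/* Count would hit zero: queue for batched finalization. */
	queue_on_this_cpu(dentry);	/* flushes via process_batch() when full */
}

static void process_batch(struct postponed_dentries *ppd)
{
	unsigned i;

	spin_lock(&dcache_lock);	/* one acquisition per batch */
	for (i = 0; i < ppd->nr; i++)
		finalize_one(ppd->dentries[i]);	/* unhash/kill or move to LRU */
	spin_unlock(&dcache_lock);
	ppd->nr = 0;
}

The real patches below use dynamically allocated per-CPU queues, handle the
allocation-failure and queue-full cases, and batch the subsequent iput()s in
the same manner.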

Besides various workload testing, I threw together a load (at the end of this
email) that causes massive fdtables (50K sockets by default) to get destroyed
on each cpu in the system. It also populates the dcache for procfs on those
tasks for good measure. Comparing lock_stat results (hardware is a Sun x4600
M2 populated with 8 4-core 2.3GHz packages (32 CPUs) + 128GiB RAM):

BEFORE PATCHSET:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

dcache_lock: 5666485 10086172 11.16 2274036.47 3821690090.32 11042789 14926411 2.78 42138.28 65887138.46
-----------
dcache_lock 3016578 [<ffffffff810bd7e7>] d_alloc+0x190/0x1e1
dcache_lock 1844569 [<ffffffff810bcdfc>] d_instantiate+0x2c/0x51
dcache_lock 4223784 [<ffffffff8119bd88>] _atomic_dec_and_lock+0x3c/0x5c
dcache_lock 207544 [<ffffffff810bc36a>] d_rehash+0x1b/0x43
-----------
dcache_lock 2305449 [<ffffffff810bcdfc>] d_instantiate+0x2c/0x51
dcache_lock 1954160 [<ffffffff810bd7e7>] d_alloc+0x190/0x1e1
dcache_lock 4169163 [<ffffffff8119bd88>] _atomic_dec_and_lock+0x3c/0x5c
dcache_lock 866247 [<ffffffff810bc36a>] d_rehash+0x1b/0x43

...............................................................................................................................................................................................

inode_lock: 202813 203064 12.91 376655.85 37696597.26 5136095 8001156 3.37 15142.77 5033144.13
----------
inode_lock 20192 [<ffffffff810bfdbc>] new_inode+0x27/0x63
inode_lock 1 [<ffffffff810c6f58>] sync_inode+0x19/0x39
inode_lock 5 [<ffffffff810c76b5>] __mark_inode_dirty+0xe2/0x165
inode_lock 2 [<ffffffff810c6e92>] __writeback_single_inode+0x1bf/0x26c
----------
inode_lock 20198 [<ffffffff810bfdbc>] new_inode+0x27/0x63
inode_lock 1 [<ffffffff810c76b5>] __mark_inode_dirty+0xe2/0x165
inode_lock 5 [<ffffffff810c70fc>] generic_sync_sb_inodes+0x33/0x31a
inode_lock 165016 [<ffffffff8119bd88>] _atomic_dec_and_lock+0x3c/0x5c

...............................................................................................................................................................................................

&sbsec->isec_lock: 34428 34431 16.77 67593.54 3780131.15 1415432 3200261 4.06 13357.96 1180726.21






AFTER PATCHSET [Note that inode_lock doesn't even show up anymore]:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

dcache_lock: 2732103 5254824 11.04 1314926.88 2187466928.10 5551173 9751332 2.43 78012.24 42830806.29
-----------
dcache_lock 3015590 [<ffffffff810bdb04>] d_alloc+0x190/0x1e1
dcache_lock 1966978 [<ffffffff810bd118>] d_instantiate+0x2c/0x51
dcache_lock 258657 [<ffffffff810bc3cd>] d_rehash+0x1b/0x43
dcache_lock 30 [<ffffffff810b7ad4>] __link_path_walk+0x42d/0x665
-----------
dcache_lock 2493589 [<ffffffff810bd118>] d_instantiate+0x2c/0x51
dcache_lock 1862921 [<ffffffff810bdb04>] d_alloc+0x190/0x1e1
dcache_lock 882054 [<ffffffff810bc3cd>] d_rehash+0x1b/0x43
dcache_lock 15504 [<ffffffff810bc5e1>] process_postponed_dentries+0x3e/0x29f

...............................................................................................................................................................................................


ChangeLog:
[v1] - Jan 16, 2009
- Initial version

Patches in series:

1) Deferred batching of dput()
2) Parallel dput()
3) Deferred batching of iput()
4) Fixing iput called from put_super path
5) Parallelize iput()
6) hugetlbfs drop_inode update
7) Make drop_caches flush pending dput()s and iput()s
8) Make the sync path drain dentries and inodes

Caveats:

* It's not clear to me how to make 'sync(2)' do what it should. Currently we
flush pending dputs and iputs before writing anything out, but presumably,
there may be hidden iputs in the do_sync path that really ought to be
synchronous.
* I don't like the changes in patch 4; it should probably be changed to
have the ->put_super callback paths call a synchronous version of iput()
(perhaps sync_iput()?, sketched below) instead of having to tag the inodes
with S_SYNCIPUT.
* cramfs, gfs2, ocfs2 need to be updated for the new drop_inode semantics.
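
To make the sync_iput() suggestion above concrete: a synchronous variant
could look something like the sketch below. This is hypothetical and not part
of the series; it is essentially the pre-patch final-iput path (compare
real_iput() in patch 3).

/* Hypothetical sync_iput(): finalize the inode immediately instead of
 * tagging it with S_SYNCIPUT and short-circuiting the deferred path. */
static void sync_iput(struct inode *inode)
{
	BUG_ON(inode->i_state == I_CLEAR);
	if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
		iput_final(inode);	/* iput_final() drops inode_lock */
}
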
---

fs/block_dev.c | 2
fs/dcache.c | 374 +++++++++++++++++++++++++++++++++++++----
fs/drop_caches.c | 1
fs/ext2/super.c | 2
fs/gfs2/glock.c | 2
fs/gfs2/ops_fstype.c | 2
fs/hugetlbfs/inode.c | 72 +++++---
fs/inode.c | 441 +++++++++++++++++++++++++++++++++++++++++-------
fs/ntfs/super.c | 2
fs/smbfs/inode.c | 2
fs/super.c | 6 -
fs/sync.c | 5 +
include/linux/dcache.h | 1
include/linux/fs.h | 18 ++
14 files changed, 787 insertions(+), 143 deletions(-)

--
// Copyright 2008 Google Inc. All Rights Reserved.
// Author: [email protected] (Mike Waychison)
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program. If not, see <http://www.gnu.org/licenses/>.
//
// Description: Meant to make a kernel go nuts by stressing dcache with
// implosion.

#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <dirent.h>
#include <fcntl.h>
#include <inttypes.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>
#include <stdio.h>
#include <time.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#include <iostream>
#include <string>

using std::cout;
using std::string;

static char *who_am_i;

static bool wait_on_kids = true;
static int files_per_cpu = 50000;
static int wait_before_slaughter = 0;

static const int kMagicSignal = SIGUSR2;

static const int kMaxCPUs = __NCPUBITS;

typedef struct { volatile int counter; } atomic_t;
#define LOCK_PREFIX "lock "
static inline void atomic_inc(atomic_t *v)
{
__asm__ __volatile__(
LOCK_PREFIX "incl %0"
:"=m" (v->counter)
:"m" (v->counter));
}
#define atomic_read(v) ((v)->counter)
#define atomic_set(v,i) (((v)->counter) = (i))

static void *shared_area;

static atomic_t *children_done;

#define Abort() DoAbort(__LINE__)
static void DoAbort(int line) {
cout << "Aborted at line " << line << " errno=" << errno << "\n";
exit(1);
}

static void InitSharedArea() {
shared_area = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED, -1, 0);
if (shared_area == MAP_FAILED)
Abort();
children_done = static_cast<atomic_t *>(shared_area);
atomic_set(children_done, 0);
}

static void get_cpu_bitmap(cpu_set_t *bitmap) {
CPU_ZERO(bitmap);
if (sched_getaffinity(0, sizeof(*bitmap), bitmap) < 0)
Abort();
}

static void run_child(int cpu) {
// Start by setting our own SIGTERM handler to avoid google3 fallout
signal(SIGTERM, SIG_DFL);

// Pin ourselves to the cpu
cpu_set_t cpumask;
CPU_ZERO(&cpumask);
CPU_SET(cpu, &cpumask);
if (sched_setaffinity(0, sizeof(cpumask), &cpumask) < 0)
Abort();

// Start off by creating tons of open sockets. Well just make dummy sockets
// for now.
printf("Making sockets\n");
for (int i = 0; i < files_per_cpu; i++) {
if (socket(PF_INET, SOCK_STREAM, 0) < 0) {
printf("Ran out of file descriptors? errno == %d\n", errno);
kill(getppid(), SIGABRT);
exit(0);
}
}
printf("Done making sockets\n");

// Now populate the proc_inode_cache
pid_t my_pid = getpid();
char fd_path[100];
sprintf(fd_path, "/proc/%d/fd", my_pid);
struct stat stat_dat2;
if (lstat(fd_path, &stat_dat2) < 0)
Abort();
cout << "Mode : " << stat_dat2.st_mode << "\n";
DIR *dir = opendir(fd_path);
if (dir == static_cast<DIR *>(NULL))
Abort();

printf("Done opening fd directory\n");
dirent *dirent;
while ((dirent = readdir(dir)) != NULL) {
// Now stat it to make sure it gets populated in dcache
char file_path[100];
sprintf(file_path, "/proc/%d/fd/%s", my_pid, dirent->d_name);
struct stat stat_dat;
if (lstat(file_path, &stat_dat) < 0)
Abort();
}
closedir(dir);
printf("Done looking at directory\n");

// Now tell the parent that we are done
atomic_inc(children_done);

// And wait around to die
for (;;) {
timeval tv = {1, 0};
// Poke the parent
if (atomic_read(children_done) != 0)
kill(getppid(), kMagicSignal);
select(0, NULL, NULL, NULL, &tv);
}

Abort();
}

static void handle_sigusr1(int signum) {
}

static void make_child(int cpu, pid_t *child_pids) {
// Set up signal so that we know when the child is done.
if (signal(kMagicSignal, handle_sigusr1) == SIG_ERR)
Abort();

pid_t pid;
if ((pid = fork()) < 0)
Abort();
if (pid == 0) {
run_child(cpu);
return;
}

// Parent
child_pids[cpu] = pid;

// Nothing left to do for this child
}

static void crank_up_filemax() {
int fd;
if ((fd = open("/proc/sys/fs/file-max", O_RDWR)) < 0)
Abort();
char msg[100];
sprintf(msg, "%d\n", files_per_cpu * kMaxCPUs);
write(fd, msg, strlen(msg));
close(fd);
}

static void UsageAndDie() {
cout << "Usage: Hurt a kernel with fd implosion\n";
cout << who_am_i << " [options]\n";
cout << " --dont_wait_on_kids : Exit before kids are dead\n";
cout << " --files_per_cpu <n> : How many files should be created per cpu"
" (default 50000)\n";
cout << " --wait_before_slaugther <secs> : Do we sleep before killing "
"children? (default 0)";
exit(1);
}

static int ParseInt(char *arg)
{
char *endp;
int ret;
ret = (int)strtol(arg, &endp, 0);
if (*endp) {
cout << "Argument '" << arg << "' not an integer\n";
UsageAndDie();
}
return ret;
}

static void ParseArgs(int argc, char **argv)
{
who_am_i = argv[0];
argc--;
argv++;

while (argc) {
if (!strcmp(argv[0], "--wait_on_kids")) {
if (argc < 2)
UsageAndDie();
wait_on_kids = ParseInt(argv[1]);
argc -= 2;
argv += 2;
continue;
}
if (!strcmp(argv[0], "--files_per_cpu")) {
if (argc < 2)
UsageAndDie();
files_per_cpu = ParseInt(argv[1]);
argc -= 2;
argv += 2;
continue;
}
if (!strcmp(argv[0], "--dont_wait_on_kids")) {
wait_on_kids = false;
argc -= 1;
argv += 1;
continue;
}
cout << "Don't understand option " << argv[0] << "\n";
UsageAndDie();
}
}
int main(int argc, char **argv) {
ParseArgs(argc, argv);
if (getuid() != 0) {
cout << "This program must be run as root\n";
Abort();
}

InitSharedArea();

crank_up_filemax();

// Up our count of allowable open files
struct rlimit new_limits = {files_per_cpu + 20,
files_per_cpu + 20};
if (setrlimit(RLIMIT_NOFILE, &new_limits) < 0)
Abort();

cpu_set_t cpu_bitmap;
get_cpu_bitmap(&cpu_bitmap);

pid_t child_pids[kMaxCPUs];

// Build all the children
int childCount = 0;
for (int i = 0; i < kMaxCPUs; i++) {
if (CPU_ISSET(i, &cpu_bitmap)) {
cout << "About to run child on cpu" << i << "\n";
make_child(i, child_pids);
childCount++;
}
}

// Wait for all the children to finish
for (;;) {
if (atomic_read(children_done) == childCount) {
atomic_set(children_done, 0);
break;
}
timeval tv = {1, 0};
select(0, NULL, NULL, NULL, &tv);
}

sleep(wait_before_slaughter);

cout << "About to start killing children\n";

timeval before_tv;
gettimeofday(&before_tv, NULL);

// Now kill all the children
for (int i = 0; i < kMaxCPUs; i++) {
if (CPU_ISSET(i, &cpu_bitmap)) {
kill(child_pids[i], SIGTERM);
}
}

// Wait on em
if (wait_on_kids) {
for (int i = 0; i < kMaxCPUs; i++) {
if (CPU_ISSET(i, &cpu_bitmap)) {
int status;
waitpid(child_pids[i], &status, 0);
}
}
}

timeval after_tv;
gettimeofday(&after_tv, NULL);

int seconds = after_tv.tv_sec - before_tv.tv_sec;
uint64_t millisecs = seconds * 1000;
millisecs += (after_tv.tv_usec - before_tv.tv_usec) / 1000;
seconds = millisecs / 1000;
millisecs %= 1000;

char buf[100];
sprintf(buf, "%d.%03" PRIu64, seconds, millisecs);
cout << "do_exit path for run took " << buf << " seconds\n";

return 0;
}


2009-01-17 02:30:58

by Mike Waychison

Subject: [PATCH v1 1/8] Deferred batching of dput()

This patch adds the notion of postponed dputs to the VFS. We do this by
introducing struct postponed_dentries, a data structure that maintains a list
of dentries that are pending a final dput.

Each CPU gets an on-heap allocated postponed_dentries structure that is
protected by disabling pre-emption, ensuring that it is only ever accessed
from its respective CPU. When a queue gets full, we allocate a new one to
replace it and swap them atomically, after which we release the previous
queue. If we fail to allocate a new queue, we fall back to a slow path and
process a single dentry at a time until the queue is empty.

The structure itself has three arrays embedded in it, forming two queues:

- Dentries pending dput.
- Dentries and their associated inodes pending dentry_iput.

We reuse the first queue to collect parent dentries as we discover them.

Currently, postponed dputs are still handled in a serialized fashion, but we
defer them into struct postponed_dentries. The lock consolidation will come in
a later patch.

Lastly, we introduce a way to flush any pending dput()s via dput_drain_all() to
ensure that all dentries are finalized before fs shutdown.

Signed-off-by: Mike Waychison <[email protected]>
---

fs/dcache.c | 289 +++++++++++++++++++++++++++++++++++++++++++-----
fs/super.c | 2
include/linux/dcache.h | 1
3 files changed, 261 insertions(+), 31 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 4547f66..ea6b8f0 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -32,6 +32,7 @@
#include <linux/seqlock.h>
#include <linux/swap.h>
#include <linux/bootmem.h>
+#include <linux/cpu.h>
#include "internal.h"

int sysctl_vfs_cache_pressure __read_mostly = 100;
@@ -182,6 +183,175 @@ static struct dentry *d_kill(struct dentry *dentry)
return parent;
}

+struct postponed_dentries {
+ unsigned size;
+ struct {
+ unsigned nr;
+ struct dentry **dentries;
+ } pending_dput;
+ struct {
+ unsigned nr;
+ struct dentry **dentries;
+ struct inode **inodes;
+ } pending_dentry_iput;
+};
+
+struct postponed_dentries_onstack {
+ struct postponed_dentries ppd;
+ struct dentry *dentry_pending_dput;
+ struct dentry *dentry_pending_dentry_iput;
+ struct inode *inode_pending_dentry_iput;
+};
+
+static struct postponed_dentries *init_ppd_onstack(
+ struct postponed_dentries_onstack *ppd_onstack)
+{
+ struct postponed_dentries *ppd;
+ ppd = &ppd_onstack->ppd;
+ ppd->size = 1;
+ ppd->pending_dput.nr = 0;
+ ppd->pending_dput.dentries = &ppd_onstack->dentry_pending_dput;
+ ppd->pending_dentry_iput.nr = 0;
+ ppd->pending_dentry_iput.dentries =
+ &ppd_onstack->dentry_pending_dentry_iput;
+ ppd->pending_dentry_iput.inodes =
+ &ppd_onstack->inode_pending_dentry_iput;
+ return ppd;
+}
+
+static unsigned postponed_dentries_per_page(void)
+{
+ return (PAGE_SIZE - sizeof(struct postponed_dentries)) /
+ (3 * sizeof(void *));
+}
+
+/* Allocate a postponed_dentries structure on the heap. */
+struct postponed_dentries *new_postponed_dentries(void)
+{
+ struct postponed_dentries *ppd;
+ struct page *page;
+
+ page = alloc_page(GFP_KERNEL);
+ if (!page)
+ return NULL;
+
+ ppd = page_address(page);
+
+ /* Create a set of three arrays immediately after the structure. */
+ ppd->size = postponed_dentries_per_page();
+ ppd->pending_dput.nr = 0;
+ ppd->pending_dput.dentries = (struct dentry **)(ppd + 1);
+ ppd->pending_dentry_iput.nr = 0;
+ ppd->pending_dentry_iput.dentries =
+ ppd->pending_dput.dentries + ppd->size;
+ ppd->pending_dentry_iput.inodes = (struct inode **)
+ (ppd->pending_dentry_iput.dentries + ppd->size);
+
+ return ppd;
+}
+
+static int pending_dput_full(struct postponed_dentries *ppd)
+{
+ return ppd->pending_dput.nr == ppd->size;
+}
+
+static void add_pending_dput(struct postponed_dentries *ppd,
+ struct dentry *dentry)
+{
+ ppd->pending_dput.dentries[ppd->pending_dput.nr++] = dentry;
+}
+
+static DEFINE_PER_CPU(struct postponed_dentries *, postponed_dentries);
+
+static int initialize_postponed_dentries(long cpu)
+{
+ struct postponed_dentries **pppd = &per_cpu(postponed_dentries, cpu);
+ *pppd = new_postponed_dentries();
+ if (!*pppd)
+ return 1;
+ return 0;
+}
+
+static void process_postponed_dentries(struct postponed_dentries *ppd);
+static void release_postponed_dentries(struct postponed_dentries *ppd)
+{
+ process_postponed_dentries(ppd);
+ free_page((unsigned long)ppd);
+}
+
+static int __cpuinit cpuup_callback(struct notifier_block *nfb,
+ unsigned long action, void *hcpu)
+{
+ long cpu = (long)hcpu;
+
+ switch (action) {
+ case CPU_UP_PREPARE:
+ if (initialize_postponed_dentries(cpu))
+ return NOTIFY_STOP;
+ break;
+ case CPU_DEAD:
+ release_postponed_dentries(per_cpu(postponed_dentries, cpu));
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata dentry_put_cache_notifier = {
+ &cpuup_callback, NULL, 0
+};
+
+static void real_dput(struct dentry *dentry)
+{
+ /* Legacy: */
+repeat:
+ spin_lock(&dcache_lock);
+ if (atomic_dec_and_test(&dentry->d_count)) {
+ spin_unlock(&dcache_lock);
+ return;
+ }
+
+ spin_lock(&dentry->d_lock);
+ if (atomic_read(&dentry->d_count)) {
+ spin_unlock(&dentry->d_lock);
+ spin_unlock(&dcache_lock);
+ return;
+ }
+
+ /*
+ * AV: ->d_delete() is _NOT_ allowed to block now.
+ */
+ if (dentry->d_op && dentry->d_op->d_delete) {
+ if (dentry->d_op->d_delete(dentry))
+ goto unhash_it;
+ }
+ /* Unreachable? Get rid of it */
+ if (d_unhashed(dentry))
+ goto kill_it;
+ if (list_empty(&dentry->d_lru)) {
+ dentry->d_flags |= DCACHE_REFERENCED;
+ dentry_lru_add(dentry);
+ }
+ spin_unlock(&dentry->d_lock);
+ spin_unlock(&dcache_lock);
+ return;
+
+unhash_it:
+ __d_drop(dentry);
+kill_it:
+ /* if dentry was on the d_lru list delete it from there */
+ dentry_lru_del(dentry);
+ dentry = d_kill(dentry);
+ if (dentry)
+ goto repeat;
+}
+
+static void process_postponed_dentries(struct postponed_dentries *ppd)
+{
+ unsigned i;
+
+ for (i = 0; i < ppd->pending_dput.nr; i++)
+ real_dput(ppd->pending_dput.dentries[i]);
+}
/*
* This is dput
*
@@ -199,6 +369,40 @@ static struct dentry *d_kill(struct dentry *dentry)
* Real recursion would eat up our stack space.
*/

+static void postpone_dput(struct dentry *dentry)
+{
+ struct postponed_dentries *ppd, *new_ppd;
+
+again:
+ ppd = get_cpu_var(postponed_dentries);
+ if (!pending_dput_full(ppd)) {
+ add_pending_dput(ppd, dentry);
+ put_cpu_var(postponed_dentries);
+ return;
+ }
+
+ /* need to flush out existing pending dentries. */
+ put_cpu_var(postponed_dentries);
+ /* Allocate more space.. */
+ new_ppd = new_postponed_dentries();
+ if (!new_ppd) {
+ /* Take the slow path, memory is low */
+ struct postponed_dentries_onstack ppd_onstack;
+ struct postponed_dentries *ppd;
+
+ ppd = init_ppd_onstack(&ppd_onstack);
+ add_pending_dput(ppd, dentry);
+ process_postponed_dentries(ppd);
+ return;
+ }
+ ppd = get_cpu_var(postponed_dentries);
+ __get_cpu_var(postponed_dentries) = new_ppd;
+ put_cpu_var(postponed_dentries);
+ process_postponed_dentries(ppd);
+ goto again;
+}
+
+
/*
* dput - release a dentry
* @dentry: dentry to release
@@ -216,45 +420,62 @@ void dput(struct dentry *dentry)
if (!dentry)
return;

-repeat:
if (atomic_read(&dentry->d_count) == 1)
might_sleep();
- if (!atomic_dec_and_lock(&dentry->d_count, &dcache_lock))
+ /* Decrement the count unless we would hit zero */
+ if (atomic_add_unless(&dentry->d_count, -1, 1))
return;
+ postpone_dput(dentry);
+}

- spin_lock(&dentry->d_lock);
- if (atomic_read(&dentry->d_count)) {
- spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
- return;
+/**
+ * dput_drain_slowpath - drain out the postponed dentries on this cpu
+ *
+ * Iterates through and loops until there are no dentries pending dput on the
+ * invoked CPU. Must be called with pre-emption disabled, but may re-enable
+ * pre-emption. Returns with pre-emption disabled. Caller is required to
+ * ensure that this thread will not change CPUs in the meantime.
+ */
+static void dput_drain_slowpath(void)
+{
+ struct postponed_dentries *ppd;
+
+ ppd = __get_cpu_var(postponed_dentries);
+ while (ppd->pending_dput.nr) {
+ struct postponed_dentries_onstack ppd_onstack;
+ struct postponed_dentries *tmp_ppd;
+ struct dentry *dentry;
+
+ dentry = ppd->pending_dput.dentries[--ppd->pending_dput.nr];
+
+ tmp_ppd = init_ppd_onstack(&ppd_onstack);
+ add_pending_dput(tmp_ppd, dentry);
+ put_cpu_var(postponed_dentries);
+ process_postponed_dentries(tmp_ppd);
+ ppd = get_cpu_var(postponed_dentries);
}
+}

- /*
- * AV: ->d_delete() is _NOT_ allowed to block now.
- */
- if (dentry->d_op && dentry->d_op->d_delete) {
- if (dentry->d_op->d_delete(dentry))
- goto unhash_it;
+static void dput_drain_per_cpu(struct work_struct *dummy)
+{
+ struct postponed_dentries *ppd, *new_ppd;
+
+ new_ppd = new_postponed_dentries();
+
+ ppd = get_cpu_var(postponed_dentries);
+ if (new_ppd) {
+ __get_cpu_var(postponed_dentries) = new_ppd;
+ put_cpu_var(postponed_dentries);
+ release_postponed_dentries(ppd);
+ } else {
+ dput_drain_slowpath();
+ put_cpu_var(postponed_dentries);
}
- /* Unreachable? Get rid of it */
- if (d_unhashed(dentry))
- goto kill_it;
- if (list_empty(&dentry->d_lru)) {
- dentry->d_flags |= DCACHE_REFERENCED;
- dentry_lru_add(dentry);
- }
- spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
- return;
+}

-unhash_it:
- __d_drop(dentry);
-kill_it:
- /* if dentry was on the d_lru list delete it from there */
- dentry_lru_del(dentry);
- dentry = d_kill(dentry);
- if (dentry)
- goto repeat;
+void dput_drain_all(void)
+{
+ schedule_on_each_cpu(dput_drain_per_cpu);
}

/**
@@ -2321,6 +2542,7 @@ void __init vfs_caches_init_early(void)
void __init vfs_caches_init(unsigned long mempages)
{
unsigned long reserve;
+ long cpu;

/* Base hash sizes on available memory, with a reserve equal to
150% of current kernel size */
@@ -2337,6 +2559,11 @@ void __init vfs_caches_init(unsigned long mempages)
mnt_init();
bdev_cache_init();
chrdev_init();
+
+ for_each_online_cpu(cpu)
+ if (initialize_postponed_dentries(cpu))
+ panic("Couldn't init postponed dentries\n");
+ register_cpu_notifier(&dentry_put_cache_notifier);
}

EXPORT_SYMBOL(d_alloc);
diff --git a/fs/super.c b/fs/super.c
index ed080c4..534840f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -292,6 +292,8 @@ void generic_shutdown_super(struct super_block *sb)
const struct super_operations *sop = sb->s_op;


+ dput_drain_all();
+
if (sb->s_root) {
shrink_dcache_for_umount(sb);
fsync_super(sb);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index c66d224..c9f7c95 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -362,6 +362,7 @@ static inline struct dentry *dget_parent(struct dentry *dentry)
}

extern void dput(struct dentry *);
+void dput_drain_all(void);

static inline int d_mountpoint(struct dentry *dentry)
{

2009-01-17 02:31:39

by Mike Waychison

Subject: [PATCH v1 2/8] Parallel dput()

Implementation of batched dput(), which grabs dcache_lock only once per
batch of dentries.

process_postponed_dentries() now operates on dentries in two phases. The first
phase is done with dcache_lock held, in which postponed_dput() takes care of
transitioning the dentry to the LRU, dropping it, or killing it. Note that we
have to enqueue the dentry and its inode separately because the dentry becomes
negative under the lock.

Parents are enqueued and then processed on a subsequent iteration of the two
passes, rather than recursively, because we wish to avoid recursion eating
stack space. When enqueueing dentries for the second phase, we place the child
dentry on a second list (postponed_dentries->pending_dentry_iput.dentries)
because the original list (postponed_dentries->pending_dput.dentries) is
re-used to collect parents requiring dput.

The second phase handles what is left over from the original dentry_iput()
(namely fsnotify, iput() and d_free()).

Signed-off-by: Mike Waychison <[email protected]>
---

fs/dcache.c | 156 +++++++++++++++++++++++++++++++++++++++++++----------------
1 files changed, 115 insertions(+), 41 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index ea6b8f0..9760bdb 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -157,32 +157,6 @@ static void dentry_lru_del_init(struct dentry *dentry)
}
}

-/**
- * d_kill - kill dentry and return parent
- * @dentry: dentry to kill
- *
- * The dentry must already be unhashed and removed from the LRU.
- *
- * If this is the root of the dentry tree, return NULL.
- */
-static struct dentry *d_kill(struct dentry *dentry)
- __releases(dentry->d_lock)
- __releases(dcache_lock)
-{
- struct dentry *parent;
-
- list_del(&dentry->d_u.d_child);
- dentry_stat.nr_dentry--; /* For d_free, below */
- /*drops the locks, at that point nobody can reach this dentry */
- dentry_iput(dentry);
- if (IS_ROOT(dentry))
- parent = NULL;
- else
- parent = dentry->d_parent;
- d_free(dentry);
- return parent;
-}
-
struct postponed_dentries {
unsigned size;
struct {
@@ -261,6 +235,15 @@ static void add_pending_dput(struct postponed_dentries *ppd,
ppd->pending_dput.dentries[ppd->pending_dput.nr++] = dentry;
}

+static void add_pending_dentry_iput(struct postponed_dentries *ppd,
+ struct dentry *dentry,
+ struct inode *inode)
+{
+ unsigned nr = ppd->pending_dentry_iput.nr++;
+ ppd->pending_dentry_iput.dentries[nr] = dentry;
+ ppd->pending_dentry_iput.inodes[nr] = inode;
+}
+
static DEFINE_PER_CPU(struct postponed_dentries *, postponed_dentries);

static int initialize_postponed_dentries(long cpu)
@@ -300,20 +283,69 @@ static struct notifier_block __cpuinitdata dentry_put_cache_notifier = {
&cpuup_callback, NULL, 0
};

-static void real_dput(struct dentry *dentry)
+/**
+ * d_kill - kill dentry and return parent
+ * @dentry: dentry to kill
+ *
+ * The dentry must already be unhashed and removed from the LRU.
+ *
+ * If this is the root of the dentry tree, return NULL.
+ */
+static struct dentry *d_kill(struct dentry *dentry)
+ __releases(dentry->d_lock)
+ __releases(dcache_lock)
{
- /* Legacy: */
-repeat:
- spin_lock(&dcache_lock);
- if (atomic_dec_and_test(&dentry->d_count)) {
- spin_unlock(&dcache_lock);
- return;
+ struct dentry *parent;
+
+ list_del(&dentry->d_u.d_child);
+ dentry_stat.nr_dentry--; /* For d_free, below */
+ /*drops the locks, at that point nobody can reach this dentry */
+ dentry_iput(dentry);
+ parent = dentry->d_parent;
+ d_free(dentry);
+ return dentry == parent ? NULL : parent;
+}
+
+/**
+ * d_kill_batched - kill dentry and add parent to delayed batch.
+ * @ppd: postponed dentries to enqueue dentry_iput and parent if any.
+ * @dentry: dentry to kill
+ *
+ * The dentry must already be unhashed and removed from the LRU.
+ *
+ * Called with dentry->d_lock and dcache_lock held. Drops dentry->d_lock.
+ */
+static void d_kill_batched(struct postponed_dentries *ppd,
+ struct dentry *dentry)
+ __releases(dentry->d_lock)
+{
+ struct inode *inode = dentry->d_inode;
+ struct dentry *parent;
+
+ list_del(&dentry->d_u.d_child);
+ dentry_stat.nr_dentry--; /* For d_free, below */
+ /* Open coded first half to dentry_iput */
+ if (inode) {
+ dentry->d_inode = NULL;
+ list_del_init(&dentry->d_alias);
}
+ spin_unlock(&dentry->d_lock);
+ add_pending_dentry_iput(ppd, dentry, inode);
+ parent = dentry->d_parent;
+ if (parent && parent != dentry)
+ add_pending_dput(ppd, parent);
+}
+
+static void postponed_dput(struct postponed_dentries *ppd,
+ struct dentry *dentry)
+{
+ if (!atomic_dec_and_test(&dentry->d_count))
+ return;

spin_lock(&dentry->d_lock);
- if (atomic_read(&dentry->d_count)) {
+ if (atomic_read(&dentry->d_count) != 0) {
+ /* Raced -- dentry still in use */
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
return;
}

@@ -332,7 +364,6 @@ repeat:
dentry_lru_add(dentry);
}
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
return;

unhash_it:
@@ -340,18 +371,61 @@ unhash_it:
kill_it:
/* if dentry was on the d_lru list delete it from there */
dentry_lru_del(dentry);
- dentry = d_kill(dentry);
- if (dentry)
- goto repeat;
+ d_kill_batched(ppd, dentry);
+}
+
+static void postponed_dentry_iput(struct postponed_dentries *ppd,
+ struct dentry *dentry,
+ struct inode *inode)
+{
+ if (inode) {
+ if (!inode->i_nlink)
+ fsnotify_inoderemove(inode);
+ if (dentry->d_op && dentry->d_op->d_iput)
+ dentry->d_op->d_iput(dentry, inode);
+ else
+ iput(inode);
+ }
+ d_free(dentry);
}

static void process_postponed_dentries(struct postponed_dentries *ppd)
{
unsigned i;
+ unsigned nr;

- for (i = 0; i < ppd->pending_dput.nr; i++)
- real_dput(ppd->pending_dput.dentries[i]);
+repeat:
+ /*
+ * Begin by making sure that the list of dentries queued for
+ * dentry_iput is empty.
+ */
+ ppd->pending_dentry_iput.nr = 0;
+
+ /*
+ * Save off the length of the queue of dentries queued as we re-use the
+ * array in-place as we discover parent dentries in the first pass.
+ */
+ nr = ppd->pending_dput.nr;
+ ppd->pending_dput.nr = 0;
+
+ /* First pass: we remove the dentries from the dcache. */
+ spin_lock(&dcache_lock);
+ for (i = 0; i < nr; i++)
+ postponed_dput(ppd, ppd->pending_dput.dentries[i]);
+ spin_unlock(&dcache_lock);
+
+ /* Second pass: Go through and process anything pending dentry_iput */
+ for (i = 0; i < ppd->pending_dentry_iput.nr; i++) {
+ postponed_dentry_iput(ppd,
+ ppd->pending_dentry_iput.dentries[i],
+ ppd->pending_dentry_iput.inodes[i]);
+ }
+
+ /* Did we discover any parent dentries? */
+ if (ppd->pending_dput.nr)
+ goto repeat;
}
+
/*
* This is dput
*

2009-01-17 02:32:19

by Mike Waychison

Subject: [PATCH v1 3/8] Deferred batching of iput()

This patch adds the postponed_inodes structure which is used to batch iputs so
that they can later be processed in parallel.

Each CPU gets an on-heap allocated postponed_inodes structure that is protected
by disabling pre-emption, ensuring that it is only ever accessed from its
respective CPU.

The idea is to enqueue inodes into a heap-allocated postponed_inodes structure
until it fills up. When we find the queue full, we try to allocate a new one to
replace it and then process the full queue. We handle page allocation errors by
processing the pending inode in place using a dummy on-stack structure (see
postpone_iput() and iput_drain_cpu_slowpath()).

Draining of the pending iput()s is tacked onto the draining of pending dput()s
instead of being kept separate so that we don't have to schedule work on all
CPUs twice.

Processing of the actual final iput is still serial and will be addressed in a
later patch.

Signed-off-by: Mike Waychison <[email protected]>
---

fs/dcache.c | 1
fs/inode.c | 211 ++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 2
3 files changed, 212 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 9760bdb..2c2ac62 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -545,6 +545,7 @@ static void dput_drain_per_cpu(struct work_struct *dummy)
dput_drain_slowpath();
put_cpu_var(postponed_dentries);
}
+ iput_drain_cpu();
}

void dput_drain_all(void)
diff --git a/fs/inode.c b/fs/inode.c
index 913ab2d..56cf6e2 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -23,6 +23,7 @@
#include <linux/inotify.h>
#include <linux/mount.h>
#include <linux/async.h>
+#include <linux/cpu.h>

/*
* This is needed for the following functions:
@@ -384,6 +385,13 @@ int invalidate_inodes(struct super_block * sb)
LIST_HEAD(throw_away);

mutex_lock(&iprune_mutex);
+
+ /* We have to drain any pending iput()s as they will affect whether we
+ * see inodes as busy or not (consider inodes put in put_super()). We
+ * also have to do this while under iprune_mutex otherwise
+ * shrink_icache_memory may stumble in and race with us. */
+ iput_drain_all();
+
spin_lock(&inode_lock);
inotify_unmount_inodes(&sb->s_inodes);
busy = invalidate_list(&sb->s_inodes, &throw_away);
@@ -1242,6 +1250,114 @@ static inline void iput_final(struct inode *inode)
drop(inode);
}

+static void real_iput(struct inode *inode)
+{
+ if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
+ iput_final(inode);
+}
+
+struct postponed_inodes {
+ unsigned size;
+ unsigned nr;
+ struct inode **inodes;
+};
+
+static DEFINE_PER_CPU(struct postponed_inodes *, postponed_inodes);
+
+static unsigned postponed_inodes_per_page(void)
+{
+ return (PAGE_SIZE - sizeof(struct postponed_inodes)) /
+ (sizeof(struct inode *));
+}
+
+static struct postponed_inodes *new_postponed_inodes(void)
+{
+ struct postponed_inodes *ppi;
+ struct page *page;
+
+ page = alloc_page(GFP_KERNEL);
+ if (!page)
+ return NULL;
+
+ ppi = page_address(page);
+
+ ppi->size = postponed_inodes_per_page();
+ ppi->nr = 0;
+ ppi->inodes = (struct inode **)(ppi + 1);
+
+ return ppi;
+}
+
+struct postponed_inodes_onstack {
+ struct postponed_inodes ppi;
+ struct inode *inode;
+};
+
+static struct postponed_inodes *init_ppi_onstack(
+ struct postponed_inodes_onstack *ppi_onstack)
+{
+ struct postponed_inodes *ppi;
+ ppi = &ppi_onstack->ppi;
+ ppi->size = 1;
+ ppi->nr = 0;
+ ppi->inodes = &ppi_onstack->inode;
+ return ppi;
+}
+
+static int pending_iput_full(struct postponed_inodes *ppi)
+{
+ return ppi->nr == ppi->size;
+}
+
+static void add_pending_iput(struct postponed_inodes *ppi, struct inode *inode)
+{
+ ppi->inodes[ppi->nr++] = inode;
+}
+
+static void process_postponed_inodes(struct postponed_inodes *ppi)
+{
+ unsigned i;
+ for (i = 0; i < ppi->nr; i++)
+ real_iput(ppi->inodes[i]);
+}
+
+static void release_postponed_inodes(struct postponed_inodes *ppi)
+{
+ process_postponed_inodes(ppi);
+ free_page((unsigned long)ppi);
+}
+
+static void postpone_iput(struct inode *inode)
+{
+ struct postponed_inodes *ppi, *new_ppi;
+
+again:
+ ppi = get_cpu_var(postponed_inodes);
+ if (!pending_iput_full(ppi)) {
+ add_pending_iput(ppi, inode);
+ put_cpu_var(postponed_inodes);
+ return;
+ }
+
+ /* Need to flush out existing pending inodes */
+ put_cpu_var(postponed_inodes);
+ /* Allocate more space.. */
+ new_ppi = new_postponed_inodes();
+ if (!new_ppi) {
+ struct postponed_inodes_onstack ppi_onstack;
+
+ ppi = init_ppi_onstack(&ppi_onstack);
+ add_pending_iput(ppi, inode);
+ process_postponed_inodes(ppi);
+ return;
+ }
+ ppi = get_cpu_var(postponed_inodes);
+ __get_cpu_var(postponed_inodes) = new_ppi;
+ put_cpu_var(postponed_inodes);
+ process_postponed_inodes(ppi);
+ goto again;
+}
+
/**
* iput - put an inode
* @inode: inode to put
@@ -1256,8 +1372,8 @@ void iput(struct inode *inode)
if (inode) {
BUG_ON(inode->i_state == I_CLEAR);

- if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
- iput_final(inode);
+ if (!atomic_add_unless(&inode->i_count, -1, 1))
+ postpone_iput(inode);
}
}

@@ -1468,6 +1584,88 @@ static int __init set_ihash_entries(char *str)
}
__setup("ihash_entries=", set_ihash_entries);

+static int __cpuinit cpuup_callback(struct notifier_block *nfb,
+ unsigned long action, void *hcpu)
+{
+ long cpu = (long)hcpu;
+ struct postponed_inodes **pppi = &per_cpu(postponed_inodes, cpu);
+
+ switch (action) {
+ case CPU_UP_PREPARE:
+ *pppi = new_postponed_inodes();
+ if (*pppi == NULL)
+ return NOTIFY_STOP;
+ break;
+ case CPU_DEAD:
+ release_postponed_inodes(*pppi);
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata inode_put_cache_notifier = {
+ &cpuup_callback, NULL, 0
+};
+
+/**
+ * iput_drain_cpu_slowpath - drain out the postponed inodes on this cpu
+ *
+ * Iterates through and loops until all pending inodes on this cpu are iput.
+ * Must be called with pre-emption disabled, but may re-enable pre-emption.
+ * Returns with pre-emption disabled. Caller is required to ensure that this
+ * thread is affine to the cpu.
+ */
+static void iput_drain_cpu_slowpath(void)
+{
+ struct postponed_inodes *ppi;
+
+ ppi = __get_cpu_var(postponed_inodes);
+ while (ppi->nr) {
+ struct postponed_inodes_onstack ppi_onstack;
+ struct postponed_inodes *tmp_ppi;
+ struct inode *inode;
+
+ inode = ppi->inodes[--ppi->nr];
+
+ tmp_ppi = init_ppi_onstack(&ppi_onstack);
+ add_pending_iput(tmp_ppi, inode);
+ put_cpu_var(postponed_inodes);
+ process_postponed_inodes(tmp_ppi);
+ ppi = get_cpu_var(postponed_inodes);
+ }
+}
+
+/**
+ * iput_drain_cpu - Drain all pending iputs on this CPU
+ *
+ * Must be called affine to the CPU.
+ */
+void iput_drain_cpu(void)
+{
+ struct postponed_inodes *ppi, *new_ppi;
+
+ new_ppi = new_postponed_inodes();
+ ppi = get_cpu_var(postponed_inodes);
+ if (new_ppi) {
+ __get_cpu_var(postponed_inodes) = new_ppi;
+ put_cpu_var(postponed_inodes);
+ release_postponed_inodes(ppi);
+ } else {
+ iput_drain_cpu_slowpath();
+ put_cpu_var(postponed_inodes);
+ }
+}
+
+static void iput_drain_per_cpu_cb(struct work_struct *dummy)
+{
+ iput_drain_cpu();
+}
+
+void iput_drain_all(void)
+{
+ schedule_on_each_cpu(iput_drain_per_cpu_cb);
+}
+
/*
* Initialize the waitqueues and inode hash table.
*/
@@ -1498,6 +1696,7 @@ void __init inode_init_early(void)
void __init inode_init(void)
{
int loop;
+ long cpu;

/* inode slab cache */
inode_cachep = kmem_cache_create("inode_cache",
@@ -1508,6 +1707,14 @@ void __init inode_init(void)
init_once);
register_shrinker(&icache_shrinker);

+ for_each_online_cpu(cpu) {
+ struct postponed_inodes *ppi = new_postponed_inodes();
+ if (!ppi)
+ panic("Couldn't allocate initial iput caches\n");
+ per_cpu(postponed_inodes, cpu) = ppi;
+ }
+ register_cpu_notifier(&inode_put_cache_notifier);
+
/* Hash may have been set up in inode_init_early */
if (!hashdist)
return;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6022f44..47eaa71 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1908,6 +1908,8 @@ extern ino_t iunique(struct super_block *, ino_t);
extern int inode_needs_sync(struct inode *inode);
extern void generic_delete_inode(struct inode *inode);
extern void generic_drop_inode(struct inode *inode);
+extern void iput_drain_cpu(void);
+extern void iput_drain_all(void);

extern struct inode *ilookup5_nowait(struct super_block *sb,
unsigned long hashval, int (*test)(struct inode *, void *),

2009-01-17 02:32:49

by Mike Waychison

Subject: [PATCH v1 4/8] Fixing iput() called from put_super path

Delaying iput() introduces a problem when it comes to unmounting. The problem
lies in fs-specific put_super callbacks that issue iput()s and assume that the
iputs on their final inodes (metadata inodes and the like) complete
synchronously; with deferred iput() the super_block may go away before those
inodes are finalized.

This patch marks all inodes in a super_block with S_SYNCIPUT whenever we see
that the inodes are being invalidated (for unmount and the like), which
short-circuits the deferred path and performs a synchronous iput instead.

Signed-off-by: Mike Waychison <[email protected]>
---

fs/block_dev.c | 2 +-
fs/ext2/super.c | 2 +-
fs/gfs2/glock.c | 2 +-
fs/gfs2/ops_fstype.c | 2 +-
fs/inode.c | 19 ++++++++++++++-----
fs/ntfs/super.c | 2 +-
fs/smbfs/inode.c | 2 +-
fs/super.c | 4 ++--
include/linux/fs.h | 3 ++-
9 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index b3c1eff..eb6517e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1400,7 +1400,7 @@ int __invalidate_device(struct block_device *bdev)
* hold).
*/
shrink_dcache_sb(sb);
- res = invalidate_inodes(sb);
+ res = invalidate_inodes(sb, 1);
drop_super(sb);
}
invalidate_bdev(bdev);
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index da8bdea..190dfc9 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -1185,7 +1185,7 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
es = sbi->s_es;
if (((sbi->s_mount_opt & EXT2_MOUNT_XIP) !=
(old_mount_opt & EXT2_MOUNT_XIP)) &&
- invalidate_inodes(sb))
+ invalidate_inodes(sb, 0))
ext2_warning(sb, __func__, "busy inodes while remounting "\
"xip remain in cache (no functional problem)");
if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY))
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 6b983ae..4d95373 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1574,7 +1574,7 @@ void gfs2_gl_hash_clear(struct gfs2_sbd *sdp)
}

down_write(&gfs2_umount_flush_sem);
- invalidate_inodes(sdp->sd_vfs);
+ invalidate_inodes(sdp->sd_vfs, 1);
up_write(&gfs2_umount_flush_sem);
msleep(10);
}
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index f91eebd..5099ca5 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1205,7 +1205,7 @@ fail_locking:
fail_lm:
gfs2_gl_hash_clear(sdp);
gfs2_lm_unmount(sdp);
- while (invalidate_inodes(sb))
+ while (invalidate_inodes(sb, 1))
yield();
fail_sys:
gfs2_sys_fs_del(sdp);
diff --git a/fs/inode.c b/fs/inode.c
index 56cf6e2..9aa574d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -335,7 +335,8 @@ static void dispose_list(struct list_head *head)
/*
* Invalidate all inodes for a device.
*/
-static int invalidate_list(struct list_head *head, struct list_head *dispose)
+static int invalidate_list(struct list_head *head, struct list_head *dispose,
+ int umount)
{
struct list_head *next;
int busy = 0, count = 0;
@@ -358,6 +359,8 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
break;
inode = list_entry(tmp, struct inode, i_sb_list);
invalidate_inode_buffers(inode);
+ if (umount)
+ inode->i_flags |= S_SYNCIPUT;
if (!atomic_read(&inode->i_count)) {
list_move(&inode->i_list, dispose);
inode->i_state |= I_FREEING;
@@ -374,12 +377,13 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
/**
* invalidate_inodes - discard the inodes on a device
* @sb: superblock
+ * @umount: are we unmounting?
*
* Discard all of the inodes for a given superblock. If the discard
* fails because there are busy inodes then a non zero value is returned.
* If the discard is successful all the inodes have been discarded.
*/
-int invalidate_inodes(struct super_block * sb)
+int invalidate_inodes(struct super_block *sb, int umount)
{
int busy;
LIST_HEAD(throw_away);
@@ -394,7 +398,7 @@ int invalidate_inodes(struct super_block * sb)

spin_lock(&inode_lock);
inotify_unmount_inodes(&sb->s_inodes);
- busy = invalidate_list(&sb->s_inodes, &throw_away);
+ busy = invalidate_list(&sb->s_inodes, &throw_away, umount);
spin_unlock(&inode_lock);

dispose_list(&throw_away);
@@ -1372,8 +1376,12 @@ void iput(struct inode *inode)
if (inode) {
BUG_ON(inode->i_state == I_CLEAR);

- if (!atomic_add_unless(&inode->i_count, -1, 1))
- postpone_iput(inode);
+ if (!(inode->i_flags & S_SYNCIPUT)) {
+ if (!atomic_add_unless(&inode->i_count, -1, 1))
+ postpone_iput(inode);
+ } else {
+ real_iput(inode);
+ }
}
}

@@ -1663,6 +1671,7 @@ static void iput_drain_per_cpu_cb(struct work_struct *dummy)

void iput_drain_all(void)
{
+ /* TODO: error checking? */
schedule_on_each_cpu(iput_drain_per_cpu_cb);
}

diff --git a/fs/ntfs/super.c b/fs/ntfs/super.c
index 4a46743..04220f3 100644
--- a/fs/ntfs/super.c
+++ b/fs/ntfs/super.c
@@ -3051,7 +3051,7 @@ iput_tmp_ino_err_out_now:
* method again... FIXME: Do we need to do this twice now because of
* attribute inodes? I think not, so leave as is for now... (AIA)
*/
- if (invalidate_inodes(sb)) {
+ if (invalidate_inodes(sb, 1)) {
ntfs_error(sb, "Busy inodes left. This is most likely a NTFS "
"driver bug.");
/* Copied from fs/super.c. I just love this message. (-; */
diff --git a/fs/smbfs/inode.c b/fs/smbfs/inode.c
index fc27fbf..4def57a 100644
--- a/fs/smbfs/inode.c
+++ b/fs/smbfs/inode.c
@@ -229,7 +229,7 @@ smb_invalidate_inodes(struct smb_sb_info *server)
{
VERBOSE("\n");
shrink_dcache_sb(SB_of(server));
- invalidate_inodes(SB_of(server));
+ invalidate_inodes(SB_of(server), 1);
}

/*
diff --git a/fs/super.c b/fs/super.c
index 534840f..244b6f5 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -306,7 +306,7 @@ void generic_shutdown_super(struct super_block *sb)
async_synchronize_full_special(&sb->s_async_list);

/* bad name - it should be evict_inodes() */
- invalidate_inodes(sb);
+ invalidate_inodes(sb, 1);
lock_kernel();

if (sop->write_super && sb->s_dirt)
@@ -315,7 +315,7 @@ void generic_shutdown_super(struct super_block *sb)
sop->put_super(sb);

/* Forget any remaining inodes */
- if (invalidate_inodes(sb)) {
+ if (invalidate_inodes(sb, 1)) {
printk("VFS: Busy inodes after unmount of %s. "
"Self-destruct in 5 seconds. Have a nice day...\n",
sb->s_id);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 47eaa71..aa90620 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -161,6 +161,7 @@ struct inodes_stat_t {
#define S_NOCMTIME 128 /* Do not update file c/mtime */
#define S_SWAPFILE 256 /* Do not truncate: swapon got its bmaps */
#define S_PRIVATE 512 /* Inode is fs-internal */
+#define S_SYNCIPUT 1024 /* Filesystem going down, use sync iput() */

/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -1805,7 +1806,7 @@ extern int check_disk_change(struct block_device *);
extern int __invalidate_device(struct block_device *);
extern int invalidate_partition(struct gendisk *, int);
#endif
-extern int invalidate_inodes(struct super_block *);
+extern int invalidate_inodes(struct super_block *, int umount);
unsigned long __invalidate_mapping_pages(struct address_space *mapping,
pgoff_t start, pgoff_t end,
bool be_atomic);

2009-01-17 02:33:24

by Mike Waychison

Subject: [PATCH v1 5/8] Parallelize iput()

Now that we have deferred batching of iputs, we can parallelize the
finalization of inodes to minimize the grabbing of inode_lock.

iput() actually grabs the inode_lock twice, as it needs to transition the
inode into a finalizing state (I_FREEING, I_WILL_FREE) before it can start
flushing the inode out. It then needs to grab the lock again to unhash the
inode before releasing the lock and freeing it.

This patch converts the drop_inode callbacks in the VFS to this four-phase
approach. I opted to introduce a phase enum and have the callbacks switch on
it rather than creating three new callbacks each.
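
To illustrate the resulting contract with a schematic example ('examplefs' is
hypothetical; the real conversions are generic_delete_inode() and
generic_forget_inode() below, plus hugetlbfs in patch 6): DIP_START_FREEING
and DIP_UNHASHING are called with inode_lock held, DIP_DELETING and
DIP_DESTROYING without it, and a non-zero return from DIP_START_FREEING
cancels finalization of that inode.

static int examplefs_drop_inode(struct inode *inode,
				enum drop_inode_phase phase)
{
	switch (phase) {
	case DIP_START_FREEING:		/* inode_lock held */
		list_del_init(&inode->i_list);
		list_del_init(&inode->i_sb_list);
		inode->i_state |= I_FREEING;
		inodes_stat.nr_inodes--;
		break;
	case DIP_DELETING:		/* inode_lock dropped; may block */
		truncate_inode_pages(&inode->i_data, 0);
		clear_inode(inode);
		break;
	case DIP_UNHASHING:		/* inode_lock held again */
		hlist_del_init(&inode->i_hash);
		break;
	case DIP_DESTROYING:		/* no locks; free the in-core inode */
		wake_up_inode(inode);
		destroy_inode(inode);
		break;
	}
	return 0;	/* non-zero from DIP_START_FREEING cancels finalization */
}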

Signed-off-by: Mike Waychison <[email protected]>
---

fs/inode.c | 219 +++++++++++++++++++++++++++++++++++++---------------
include/linux/fs.h | 13 ++-
2 files changed, 167 insertions(+), 65 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 9aa574d..e19c6ea 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1151,71 +1151,87 @@ EXPORT_SYMBOL(remove_inode_hash);
* I_FREEING is set so that no-one will take a new reference to the inode while
* it is being deleted.
*/
-void generic_delete_inode(struct inode *inode)
+int generic_delete_inode(struct inode *inode, enum drop_inode_phase phase)
{
const struct super_operations *op = inode->i_sb->s_op;

- list_del_init(&inode->i_list);
- list_del_init(&inode->i_sb_list);
- inode->i_state |= I_FREEING;
- inodes_stat.nr_inodes--;
- spin_unlock(&inode_lock);
-
- security_inode_delete(inode);
-
- if (op->delete_inode) {
- void (*delete)(struct inode *) = op->delete_inode;
- if (!is_bad_inode(inode))
- DQUOT_INIT(inode);
- /* Filesystems implementing their own
- * s_op->delete_inode are required to call
- * truncate_inode_pages and clear_inode()
- * internally */
- delete(inode);
- } else {
- truncate_inode_pages(&inode->i_data, 0);
- clear_inode(inode);
+ switch (phase) {
+ case DIP_START_FREEING:
+ list_del_init(&inode->i_list);
+ list_del_init(&inode->i_sb_list);
+ inode->i_state |= I_FREEING;
+ inodes_stat.nr_inodes--;
+ break;
+ case DIP_DELETING:
+ security_inode_delete(inode);
+ if (op->delete_inode) {
+ void (*delete)(struct inode *) = op->delete_inode;
+ if (!is_bad_inode(inode))
+ DQUOT_INIT(inode);
+ /* Filesystems implementing their own
+ * s_op->delete_inode are required to call
+ * truncate_inode_pages and clear_inode()
+ * internally */
+ delete(inode);
+ } else {
+ truncate_inode_pages(&inode->i_data, 0);
+ clear_inode(inode);
+ }
+ break;
+ case DIP_UNHASHING:
+ hlist_del_init(&inode->i_hash);
+ break;
+ case DIP_DESTROYING:
+ wake_up_inode(inode);
+ BUG_ON(inode->i_state != I_CLEAR);
+ destroy_inode(inode);
+ break;
}
- spin_lock(&inode_lock);
- hlist_del_init(&inode->i_hash);
- spin_unlock(&inode_lock);
- wake_up_inode(inode);
- BUG_ON(inode->i_state != I_CLEAR);
- destroy_inode(inode);
+ return 0;
}

EXPORT_SYMBOL(generic_delete_inode);

-static void generic_forget_inode(struct inode *inode)
+static int generic_forget_inode(struct inode *inode,
+ enum drop_inode_phase phase)
{
struct super_block *sb = inode->i_sb;

- if (!hlist_unhashed(&inode->i_hash)) {
- if (!(inode->i_state & (I_DIRTY|I_SYNC)))
- list_move(&inode->i_list, &inode_unused);
- inodes_stat.nr_unused++;
- if (sb->s_flags & MS_ACTIVE) {
- spin_unlock(&inode_lock);
- return;
+ switch (phase) {
+ case DIP_START_FREEING:
+ if (!hlist_unhashed(&inode->i_hash)) {
+ if (!(inode->i_state & (I_DIRTY|I_SYNC)))
+ list_move(&inode->i_list, &inode_unused);
+ inodes_stat.nr_unused++;
+ if (sb->s_flags & MS_ACTIVE)
+ return 1;
+ inode->i_state |= I_WILL_FREE;
}
- inode->i_state |= I_WILL_FREE;
- spin_unlock(&inode_lock);
- write_inode_now(inode, 1);
- spin_lock(&inode_lock);
- inode->i_state &= ~I_WILL_FREE;
- inodes_stat.nr_unused--;
- hlist_del_init(&inode->i_hash);
+ break;
+ case DIP_DELETING:
+ if (!hlist_unhashed(&inode->i_hash))
+ write_inode_now(inode, 1);
+ break;
+ case DIP_UNHASHING:
+ if (!hlist_unhashed(&inode->i_hash)) {
+ inode->i_state &= ~I_WILL_FREE;
+ inodes_stat.nr_unused--;
+ hlist_del_init(&inode->i_hash);
+ }
+ list_del_init(&inode->i_list);
+ list_del_init(&inode->i_sb_list);
+ inode->i_state |= I_FREEING;
+ inodes_stat.nr_inodes--;
+ break;
+ case DIP_DESTROYING:
+ if (inode->i_data.nrpages)
+ truncate_inode_pages(&inode->i_data, 0);
+ clear_inode(inode);
+ wake_up_inode(inode);
+ destroy_inode(inode);
+ break;
}
- list_del_init(&inode->i_list);
- list_del_init(&inode->i_sb_list);
- inode->i_state |= I_FREEING;
- inodes_stat.nr_inodes--;
- spin_unlock(&inode_lock);
- if (inode->i_data.nrpages)
- truncate_inode_pages(&inode->i_data, 0);
- clear_inode(inode);
- wake_up_inode(inode);
- destroy_inode(inode);
+ return 0;
}

/*
@@ -1223,16 +1239,34 @@ static void generic_forget_inode(struct inode *inode)
* inode when the usage count drops to zero, and
* i_nlink is zero.
*/
-void generic_drop_inode(struct inode *inode)
+int generic_drop_inode(struct inode *inode, enum drop_inode_phase phase)
{
if (!inode->i_nlink)
- generic_delete_inode(inode);
+ return generic_delete_inode(inode, phase);
else
- generic_forget_inode(inode);
+ return generic_forget_inode(inode, phase);
}

EXPORT_SYMBOL_GPL(generic_drop_inode);

+
+typedef int (*drop_op_t)(struct inode *inode, enum drop_inode_phase phase);
+static drop_op_t drop_op(struct inode *inode)
+{
+ const struct super_operations *op = inode->i_sb->s_op;
+ drop_op_t drop = generic_drop_inode;
+
+ if (op && op->drop_inode)
+ drop = op->drop_inode;
+ return drop;
+}
+
+static int drop_inode_phase(struct inode *inode, enum drop_inode_phase phase)
+{
+ drop_op_t drop = drop_op(inode);
+ return drop(inode, phase);
+}
+
/*
* Called when we're dropping the last reference
* to an inode.
@@ -1246,12 +1280,16 @@ EXPORT_SYMBOL_GPL(generic_drop_inode);
*/
static inline void iput_final(struct inode *inode)
{
- const struct super_operations *op = inode->i_sb->s_op;
- void (*drop)(struct inode *) = generic_drop_inode;
+ drop_op_t drop = drop_op(inode);

- if (op && op->drop_inode)
- drop = op->drop_inode;
- drop(inode);
+ if (drop(inode, DIP_START_FREEING))
+ return;
+ spin_unlock(&inode_lock);
+ drop(inode, DIP_DELETING);
+ spin_lock(&inode_lock);
+ drop(inode, DIP_UNHASHING);
+ spin_unlock(&inode_lock);
+ drop(inode, DIP_DESTROYING);
}

static void real_iput(struct inode *inode)
@@ -1320,9 +1358,66 @@ static void add_pending_iput(struct postponed_inodes *ppi, struct inode *inode)

static void process_postponed_inodes(struct postponed_inodes *ppi)
{
+ struct inode *inode;
unsigned i;
- for (i = 0; i < ppi->nr; i++)
- real_iput(ppi->inodes[i]);
+
+ /*
+ * Parallel iput is done in four phases:
+ *
+ * A) Removing the inode from sb lists and marking the inode as
+ * I_FREEING under inode_lock.
+ * B) Deleting/truncating/clearing the inode with no lock.
+ * C) Unhashing the inode under inode_lock.
+ * D) Issuing a wakeup on the inode and destroying.
+ */
+
+ /*
+ * Phase (A) Removing inodes from the super_blocks. We do this under
+ * lock and may discover inodes that aren't unused anymore.
+ *
+ * Phase (A) may be cancelled by the fs, in which case we just forget
+ * we were looking at the inode.
+ */
+ spin_lock(&inode_lock);
+ for (i = 0; i < ppi->nr; i++) {
+ int ret;
+ inode = ppi->inodes[i];
+ if (!atomic_dec_and_test(&inode->i_count)) {
+ /* Someone is using this guy. */
+ ppi->inodes[i] = NULL;
+ continue;
+ }
+ ret = drop_inode_phase(inode, DIP_START_FREEING);
+ if (ret)
+ ppi->inodes[i] = NULL;
+ }
+ spin_unlock(&inode_lock);
+
+ /* Phase (B). Deleting of the inode, which may incur writeout. */
+ for (i = 0; i < ppi->nr; i++) {
+ inode = ppi->inodes[i];
+ if (!inode)
+ continue;
+ drop_inode_phase(inode, DIP_DELETING);
+ }
+
+ /* Phase (C). Unhashing of the inode. */
+ spin_lock(&inode_lock);
+ for (i = 0; i < ppi->nr; i++) {
+ inode = ppi->inodes[i];
+ if (!inode)
+ continue;
+ drop_inode_phase(inode, DIP_UNHASHING);
+ }
+ spin_unlock(&inode_lock);
+
+ /* Phase (D). Destroying of the in-core inode. */
+ for (i = 0; i < ppi->nr; i++) {
+ inode = ppi->inodes[i];
+ if (!inode)
+ continue;
+ drop_inode_phase(inode, DIP_DESTROYING);
+ }
}

static void release_postponed_inodes(struct postponed_inodes *ppi)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index aa90620..a7822cd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1374,13 +1374,20 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
unsigned long, loff_t *);

+enum drop_inode_phase {
+ DIP_START_FREEING,
+ DIP_DELETING,
+ DIP_UNHASHING,
+ DIP_DESTROYING,
+};
+
struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
void (*destroy_inode)(struct inode *);

void (*dirty_inode) (struct inode *);
int (*write_inode) (struct inode *, int);
- void (*drop_inode) (struct inode *);
+ int (*drop_inode) (struct inode *, enum drop_inode_phase phase);
void (*delete_inode) (struct inode *);
void (*put_super) (struct super_block *);
void (*write_super) (struct super_block *);
@@ -1907,8 +1914,8 @@ extern void iput(struct inode *);
extern struct inode * igrab(struct inode *);
extern ino_t iunique(struct super_block *, ino_t);
extern int inode_needs_sync(struct inode *inode);
-extern void generic_delete_inode(struct inode *inode);
-extern void generic_drop_inode(struct inode *inode);
+extern int generic_delete_inode(struct inode *inode, enum drop_inode_phase);
+extern int generic_drop_inode(struct inode *inode, enum drop_inode_phase);
extern void iput_drain_cpu(void);
extern void iput_drain_all(void);

2009-01-17 02:33:45

by Mike Waychison

Subject: [PATCH v1 6/8] hugetlbfs drop_inode update

Update hugetlbfs to compile with the new drop_inode callback mechanism that
works with phases.

Signed-off-by: Mike Waychison <[email protected]>
---

fs/hugetlbfs/inode.c | 72 +++++++++++++++++++++++++++++---------------------
1 files changed, 42 insertions(+), 30 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 6903d37..986fed4 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -388,46 +388,58 @@ static void hugetlbfs_delete_inode(struct inode *inode)
clear_inode(inode);
}

-static void hugetlbfs_forget_inode(struct inode *inode) __releases(inode_lock)
+static int hugetlbfs_forget_inode(struct inode *inode,
+ enum drop_inode_phase phase)
{
struct super_block *sb = inode->i_sb;

- if (!hlist_unhashed(&inode->i_hash)) {
- if (!(inode->i_state & (I_DIRTY|I_SYNC)))
- list_move(&inode->i_list, &inode_unused);
- inodes_stat.nr_unused++;
- if (!sb || (sb->s_flags & MS_ACTIVE)) {
- spin_unlock(&inode_lock);
- return;
+ switch (phase) {
+ case DIP_START_FREEING:
+ if (!hlist_unhashed(&inode->i_hash)) {
+ if (!(inode->i_state & (I_DIRTY|I_SYNC)))
+ list_move(&inode->i_list, &inode_unused);
+ inodes_stat.nr_unused++;
+ if (!sb || (sb->s_flags & MS_ACTIVE))
+ return 1;
+ inode->i_state |= I_WILL_FREE;
}
- inode->i_state |= I_WILL_FREE;
- spin_unlock(&inode_lock);
- /*
- * write_inode_now is a noop as we set BDI_CAP_NO_WRITEBACK
- * in our backing_dev_info.
- */
- write_inode_now(inode, 1);
- spin_lock(&inode_lock);
- inode->i_state &= ~I_WILL_FREE;
- inodes_stat.nr_unused--;
- hlist_del_init(&inode->i_hash);
+ break;
+ case DIP_DELETING:
+ if (!hlist_unhashed(&inode->i_hash)) {
+ /*
+ * write_inode_now is a noop as we set
+ * BDI_CAP_NO_WRITEBACK in our backing_dev_info.
+ */
+ write_inode_now(inode, 1);
+ }
+ break;
+ case DIP_UNHASHING:
+ if (!hlist_unhashed(&inode->i_hash)) {
+ inode->i_state &= ~I_WILL_FREE;
+ inodes_stat.nr_unused--;
+ hlist_del_init(&inode->i_hash);
+ }
+ list_del_init(&inode->i_list);
+ list_del_init(&inode->i_sb_list);
+ inode->i_state |= I_FREEING;
+ inodes_stat.nr_inodes--;
+ break;
+ case DIP_DESTROYING:
+ truncate_hugepages(inode, 0);
+ clear_inode(inode);
+ destroy_inode(inode);
+ break;
}
- list_del_init(&inode->i_list);
- list_del_init(&inode->i_sb_list);
- inode->i_state |= I_FREEING;
- inodes_stat.nr_inodes--;
- spin_unlock(&inode_lock);
- truncate_hugepages(inode, 0);
- clear_inode(inode);
- destroy_inode(inode);
+ return 0;
}

-static void hugetlbfs_drop_inode(struct inode *inode)
+static int hugetlbfs_drop_inode(struct inode *inode,
+ enum drop_inode_phase phase)
{
if (!inode->i_nlink)
- generic_delete_inode(inode);
+ return generic_delete_inode(inode, phase);
else
- hugetlbfs_forget_inode(inode);
+ return hugetlbfs_forget_inode(inode, phase);
}

static inline void

2009-01-17 02:34:11

by Mike Waychison

[permalink] [raw]
Subject: [PATCH v1 7/8] Make drop_caches flush pending dput()s and iput()s

When doing echo 3 > /proc/sys/vm/drop_caches, we wish to free up as much inode
and dentry slab as possible. This path is updated to flush any pending iputs
and dputs once the shrinkers are called.

Signed-off-by: Mike Waychison <[email protected]>
---

fs/drop_caches.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 3e5637f..1e47470 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -60,6 +60,7 @@ static void drop_slab(void)
do {
nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
} while (nr_objects > 10);
+ dput_drain_all();
}

int drop_caches_sysctl_handler(ctl_table *table, int write,
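
dput_drain_all() itself comes from patch 1 of the series. As a rough sketch of
what such a drain helper could look like (my assumption, not the patch's actual
implementation; it presumes a per-cpu dput_drain_cpu() helper analogous to the
iput_drain_cpu() declared in include/linux/fs.h):

static void dput_drain_cpu_work(struct work_struct *unused)
{
	/* Flush this CPU's queue of postponed dentries (assumed helper). */
	dput_drain_cpu();
}

void dput_drain_all(void)
{
	/*
	 * Run the per-cpu drain in process context on every CPU;
	 * schedule_on_each_cpu() waits for all of them to complete.
	 */
	schedule_on_each_cpu(dput_drain_cpu_work);
}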

2009-01-17 02:34:36

by Mike Waychison

[permalink] [raw]
Subject: [PATCH v1 8/8] Make the sync path drain dentries and inodes

When calling sync(2), we should be flushing any pending dputs and iputs so that
any pending deletes are finalized and the metadata updated.

This patch requires more work as it's unclear whether any iput()s in this path
need to be flushed (synchronously?) or not.

Signed-off-by: Mike Waychison <[email protected]>
---

fs/sync.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index ac02b56..5a595ad 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -23,6 +23,11 @@
*/
static void do_sync(unsigned long wait)
{
+ /*
+ * Begin by making sure that we've flushed out any pending dput()s and
+ * iput()s.
+ */
+ dput_drain_all();
wakeup_pdflush(0);
sync_inodes(0); /* All mappings, inodes and their blockdevs */
DQUOT_SYNC(NULL);

2009-01-17 07:05:01

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH v1 0/8] Deferred dput() and iput() -- reducing lock contention

Mike Waychison wrote:
> We've noticed that at times it can become very easy to have a system begin to
> livelock on dcache_lock/inode_lock (specifically in atomic_dec_and_lock()) when
> a lot of dentries are getting finalized at the same time (massive delete and
> large fdtable destructions are two paths I've seen cause problems).
>
> This patchset is an attempt to try and reduce the locking overheads associated
> with final dput() and final iput(). This is done by batching dentries and
> inodes into per-process queues and processing them in 'parallel' to consolidate
> some of the locking.
>
> Besides various workload testing, I threw together a load (at the end of this
> email) that causes massive fdtables (50K sockets by default) to get destroyed
> on each cpu in the system. It also populates the dcache for procfs on those
> tasks for good measure. Comparing lock_stat results (hardware is a Sun x4600
> M2 populated with 8 4-core 2.3GHz packages (32 CPUs) + 128GiB RAM):
>

Hello Mike

Seems like quite a large/intrusive infrastructure for a well-known problem.
I even spent some time on it myself, but it seems nobody cared too much
or people were too busy.

https://kerneltrap.org/mailarchive/linux-netdev/2008/12/11/4397594

(patch 6 should be discarded, as follow-ups show it was wrong:
[PATCH 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU)


Sockets/pipes don't need dcache_lock or inode_lock at all, and I am
sure Google machines also use sockets :)

Your test/bench program is quite biased (populating the dcache for procfs and
using 50K file descriptors on 32 CPUs is not very realistic IMHO).

I had a workload with processes using 1,000,000 file descriptors
(mainly sockets) and got some latency problems when they had to exit().
That problem was addressed by a single cond_resched() added in close_files()
(commit 944be0b224724fcbf63c3a3fe3a5478c325a6547).


> BEFORE PATCHSET:
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> dcache_lock: 5666485 10086172 11.16 2274036.47 3821690090.32 11042789 14926411 2.78 42138.28 65887138.46
> -----------
> dcache_lock 3016578 [<ffffffff810bd7e7>] d_alloc+0x190/0x1e1
> dcache_lock 1844569 [<ffffffff810bcdfc>] d_instantiate+0x2c/0x51
> dcache_lock 4223784 [<ffffffff8119bd88>] _atomic_dec_and_lock+0x3c/0x5c
> dcache_lock 207544 [<ffffffff810bc36a>] d_rehash+0x1b/0x43
> -----------
> dcache_lock 2305449 [<ffffffff810bcdfc>] d_instantiate+0x2c/0x51
> dcache_lock 1954160 [<ffffffff810bd7e7>] d_alloc+0x190/0x1e1
> dcache_lock 4169163 [<ffffffff8119bd88>] _atomic_dec_and_lock+0x3c/0x5c
> dcache_lock 866247 [<ffffffff810bc36a>] d_rehash+0x1b/0x43
>
> ...............................................................................................................................................................................................
>
> inode_lock: 202813 203064 12.91 376655.85 37696597.26 5136095 8001156 3.37 15142.77 5033144.13
> ----------
> inode_lock 20192 [<ffffffff810bfdbc>] new_inode+0x27/0x63
> inode_lock 1 [<ffffffff810c6f58>] sync_inode+0x19/0x39
> inode_lock 5 [<ffffffff810c76b5>] __mark_inode_dirty+0xe2/0x165
> inode_lock 2 [<ffffffff810c6e92>] __writeback_single_inode+0x1bf/0x26c
> ----------
> inode_lock 20198 [<ffffffff810bfdbc>] new_inode+0x27/0x63
> inode_lock 1 [<ffffffff810c76b5>] __mark_inode_dirty+0xe2/0x165
> inode_lock 5 [<ffffffff810c70fc>] generic_sync_sb_inodes+0x33/0x31a
> inode_lock 165016 [<ffffffff8119bd88>] _atomic_dec_and_lock+0x3c/0x5c
>
> ...............................................................................................................................................................................................
>
> &sbsec->isec_lock: 34428 34431 16.77 67593.54 3780131.15 1415432 3200261 4.06 13357.96 1180726.21
>
>
>
>
>
>
> AFTER PATCHSET [Note that inode_lock doesn't even show up anymore]:
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> dcache_lock: 2732103 5254824 11.04 1314926.88 2187466928.10 5551173 9751332 2.43 78012.24 42830806.29
> -----------
> dcache_lock 3015590 [<ffffffff810bdb04>] d_alloc+0x190/0x1e1
> dcache_lock 1966978 [<ffffffff810bd118>] d_instantiate+0x2c/0x51
> dcache_lock 258657 [<ffffffff810bc3cd>] d_rehash+0x1b/0x43
> dcache_lock 30 [<ffffffff810b7ad4>] __link_path_walk+0x42d/0x665
> -----------
> dcache_lock 2493589 [<ffffffff810bd118>] d_instantiate+0x2c/0x51
> dcache_lock 1862921 [<ffffffff810bdb04>] d_alloc+0x190/0x1e1
> dcache_lock 882054 [<ffffffff810bc3cd>] d_rehash+0x1b/0x43
> dcache_lock 15504 [<ffffffff810bc5e1>] process_postponed_dentries+0x3e/0x29f
>
> ...............................................................................................................................................................................................
>
>
> ChangeLog:
> [v1] - Jan 16, 2009
> - Initial version
>
> Patches in series:
>
> 1) Deferred batching of dput()
> 2) Parallel dput()
> 3) Deferred batching of iput()
> 4) Fixing iput called from put_super path
> 5) Parallelize iput()
> 6) hugetlbfs drop_inode update
> 7) Make drop_caches flush pending dput()s and iput()s
> 8) Make the sync path drain dentries and inodes
>
> Caveats:
>
> * It's not clear to me how to make 'sync(2)' do what it should. Currently we
> flush pending dputs and iputs before writing anything out, but presumably,
> there may be hidden iputs in the do_sync path that really ought to be
> synchronous.
> * I don't like the changes in patch 4 and it should probably be changed to
> have the ->put_super callback paths call a synchronous version of iput()
> (perhaps sync_iput()?) instead of having to tag the inodes with S_SYNCIPUT.
> * cramfs, gfs2, ocfs2 need to be updated for the new drop_inode semantics.
> ---
>
> fs/block_dev.c | 2
> fs/dcache.c | 374 +++++++++++++++++++++++++++++++++++++----
> fs/drop_caches.c | 1
> fs/ext2/super.c | 2
> fs/gfs2/glock.c | 2
> fs/gfs2/ops_fstype.c | 2
> fs/hugetlbfs/inode.c | 72 +++++---
> fs/inode.c | 441 +++++++++++++++++++++++++++++++++++++++++-------
> fs/ntfs/super.c | 2
> fs/smbfs/inode.c | 2
> fs/super.c | 6 -
> fs/sync.c | 5 +
> include/linux/dcache.h | 1
> include/linux/fs.h | 18 ++
> 14 files changed, 787 insertions(+), 143 deletions(-)
>

2009-01-17 08:12:26

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v1 0/8] Deferred dput() and iput() -- reducing lock contention

On Fri, Jan 16, 2009 at 06:29:36PM -0800, Mike Waychison wrote:
> We've noticed that at times it can become very easy to have a system begin to
> livelock on dcache_lock/inode_lock (specifically in atomic_dec_and_lock()) when
> a lot of dentries are getting finalized at the same time (massive delete and
> large fdtable destructions are two paths I've seen cause problems).
>
> This patchset is an attempt to try and reduce the locking overheads associated
> with final dput() and final iput(). This is done by batching dentries and
> inodes into per-process queues and processing them in 'parallel' to consolidate
> some of the locking.

Hmmmm. This deferring of dput/iput will have the same class of
effects on filesystems as the recently reverted changes to make
generic_delete_inode() an asynchronous process. That is, it
temporally separates the transaction for namespace deletion (i.e.
unlink) from the transaction that completes the inode deletion that
occurs, typically, during ->clear_inode. See the recent thread
titled:

[PATCH] async: Don't call async_synchronize_full_special() while holding sb_lock

For more details.

I suspect that change is likely to cause worse problems than the
async changes in that it doesn't have a cap on the number of
deferred operations.....

> Besides various workload testing,

Details?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2009-01-17 10:15:22

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH v1 1/8] Deferred batching of dput()

Hi Mike.

On Fri, Jan 16, 2009 at 06:29:42PM -0800, Mike Waychison ([email protected]) wrote:
> +static void postpone_dput(struct dentry *dentry)
> +{
> + struct postponed_dentries *ppd, *new_ppd;
> +
> +again:
> + ppd = get_cpu_var(postponed_dentries);
> + if (!pending_dput_full(ppd)) {
> + add_pending_dput(ppd, dentry);
> + put_cpu_var(postponed_dentries);
> + return;
> + }
> +
> + /* need to flush out existing pending dentries. */
> + put_cpu_var(postponed_dentries);
> + /* Allocate more space.. */
> + new_ppd = new_postponed_dentries();
> + if (!new_ppd) {
> + /* Take the slow path, memory is low */
> + struct postponed_dentries_onstack ppd_onstack;
> + struct postponed_dentries *ppd;
> +
> + ppd = init_ppd_onstack(&ppd_onstack);
> + add_pending_dput(ppd, dentry);
> + process_postponed_dentries(ppd);
> + return;
> + }

Why not just put the dentry directly in the low-memory case?

--
Evgeniy Polyakov

2009-01-17 10:18:23

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH v1 3/8] Deferred batching of iput()

On Fri, Jan 16, 2009 at 06:29:52PM -0800, Mike Waychison ([email protected]) wrote:
> +static void postpone_iput(struct inode *inode)
> +{
> + struct postponed_inodes *ppi, *new_ppi;
> +
> +again:
> + ppi = get_cpu_var(postponed_inodes);
> + if (!pending_iput_full(ppi)) {
> + add_pending_iput(ppi, inode);
> + put_cpu_var(postponed_inodes);
> + return;
> + }
> +
> + /* Need to flush out existing pending inodes */
> + put_cpu_var(postponed_inodes);
> + /* Allocate more space.. */
> + new_ppi = new_postponed_inodes();
> + if (!new_ppi) {
> + struct postponed_inodes_onstack ppi_onstack;
> +
> + ppi = init_ppi_onstack(&ppi_onstack);
> + add_pending_iput(ppi, inode);
> + process_postponed_inodes(ppi);
> + return;
> + }

The same question here: why not just call real_iput() directly, without
the on-stack allocation?

--
Evgeniy Polyakov

2009-01-20 19:01:50

by Mike Waychison

[permalink] [raw]
Subject: Re: [PATCH v1 0/8] Deferred dput() and iput() -- reducing lock contention

Dave Chinner wrote:
> On Fri, Jan 16, 2009 at 06:29:36PM -0800, Mike Waychison wrote:
>> We've noticed that at times it can become very easy to have a system begin to
>> livelock on dcache_lock/inode_lock (specifically in atomic_dec_and_lock()) when
>> a lot of dentries are getting finalized at the same time (massive delete and
>> large fdtable destructions are two paths I've seen cause problems).
>>
>> This patchset is an attempt to try and reduce the locking overheads associated
>> with final dput() and final iput(). This is done by batching dentries and
>> inodes into per-process queues and processing them in 'parallel' to consolidate
>> some of the locking.
>
> Hmmmm. This deferring of dput/iput will have the same class of
> effects on filesystems as the recently reverted changes to make
> generic_delete_inode() an asynchronous process. That is, it
> temporally separates the transaction for namespace deletion (i.e.
> unlink) from the transaction that completes the inode deletion that
> occurs, typically, during ->clear_inode. See the recent thread
> titled:
>
> [PATCH] async: Don't call async_synchronize_full_special() while holding sb_lock
>
> For more details.
>
> I suspect that change is likely to cause worse problems than the
> async changes in that it doesn't have a cap on the number of
> deferred operations.....

I'll dig through the archives and try to come up with a response later
today.

>
>> Besides various workload testing,
>
> Details?
>

I ran a couple of different workloads, though I was looking for
stability/correctness rather than performance: iozone (ext2 +
ext4/nojournal), dbench (ext2), tbench (ext2), specjbb, unixbench and
kernbench, as well as a couple of internal benchmarks.

2009-01-20 20:01:24

by Mike Waychison

[permalink] [raw]
Subject: Re: [PATCH v1 0/8] Deferred dput() and iput() -- reducing lock contention

Eric Dumazet wrote:
> Mike Waychison wrote:
>> We've noticed that at times it can become very easy to have a system begin to
>> livelock on dcache_lock/inode_lock (specifically in atomic_dec_and_lock()) when
>> a lot of dentries are getting finalized at the same time (massive delete and
>> large fdtable destructions are two paths I've seen cause problems).
>>
>> This patchset is an attempt to try and reduce the locking overheads associated
>> with final dput() and final iput(). This is done by batching dentries and
>> inodes into per-process queues and processing them in 'parallel' to consolidate
>> some of the locking.
>>
>> Besides various workload testing, I threw together a load (at the end of this
>> email) that causes massive fdtables (50K sockets by default) to get destroyed
>> on each cpu in the system. It also populates the dcache for procfs on those
>> tasks for good measure. Comparing lock_stat results (hardware is a Sun x4600
>> M2 populated with 8 4-core 2.3GHz packages (32 CPUs) + 128GiB RAM):
>>
>
> Hello Mike
>
> Seems like quite a large/intrusive infrastructure for a well-known problem.
> I even spent some time on it myself, but it seems nobody cared too much
> or people were too busy.
>
> https://kerneltrap.org/mailarchive/linux-netdev/2008/12/11/4397594
>
> (patch 6 should be discarded, as follow-ups show it was wrong:
> [PATCH 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU)
>
>
> Sockets/pipes don't need dcache_lock or inode_lock at all, and I am
> sure Google machines also use sockets :)

Yup :) I'll try to take a look at your patches this week. At a
minimum, the removal of the locks seems highly desirable.

>
> Your test/bench program is quite biased (populating the dcache for procfs and
> using 50K file descriptors on 32 CPUs is not very realistic IMHO).

Yup, extremely biased. It was meant to hurt the dput/iput path
specifically, and I used it as a way to compare apples to apples with
and without the changes. It is still representative of a real-world
workload we see, though (when our frontend servers are restarted they
have many TCP sockets open, easily more than 50K each).

>
> I had a workload with processes using 1,000,000 file descriptors
> (mainly sockets) and got some latency problems when they had to exit().
> That problem was addressed by a single cond_resched() added in close_files()
> (commit 944be0b224724fcbf63c3a3fe3a5478c325a6547).
>

Yup. We pulled that change into our tree a while back for the same
reason. It doesn't help the lock contention issue though.

2009-01-20 20:07:39

by Mike Waychison

[permalink] [raw]
Subject: Re: [PATCH v1 1/8] Deferred batching of dput()

Evgeniy Polyakov wrote:
> Hi Mike.
>
> On Fri, Jan 16, 2009 at 06:29:42PM -0800, Mike Waychison ([email protected]) wrote:
>> +static void postpone_dput(struct dentry *dentry)
>> +{
>> + struct postponed_dentries *ppd, *new_ppd;
>> +
>> +again:
>> + ppd = get_cpu_var(postponed_dentries);
>> + if (!pending_dput_full(ppd)) {
>> + add_pending_dput(ppd, dentry);
>> + put_cpu_var(postponed_dentries);
>> + return;
>> + }
>> +
>> + /* need to flush out existing pending dentries. */
>> + put_cpu_var(postponed_dentries);
>> + /* Allocate more space.. */
>> + new_ppd = new_postponed_dentries();
>> + if (!new_ppd) {
>> + /* Take the slow path, memory is low */
>> + struct postponed_dentries_onstack ppd_onstack;
>> + struct postponed_dentries *ppd;
>> +
>> + ppd = init_ppd_onstack(&ppd_onstack);
>> + add_pending_dput(ppd, dentry);
>> + process_postponed_dentries(ppd);
>> + return;
>> + }
>
> Why not just put the dentry directly in the low-memory case?
>

You're right. This path could just use the original dput path. I'll
fix it up in the next version.
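
In other words, the allocation-failure branch quoted above could collapse to
something like this (sketch only; real_dput() is a placeholder name for
whatever the original, non-deferred put path ends up being called):

	new_ppd = new_postponed_dentries();
	if (!new_ppd) {
		/*
		 * Memory is low: skip the deferral machinery and take the
		 * original, synchronous dput() path for this one dentry.
		 */
		real_dput(dentry);
		return;
	}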

2009-01-20 20:07:59

by Mike Waychison

[permalink] [raw]
Subject: Re: [PATCH v1 3/8] Deferred batching of iput()

Evgeniy Polyakov wrote:
> On Fri, Jan 16, 2009 at 06:29:52PM -0800, Mike Waychison ([email protected]) wrote:
>> +static void postpone_iput(struct inode *inode)
>> +{
>> + struct postponed_inodes *ppi, *new_ppi;
>> +
>> +again:
>> + ppi = get_cpu_var(postponed_inodes);
>> + if (!pending_iput_full(ppi)) {
>> + add_pending_iput(ppi, inode);
>> + put_cpu_var(postponed_inodes);
>> + return;
>> + }
>> +
>> + /* Need to flush out existing pending inodes */
>> + put_cpu_var(postponed_inodes);
>> + /* Allocate more space.. */
>> + new_ppi = new_postponed_inodes();
>> + if (!new_ppi) {
>> + struct postponed_inodes_onstack ppi_onstack;
>> +
>> + ppi = init_ppi_onstack(&ppi_onstack);
>> + add_pending_iput(ppi, inode);
>> + process_postponed_inodes(ppi);
>> + return;
>> + }
>
> The same question here: why not just call real_iput() directly, without
> the on-stack allocation?
>

Agreed. Will fix in the next version.

2009-01-21 05:53:25

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v1 0/8] Deferred dput() and iput() -- reducing lock contention

Mike Waychison <[email protected]> writes:

> livelock on dcache_lock/inode_lock (specifically in atomic_dec_and_lock())

I'm not sure how something can livelock in atomic_dec_and_lock which
doesn't take a spinlock itself? Are you saying you run into NUMA memory
unfairness here? Or did I misparse you?

> This patchset is an attempt to try and reduce the locking overheads associated
> with final dput() and final iput(). This is done by batching dentries and
> inodes into per-process queues and processing them in 'parallel' to consolidate
> some of the locking.

I was wondering what this does to the latencies when dput/iput
is only done for very objects. Does it increase costs then
significantly?

As a high level comment it seems like a lot of work to work
around global locks, like the inode_lock, where it might be better to
just split the lock up? Mind you I don't have a clear proposal
how to do that, but surely it's doable somehow.

-Andi

--
[email protected] -- Speaking for myself only.

2009-01-21 06:21:17

by Mike Waychison

[permalink] [raw]
Subject: Re: [PATCH v1 0/8] Deferred dput() and iput() -- reducing lock contention

Andi Kleen wrote:
> Mike Waychison <[email protected]> writes:
>
>> livelock on dcache_lock/inode_lock (specifically in atomic_dec_and_lock())
>
> I'm not sure how something can livelock in atomic_dec_and_lock which
> doesn't take a spinlock itself? Are you saying you run into NUMA memory
> unfairness here? Or did I misparse you?

By atomic_dec_and_lock, I really meant to say _atomic_dec_and_lock().
It takes the spinlock if the cmpxchg hidden inside atomic_dec_unless fails.

There are likely NUMA unfairness issues at play, but it's not the main
worry at this point.

>
>> This patchset is an attempt to try and reduce the locking overheads associated
>> with final dput() and final iput(). This is done by batching dentries and
>> inodes into per-process queues and processing them in 'parallel' to consolidate
>> some of the locking.
>
> I was wondering what this does to the latencies when dput/iput
> is only done for very objects. Does it increase costs then
> significantly?

very objects?

>
> As a high level comment it seems like a lot of work to work
> around global locks, like the inode_lock, where it might be better to
> just split the lock up? Mind you I don't have a clear proposal
> how to do that, but surely it's doable somehow.
>

Perhaps.. the only plausible way I can think this would be doable would
be to rework the global resources (like the global inode_unused LRU list
and deal with inode state transitions), but even then, some sort of
consistency needs to happen at the super_block level, which means the
smallest I can see the lock becoming would be per-super_block, which
doesn't solve the problem afaict.

2009-01-21 08:33:19

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v1 0/8] Deferred dput() and iput() -- reducing lock contention

On Tue, Jan 20, 2009 at 10:22:00PM -0800, Mike Waychison wrote:
> Andi Kleen wrote:
> >Mike Waychison <[email protected]> writes:
> >
> >>livelock on dcache_lock/inode_lock (specifically in
> >>atomic_dec_and_lock())
> >
> >I'm not sure how something can livelock in atomic_dec_and_lock which
> >doesn't take a spinlock itself? Are you saying you run into NUMA memory
> >unfairness here? Or did I misparse you?
>
> By atomic_dec_and_lock, I really meant to say _atomic_dec_and_lock().

Ok. So it's basically just the lock that is taken?

In theory one could likely provide an x86-specific dec_and_lock that
might perform better and doesn't lock if the count is still > 0, but that
would only help if the reference count is still > 0. Is that a common
situation in your test?

> It takes the spinlock if the cmpxchg hidden inside atomic_dec_unless fails.
>
> There are likely NUMA unfairness issues at play, but it's not the main
> worry at this point.
>
> >
> >>This patchset is an attempt to try and reduce the locking overheads
> >>associated
> >>with final dput() and final iput(). This is done by batching dentries and
> >>inodes into per-process queues and processing them in 'parallel' to
> >>consolidate
> >>some of the locking.
> >
> >I was wondering what this does to the latencies when dput/iput
> >is only done for very objects. Does it increase costs then
> >significantly?
>
> very objects?

Sorry.

"is only done for very few objects". Somehow the few got lost.
Basically latency in the unloaded case.

I always worry when people do complicated things for the high
load case how the more usual "do it for a single object" workload
fares.

>
> >
> >As a high level comment it seems like a lot of work to work
> >around global locks, like the inode_lock, where it might be better to
> >just split the lock up? Mind you I don't have a clear proposal
> >how to do that, but surely it's doable somehow.
> >
>
> Perhaps.. the only plausible way I can think this would be doable would
> be to rework the global resources (like the global inode_unused LRU list

One simple way would be to just use multiple lists, each with its own
lock. I doubt that would impact the LRU behaviour very much.

> and deal with inode state transitions), but even then, some sort of
> consistency needs to happen at the super_block level,

The sb could also look at multiple lists?

-Andi
--
[email protected] -- Speaking for myself only.

2009-01-21 17:28:47

by Mike Waychison

[permalink] [raw]
Subject: Re: [PATCH v1 0/8] Deferred dput() and iput() -- reducing lock contention

Andi Kleen wrote:
> On Tue, Jan 20, 2009 at 10:22:00PM -0800, Mike Waychison wrote:
>> Andi Kleen wrote:
>>> Mike Waychison <[email protected]> writes:
>>>
>>>> livelock on dcache_lock/inode_lock (specifically in
>>>> atomic_dec_and_lock())
>>> I'm not sure how something can livelock in atomic_dec_and_lock which
>>> doesn't take a spinlock itself? Are you saying you run into NUMA memory
>>> unfairness here? Or did I misparse you?
>> By atomic_dec_and_lock, I really meant to say _atomic_dec_and_lock().
>
> Ok. So it's basically just the lock that is taken?
>
> In theory one could likely provide an x86-specific dec_and_lock that
> might perform better and doesn't lock if the count is still > 0, but that
> would only help if the reference count is still > 0. Is that a common
> situation in your test?
>

This is precisely what _atomic_dec_and_lock does. It only locks if it
looks like the counter will be hitting 0 with this dec.

It is a common case for the count to be > 1; however, the trouble we see
happens when there are a large number of final dputs (and thus also final
iputs).

>> It takes the spinlock if the cmpxchg hidden inside atomic_dec_unless fails.
>>
>> There are likely NUMA unfairness issues at play, but it's not the main
>> worry at this point.
>>
>>>> This patchset is an attempt to try and reduce the locking overheads
>>>> associated
>>>> with final dput() and final iput(). This is done by batching dentries and
>>>> inodes into per-process queues and processing them in 'parallel' to
>>>> consolidate
>>>> some of the locking.
>>> I was wondering what this does to the latencies when dput/iput
>>> is only done for very objects. Does it increase costs then
>>> significantly?
>> very objects?
>
> Sorry.
>
> "is only done for very few objects". Somehow the few got lost.
> Basically latency in the unloaded case.
>
> I always worry when people do complicated things for the high
> load case how the more usual "do it for a single object" workload
> fares.

Hmm. I'm not sure. The inline latency should work out to be better
than before on average (the common case is to disable preemption,
enqueue on the postponed list and re-enable preemption, which is fast).
It's the clearing of the queues that may take a little longer, though
the locking overhead should be the same. The trouble is that the queue
itself may not be cache-hot, and that would add a bit of latency.

Would you know of a good way to measure this latency in the few-object case?

>
>>> As a high level comment it seems like a lot of work to work
>>> around global locks, like the inode_lock, where it might be better to
>>> just split the lock up? Mind you I don't have a clear proposal
>>> how to do that, but surely it's doable somehow.
>>>
>> Perhaps.. the only plausible way I can think this would be doable would
>> be to rework the global resources (like the global inode_unused LRU list
>
> One simple way would be to just use multiple lists, each with its own
> lock. I doubt that would impact the LRU behaviour very much.
>
>> and deal with inode state transitions), but even then, some sort of
>> consistency needs to happen at the super_block level,
>
> The sb could also look at multiple lists?

Maybe (hashing on inode nr perhaps?).

The other trouble is the dcache_lock, which for the most part stands in
front of the inode_lock. While developing this patchset, I was
originally chasing dcache_lock contention issues. Once I relieved the
contention via deferred batching and parallel processing of dput(), I
was surprised to see that performance didn't get any better because now
I was mostly contending on inode_lock :\

The dcache_lock, again, could maybe be broken out to be per-super_block,
but I have no idea how to make that work with mountpoint traversals.
Breaking this lock into a set of smaller locks seems difficult at best.
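
As an illustration of the "multiple lists, each with its own lock" idea (purely
a sketch of one possible direction, not something from this patchset), the
unused-inode LRU could be split into a small array of buckets keyed by inode
number:

#define INODE_LRU_BUCKETS	16	/* sketch: arbitrary small power of two */

struct inode_lru_bucket {
	spinlock_t lock;		/* initialised at boot with spin_lock_init() */
	struct list_head unused;	/* and INIT_LIST_HEAD() */
};

static struct inode_lru_bucket inode_lru[INODE_LRU_BUCKETS];

static struct inode_lru_bucket *inode_lru_bucket(struct inode *inode)
{
	/* Hash on the inode number so inodes spread across the buckets. */
	return &inode_lru[inode->i_ino % INODE_LRU_BUCKETS];
}

static void inode_lru_add(struct inode *inode)
{
	struct inode_lru_bucket *b = inode_lru_bucket(inode);

	spin_lock(&b->lock);
	list_add(&inode->i_list, &b->unused);
	spin_unlock(&b->lock);
}

A super_block wanting to evict or sync its inodes would then have to walk all
of the buckets, which is roughly the "the sb could also look at multiple
lists" point above.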

2009-01-29 02:09:44

by Mike Waychison

[permalink] [raw]
Subject: Re: [PATCH v1 0/8] Deferred dput() and iput() -- reducing lock contention

Dave Chinner wrote:
> On Fri, Jan 16, 2009 at 06:29:36PM -0800, Mike Waychison wrote:
>> We've noticed that at times it can become very easy to have a system begin to
>> livelock on dcache_lock/inode_lock (specifically in atomic_dec_and_lock()) when
>> a lot of dentries are getting finalized at the same time (massive delete and
>> large fdtable destructions are two paths I've seen cause problems).
>>
>> This patchset is an attempt to try and reduce the locking overheads associated
>> with final dput() and final iput(). This is done by batching dentries and
>> inodes into per-process queues and processing them in 'parallel' to consolidate
>> some of the locking.
>
> Hmmmm. This deferring of dput/iput will have the same class of
> effects on filesystems as the recently reverted changes to make
> generic_delete_inode() an asynchronous process. That is, it
> temporally separates the transaction for namespace deletion (i.e.
> unlink) from the transaction that completes the inode deletion that
> occurs, typically, during ->clear_inode. See the recent thread
> titled:
>
> [PATCH] async: Don't call async_synchronize_full_special() while holding sb_lock
>
> For more details.
>
> I suspect that change is likely to cause worse problems than the
> async changes in that it doesn't have a cap on the number of
> deferred operations.....

*sigh*. So I did some testing with XFS today and you're right,
separating the namespace operation from the actual file delete really
slows things down.

I'll follow up with a v2 patchset with more precise details, but
basically, I create (on one fs) 100K empty files per CPU (on an 8-core
machine). I then time how long it takes for one process per CPU to
delete its own hierarchy of 100K files.

XFS performs horribly at this test (taking approx 34 minutes to
complete). With my changes, it takes approx 57 minutes. For
comparison, ext2 takes ~4.5 seconds, ext4 ~30 seconds and ext4 without a
journal ~4.1 seconds.

Seems to me like there is a design issue causing XFS to suck at mass
deletes, and I can understand why you wouldn't want to separate the
namespace+inode drop.

What say you to letting the fs itself specify whether its inodes should
be handled serially (as they are today) or in deferred batches? Patch 4
of this series introduced an I_SYNCIPUT flag to deal with umount issues;
perhaps it would make sense to have XFS inodes tagged with this flag?

Mike Waychison
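
For what it's worth, the opt-out being proposed could look roughly like the
following (sketch only; I_SYNCIPUT is the flag introduced in patch 4, and
real_iput() is a placeholder for the original synchronous final-iput path):

static void postpone_iput(struct inode *inode)
{
	/*
	 * Sketch: filesystems that need the namespace and inode-deletion
	 * transactions to stay together (e.g. XFS) would tag their inodes
	 * with I_SYNCIPUT, and such inodes bypass the deferred queues
	 * entirely.
	 */
	if (inode->i_state & I_SYNCIPUT) {
		real_iput(inode);	/* placeholder for the original path */
		return;
	}

	/* ... otherwise queue the inode on the per-cpu postponed list ... */
}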