Every little factor of 25 performance increase really helps.
Ramback is a new virtual device with the ability to back a ramdisk
by a real disk, obtaining the performance level of a ramdisk but with
the data durability of a hard disk. To work this magic, ramback needs
a little help from a UPS. In a typical test, ramback reduced a 25
second file operation[1] to under one second including sync. Even
greater gains are possible for seek-intensive applications.
The difference between ramback and an ordinary ramdisk is: when the
machine powers down the data does not vanish because it is continuously
saved to backing store. When line power returns, the backing store
repopulates the ramdisk while allowing application io to proceed
concurrently. Once fully populated, a little green light winks on and
file operations once again run at ramdisk speed.
So now you can ask some hard questions: what if the power goes out
completely or the host crashes or something else goes wrong while
critical data is still in the ramdisk? Easy: use reliable components.
Don't crash. Measure your UPS window. This is not much to ask in
order to transform your mild mannered hard disk into a raging superdisk
able to leap tall benchmarks at a single bound.
If line power goes out while ramback is running, the UPS kicks in and a
power management script switches the driver from writeback to
writethrough mode. Ramback proceeds to save all remaining dirty data
while forcing each new application write through to backing store
immediately.
If UPS power runs out while ramback still holds unflushed dirty data
then things get ugly. Hopefully a fsck -f will be able to pull
something useful out of the mess. (This is where you might want to be
running Ext3.) The name of the game is to install sufficient UPS power
to get your dirty ramdisk data onto stable storage this time, every
time.
The basic design premise of ramback is alluringly simple: each write to
a ramdisk sets a per-chunk dirty bit. A kernel daemon continuously
scans for and flushes dirty chunks to backing store. It sounds easy,
but in practice a number of additional requirements increase the design
complexity considerably (a small sketch of the chunk state bookkeeping
follows the list below):
* Previously saved data must be reloaded into the ramdisk on startup.
* Applications need to be able to read and write ramback data during
initial loading.
* If line power is restored before the battery runs out then ramdisk
level performance should resume immediately.
* Application data should continue to be available and writable even
  during emergency data flushing.
* Racy application writes should not be able to cause the contents of
backing store to diverge from the contents of the ramdisk.
* If UPS power is limited then maximum dirty data must be limited as
well, so that power does not run out while dirty data remains.
* Cannot transfer directly between ramdisk and backing store, so must
first transfer into memory then relaunch to destination.
* Cannot submit a transfer directly from completion interrupt, so a
helper daemon is needed.
* Per chunk locking is not feasible for a terabyte scale ramdisk.
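To make the bookkeeping concrete, here is a small standalone sketch of
the per-chunk state map, written as ordinary userspace C purely for
illustration. It mirrors the get_chunk_state()/set_chunk_state() helpers
in the patch below, packing four two-bit states per byte; the calloc
sizing and the example main() are just scaffolding for the sketch, not
driver code.

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

enum chunk_state { CHUNK_EMPTY, CHUNK_CLEAN, CHUNK_DIRTY };

struct statemap {
	uint64_t chunks;
	unsigned char *state;	/* (chunks + 3) / 4 bytes, two bits per chunk */
};

unsigned get_state(struct statemap *map, uint64_t chunk)
{
	unsigned shift = 2 * (chunk & 3);
	return (map->state[chunk >> 2] >> shift) & 3;
}

void set_state(struct statemap *map, uint64_t chunk, unsigned state)
{
	unsigned shift = 2 * (chunk & 3);
	unsigned char *byte = &map->state[chunk >> 2];
	*byte = (*byte & ~(3 << shift)) | (state << shift);
}

int main(void)
{
	struct statemap map = { .chunks = 1 << 20 };	/* 4 GB of 4K chunks */
	map.state = calloc((map.chunks + 3) / 4, 1);	/* every chunk starts EMPTY */

	set_state(&map, 42, CHUNK_DIRTY);	/* an application write lands */
	printf("chunk 42 dirty: %d\n", get_state(&map, 42) == CHUNK_DIRTY);
	set_state(&map, 42, CHUNK_CLEAN);	/* the daemon launches its flush */
	free(map.state);
	return 0;
}

In the driver itself all of these transitions happen under the device
lock, because the dirty and inflight counts derived from them drive
daemon wakeup.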
In addition, there are two nasty races to consider:
1) Populate race
A chunk dirtied by an application write may be overwritten by a
chunk simultaneously read from backing store during the initial populate.
2) Flush race
A dirty chunk flush must not overwrite an application write. Even
though application IO always goes to the ramdisk, in flush mode the
resulting writethrough transfer may overtake a previously launched
dirty chunk flush and stale data will land in backing store. Also,
so that the backing store always has exactly the contents of the
ramdisk for each completed write, we would like to preserve
application write order for overlapping writes, even though this can
only happen with a racy application.
To give some sense of the resulting complexity, here is the algorithm I
implemented to close the writethrough race:
Write algorithm when in writethrough mode
Kick writethrough:
If writethrough queue empty, done
If head of writethrough queue does not overlap any member of
storing list
Remove head from writethrough queue, add to storing list
and submit
(else it will be submitted when save completes)
On application write in writethrough mode: (after populating)
Mark region clean
Add to tail of writethrough queue
Kick writethrough
On save complete:
Remove from storing list
Kick writethrough
On writethrough complete:
Remove from storing list
Kick writethrough
Endio on original write
On daemon finding a dirty chunk:
Mark chunk clean and add to storing list
submit it
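For comparison, here is the heart of that ordering rule as a standalone
sketch: a queued writethrough may be submitted only if its half-open
chunk range [start, limit) overlaps nothing already on the storing list.
The struct and list below are simplified stand-ins for the driver's hook
and storing list; the real check is kick_thru() in the patch, which
compares with >= on one side and so also serializes writes to exactly
adjacent regions.

#include <stdint.h>

typedef uint64_t chunk_t;

struct pending {
	chunk_t start, limit;		/* half-open chunk range */
	struct pending *next;
};

int ranges_overlap(chunk_t a_start, chunk_t a_limit,
		   chunk_t b_start, chunk_t b_limit)
{
	return a_start < b_limit && b_start < a_limit;
}

/* Nonzero means the head of the writethrough queue may be submitted now. */
int may_submit(struct pending *head, struct pending *storing)
{
	struct pending *p;
	for (p = storing; p; p = p->next)
		if (ranges_overlap(head->start, head->limit, p->start, p->limit))
			return 0;	/* resubmitted when that save completes */
	return 1;
}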
For the most part, ramback just solves a classic cache consistency
problem. As such, some of the techniques will be familiar to vm
hackers, such as clearing the chunk dirty bit immediately on placing
dirty data under writeout, and keeping track of inflight dirty data
separately. There are significant differences as well, but this post
is already long so I will save these details for later.
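To make that technique concrete, the accounting boils down to something
like the following sketch (locking and the actual IO are elided; the two
counters correspond to the dirty and inflight fields of struct devinfo
in the patch below). The point is that a write which re-dirties a chunk
while the old copy is still in flight simply sets the bit again, so the
daemon flushes it once more on a later sweep, and dirty plus inflight
together bound how much data still needs the battery.

struct counters {
	unsigned long dirty;	/* chunks whose latest data is not yet under writeout */
	unsigned long inflight;	/* writeouts launched but not yet completed */
};

void start_flush(struct counters *c)	/* daemon picks a dirty chunk */
{
	c->dirty--;		/* chunk is marked clean immediately... */
	c->inflight++;		/* ...and the copy now in flight is counted separately */
	/* launch the write to backing store here */
}

void flush_done(struct counters *c)	/* writeout completion */
{
	c->inflight--;
}

void redirty(struct counters *c)	/* application writes the chunk again */
{
	c->dirty++;		/* picked up again on a later sweep */
}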
What Works Now:
* Ramdisk populates from backing store when created.
* Application io allowed during initial populating.
* Application io allowed during flush on line power loss.
* Populate vs application write race closed in theory.
* Flush vs application writethrough race closed in theory.
* Writethrough mode on line power loss apparently working.
* Proc interface controls writeback vs writethrough mode.
* Proc interface displays useful status.
Corners Cut in the Interest of Releasing Early:
* Simple linear bitmap scan costs too much cpu.
* Linear algorithms for list searching will not scale.
* Could use atomic ops instead of spinlocks in places.
* Should load and save range instead of single chunks.
* Serializing all writes in writethrough mode is overkill.
* Introduce populate vs application write balancing.
* Introduce flush vs application writethrough balancing.
* Handle chunk size other than PAGE_SIZE.
* Handle noninteger number of chunks.
* Too much cut and paste bio code.
Bugs:
* Oops on dmsetup remove.
* Buggy spinlocks, so no smp for now.
* Block layer anti-deadlock measures needed.
* Writeback transfers sometimes starve application writes.
* Backing disk is sometimes idle while populating.
* Ramdisk chunks sometimes stay dirty forever.
* Undoubtedly more under the rug...
Plea for help:
This driver is ready to try for a sufficiently brave developer. It will
deadlock and livelock in various ways and you will have to reboot to
remove it. But it can already be coaxed into running well enough for
benchmarks, and when it solidifies it will be pretty darn amazing.
Note that massive amounts of tracing output can be enabled or disabled,
which is very handy for finding out how it tied itself in a knot.
* If you would like to carve your name in this driver, please send me
a bug fix.
* If you would like to carve your name in the man page, please send an
oops or SysRq backtrace to lkml so somebody can send me a bug fix.
* Please send beer for no reason at all.
Many thanks to Violin Memory[2] who inspired and supported the ramback
effort. Can you guess why they are interested in stable backing store
for large ramdisks?
[1] Untar a 2.2 kernel on a laptop
[2] http://www.violin-memory.com
User Documentation
------------------
Create a ramback with chunksize 1<<12 (only size supported for now):
echo 0 100 ramback /dev/ramdev /dev/backdev | dmsetup create <name>
Set ramback to flush mode:
echo 1 >/proc/driver/ramback/<devname>
Set ramback to normal mode:
echo 0 >/proc/driver/ramback/<devname>
Turn trace output on:
echo 256 >/proc/driver/ramback/<devname>
Show ramback status:
cat /proc/driver/ramback/<devname>
Progress monitor:
watch -n1 cat /proc/driver/ramback/<devname>
Patch applies to 2.6.23.12:
cd linux-2.6.23.12 && cat this.mail | patch -p1
A note on device mapper target names:
There is actually no way for a device mapper target to obtain its
own name, which is apparently by design, because each device mapper
device is actually a table of devices and the name of an individual
target would have to include something to distinguish it from other
targets belonging to the same virtual device. This is actually just
a symptom of deep design flaws in device mapper. For today, ramback
uses the ASCII address of its struct dm_target as its own name. If
you only have one, this will not be a problem, but something really
needs to be done about this. Namely, rewriting dm-ramback as a
standard block device. There is actually no reason for ramback to
be a device mapper device other than lack of a library for creating
standard block devices, and that can be fixed.
--- 2.6.23.12.base/drivers/md/dm-ramback.c 2008-03-08 16:47:29.000000000 -0800
+++ 2.6.23.12/drivers/md/dm-ramback.c 2008-03-09 22:54:19.000000000 -0700
@@ -0,0 +1,962 @@
+#include <linux/version.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <asm/bug.h>
+#include <linux/bio.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/blkdev.h>
+#include <linux/kthread.h>
+#include <linux/vmalloc.h>
+#include "dm.h"
+
+#include <linux/delay.h>
+
+#define warn(string, args...) do { printk("%s: " string "\n", __func__, ##args); } while (0)
+#define error(string, args...) do { warn(string, ##args); BUG(); } while (0)
+#define assert(expr) do { if (!(expr)) error("Assertion " #expr " failed!\n"); } while (0)
+#define enable(args) args
+#define disable(args)
+
+/*
+ * Ramback version 0.0
+ *
+ * Backing store for ramdisks
+ *
+ * (C) 2008 Violin Memory Inc.
+ *
+ * Original Author: Daniel Phillips <[email protected]>
+ *
+ * License: GPL v2
+ */
+
+#define SECTOR_SHIFT 9
+#define FLUSH_FLAG 1
+#define TRACE_FLAG (1 << 8)
+
+typedef uint32_t chunk_t; // up to 16 TB with 4K chunksize
+
+/*
+ * Flush mode:
+ *
+ * In flush mode, the ramback daemon stops loading but continues flushing.
+ * Application IO is forced through synchronously to the backing device. On
+ * exit from flush mode, loading resumes if it was incomplete. (Once fully
+ * populated, no chunk may become empty again.)
+ *
+ * Populate transfers may occur even in writethrough mode, just for partial
+ * chunks at the beginning and end of a writethrough region.
+ *
+ * Each daemon write in flush mode marks chunks clean before launching so
+ * these writes will never overwrite application writes, but each application
+ * write has to wait for any overlapping flush to complete before proceeding,
+ * to prevent the latter from stomping on top of the former. See nasty race.
+ */
+
+struct devinfo {
+ spinlock_t lock;
+ wait_queue_head_t fast_wait, slow_wait;
+ struct dm_dev *dm_ramdev, *dm_backdev;
+ struct block_device *ramdev, *backdev;
+ struct task_struct *fast_daemon, *slow_daemon;
+ struct list_head loading, storing, fast_submits, slow_submits, thru_queue;
+ struct hook *prehook;
+ chunk_t chunks, dirty, flushing, inflight, inflight_max;
+ unsigned flags, chunkshift, chunk_sector_shift, populated;
+ long long loaded, saved;
+ unsigned char *state;
+};
+
+/*
+ * Attach some working space to a bio, also remember how to chain back to
+ * original endio if endio had to be hooked for custom processing.
+ */
+struct hook {
+ void *old_private;
+ bio_end_io_t *old_endio;
+ struct list_head member, queue;
+ struct bio *bio, *cloned;
+ struct devinfo *info;
+ chunk_t start, limit;
+};
+
+/*
+ * Statistics
+ *
+ * Inflight and dirty counts need to be computed accurately because they
+ * control daemon wakeup. Dirty count must be accounted on transition between
+ * clean and dirty (in future, also empty to dirty). Needless to say, must
+ * only account under the lock.
+ */
+
+/* proc interface */
+
+static struct proc_dir_entry *ramback_proc_root;
+
+static int ramback_proc_show(struct seq_file *seq, void *offset)
+{
+ struct devinfo *info = seq->private;
+ seq_printf(seq, "flags: %i\n", info->flags);
+ seq_printf(seq, "trace: %i\n", !!(info->flags & TRACE_FLAG));
+ seq_printf(seq, "loading: %i\n", !info->populated);
+ seq_printf(seq, "storing: %i\n", !!(info->flags & FLUSH_FLAG));
+ seq_printf(seq, "chunks: %i\n", info->chunks);
+ seq_printf(seq, "loaded: %Li\n", info->loaded);
+ seq_printf(seq, "stored: %Li\n", info->saved);
+ seq_printf(seq, "dirty: %i\n", info->dirty);
+ seq_printf(seq, "inflight: %i\n", info->inflight);
+ return 0;
+}
+
+static ssize_t ramback_proc_write(struct file *file, const char __user *buf, size_t count, loff_t *offset)
+{
+ struct devinfo *info = PROC_I(file->f_dentry->d_inode)->pde->data;
+ char text[16], *end;
+ int n;
+ memset(text, 0, sizeof(text));
+ if (count >= sizeof(text))
+ return -EINVAL;
+ if (copy_from_user(text, buf, count))
+ return -EFAULT;
+ n = simple_strtoul(text, &end, 10);
+ if (end == text)
+ return -EINVAL;
+ if (n & 1) // factor me
+ info->flags |= FLUSH_FLAG;
+ else
+ info->flags &= ~FLUSH_FLAG;
+
+ if (n & 0x100) // factor me
+ info->flags |= TRACE_FLAG;
+ else
+ info->flags &= ~TRACE_FLAG;
+ return count;
+}
+
+static int ramback_proc_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, ramback_proc_show, PDE(inode)->data);
+}
+
+static struct file_operations ramback_proc_fops = {
+ .open = ramback_proc_open,
+ .read = seq_read,
+ .write = ramback_proc_write,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static void ramback_proc_create(struct dm_target *target, void *data)
+{
+ struct proc_dir_entry *proc;
+ char name[24];
+ snprintf(name, sizeof(name), "%p", target);
+ proc = create_proc_entry(name, 0, ramback_proc_root);
+ proc->owner = THIS_MODULE;
+ proc->data = data;
+ proc->proc_fops = &ramback_proc_fops;
+}
+
+static void ramback_proc_remove(struct dm_target *target)
+{
+ char name[24];
+ snprintf(name, sizeof(name), "%p", target);
+ remove_proc_entry(name, ramback_proc_root);
+}
+
+/* The driver */
+
+#if 0
+static void show_loading(struct devinfo *info)
+{
+ struct list_head *list;
+ int paranoid = 0;
+ printk("loading: ");
+ list_for_each(list, &info->loading) {
+ struct hook *hook = list_entry(list, struct hook, member);
+ printk("%p (%i..%i) ", hook->bio, hook->start, hook->limit - 1);
+ if (++paranoid > 10000) {
+ printk("list way too long\n");
+ break;
+ }
+ }
+ printk("\n");
+}
+#endif
+
+static void trace(struct devinfo *info, const char *fmt, ...)
+{
+ if (info->flags & TRACE_FLAG) {
+ va_list args;
+ va_start(args, fmt);
+ vprintk(fmt, args);
+ va_end(args);
+ }
+}
+
+static struct kmem_cache *ramback_hooks;
+
+static struct hook *alloc_hook(void)
+{
+ return kmem_cache_alloc(ramback_hooks, GFP_NOIO|__GFP_NOFAIL);
+}
+
+static void free_hook(struct hook *hook)
+{
+ kmem_cache_free(ramback_hooks, hook);
+}
+
+static void unhook_endio(struct bio *bio, unsigned done, int error)
+{
+ struct hook *hook = bio->bi_private;
+ bio->bi_end_io = hook->old_endio;
+ bio->bi_private = hook->old_private;
+ free_hook(hook);
+ bio_endio(bio, bio->bi_size, error);
+}
+
+static void free_bio_pages(struct bio *bio)
+{
+ int i;
+ for (i = 0; i < bio->bi_vcnt; i++) {
+ __free_pages(bio->bi_io_vec[i].bv_page, 0);
+ bio->bi_io_vec[i].bv_page = (void *)0xdeadbeef;
+ }
+}
+
+enum chunk_state { CHUNK_EMPTY, CHUNK_CLEAN, CHUNK_DIRTY };
+
+static unsigned get_chunk_state(struct devinfo *info, chunk_t chunk)
+{
+ unsigned shift = 2 * (chunk & 3), offset = chunk >> 2, mask = 3;
+ return (info->state[offset] >> shift) & mask;
+}
+
+static void set_chunk_state(struct devinfo *info, chunk_t chunk, int state)
+{
+ unsigned shift = 2 * (chunk & 3), offset = chunk >> 2, mask = 3;
+ info->state[offset] = (info->state[offset] & ~(mask << shift)) | (state << shift);
+}
+
+static int is_empty(struct devinfo *info, chunk_t chunk)
+{
+ return get_chunk_state(info, chunk) == CHUNK_EMPTY;
+}
+
+static int is_dirty(struct devinfo *info, chunk_t chunk)
+{
+ return get_chunk_state(info, chunk) == CHUNK_DIRTY;
+}
+
+static int is_loading(struct devinfo *info, chunk_t chunk)
+{
+ struct list_head *list;
+ list_for_each(list, &info->loading) {
+ struct hook *this = list_entry(list, struct hook, member);
+ if (chunk >= this->start && chunk < this->limit)
+ return 1;
+ }
+ return 0;
+}
+
+static int change_state(struct devinfo *info, chunk_t chunk, chunk_t limit, int state)
+{
+ int changed = 0;
+ for (; chunk < limit; chunk++) {
+ if (get_chunk_state(info, chunk) == state)
+ continue;
+ set_chunk_state(info, chunk, state);
+ changed++;
+ }
+ return changed;
+}
+
+static int set_clean_locked(struct devinfo *info, chunk_t chunk, chunk_t limit)
+{
+ return change_state(info, chunk, limit, CHUNK_CLEAN);
+}
+
+static int set_dirty_locked(struct devinfo *info, chunk_t chunk, chunk_t limit)
+{
+ return change_state(info, chunk, limit, CHUNK_DIRTY);
+}
+
+static void queue_to_ramdev(struct hook *hook)
+{
+ hook->bio->bi_bdev = hook->info->ramdev;
+ list_add_tail(&hook->queue, &hook->info->fast_submits);
+ wake_up(&hook->info->fast_wait);
+
+}
+
+static void queue_to_backdev(struct hook *hook)
+{
+ hook->bio->bi_bdev = hook->info->backdev;
+ list_add_tail(&hook->queue, &hook->info->slow_submits);
+ wake_up(&hook->info->slow_wait);
+
+}
+
+/*
+ * Nasty race: daemon populate must not overwrite an application write.
+ * Application read or write has to wait until all chunks it covers are populated.
+ */
+
+/*
+ * Nasty race: daemon writeback must not overwrite application writethrough.
+ * Also, we try to have application writes land on the backing dev in the same
+ * order they arrived on the ramdisk, so that when the daemon has fully synced
+ * up all writeback chunks, the backing store data is henceforth always a point
+ * in time version of the ramdisk. This is done crudely by simply serializing
+ * application writes when in writethrough mode. Very slow! However, bear in
+ * mind that the line power is off at this point so we care a whole lot more
+ * about data consistency than how fast an app can write. It should consider
+ * itself lucky to be allowed to write at all, its lights may go out soon.
+ *
+ * Write algorithm when in writethrough mode
+ *
+ * Kick writethrough if any:
+ * If writethrough queue empty, done
+ * If head of writethrough queue does not overlap any member of storing list
+ * remove head from writethrough queue, add to storing list and submit
+ * (else it will be submitted when save completes)
+ *
+ * On application write in writethrough mode: (after populating)
+ * Mark region clean
+ * Add to tail of writethrough queue
+ * Kick writethrough queue
+ *
+ * On save complete:
+ * Remove from storing list
+ * Kick writethrough
+ *
+ * On writethrough complete:
+ * as for save complete plus endio on original write
+ *
+ * On daemon finding a dirty chunk:
+ * Mark chunk clean and add chunk to storing list
+ * submit it
+ */
+
+/*
+ * If head of writethrough queue does not overlap any member of storing list
+ * then submit it.
+ */
+static void kick_thru(struct devinfo *info)
+{
+ struct hook *hook;
+ spin_lock(&info->lock);
+ if (!list_empty(&info->thru_queue)) {
+ struct list_head *list;
+ hook = list_entry(info->thru_queue.next, struct hook, queue);
+ list_for_each(list, &info->storing) {
+ struct hook *that = list_entry(list, struct hook, member);
+ if ((hook->start < that->limit && hook->limit >= that->start))
+ goto overlap;
+ }
+ trace(info, ">>> kick_thru bio %p: chunk %i..%i\n", hook->bio, hook->start, hook->limit - 1);
+ list_del(&hook->queue);
+ list_add_tail(&hook->member, &hook->info->storing);
+ queue_to_ramdev(hook);
+ }
+overlap:
+ spin_unlock(&info->lock);
+}
+
+/* Twisty little maze of io completions, all different */
+
+/*
+ * Completes a transfer to populate a chunk (later a range) of the
+ * ramdisk and relaunches any application IO that had to wait for empty
+ * chunks to be populated.
+ */
+static int load_write_endio(struct bio *bio, unsigned done, int error)
+{
+ struct hook *hook = bio->bi_private;
+ struct devinfo *info = hook->info;
+ struct list_head *list, *next;
+ spinlock_t *lock = &info->lock;
+ chunk_t chunk = hook->start;
+ trace(info, "load_write_endio on bio %p chunk %zi\n", bio, chunk);
+
+ spin_lock(lock);
+ info->loaded++;
+ info->inflight--;
+ if (info->inflight == info->inflight_max / 2)
+ wake_up(&info->slow_wait);
+ BUG_ON(!is_empty(info, chunk));
+ set_clean_locked(info, hook->start, hook->limit);
+ //hexdump(info->state, (info->chunks + 3) >> 2);
+ list_del(&hook->member);
+ put_page(bio->bi_io_vec[0].bv_page);
+ free_hook(bio->bi_private);
+ bio_put(bio);
+
+ //show_loading(info);
+ list_for_each_safe(list, next, &info->loading) {
+ struct hook *hook = list_entry(list, struct hook, member);
+ for (chunk = hook->start; chunk < hook->limit; chunk++)
+ if (is_empty(info, chunk))
+ goto keep;
+ trace(info, "unblock bio %p\n", hook->bio);
+ list_del(&hook->member);
+ queue_to_backdev(hook); // !!! _fast
+keep:
+ continue;
+ }
+ spin_unlock(lock);
+ return 0;
+}
+
+static int load_read_endio(struct bio *bio, unsigned done, int error)
+{
+ struct hook *hook = bio->bi_private;
+ struct devinfo *info = hook->info;
+ trace(info, "load_read_endio on bio %p chunk %zi\n", bio, hook->start);
+ bio->bi_rw |= WRITE; // !!! need a set_bio_dir(bio) function
+ bio->bi_end_io = load_write_endio;
+ bio->bi_sector = hook->start << info->chunk_sector_shift;
+ bio->bi_size = (hook->limit - hook->start) << info->chunkshift;
+ bio->bi_next = NULL;
+ bio->bi_idx = 0;
+ queue_to_ramdev(hook);
+ return 0;
+}
+
+static int save_write_endio(struct bio *bio, unsigned done, int error)
+{
+ struct hook *hook = bio->bi_private;
+ struct devinfo *info = hook->info;
+ trace(info, "save_write_endio on bio %p chunk %zi\n", bio, hook->start);
+ put_page(bio->bi_io_vec[0].bv_page); // !!! handle range
+ list_del(&hook->member);
+ free_hook(bio->bi_private);
+ bio_put(bio);
+ spin_lock(&info->lock);
+ info->saved++;
+ info->inflight--;
+ spin_unlock(&info->lock);
+ kick_thru(info);
+ return 0;
+}
+
+static int save_read_endio(struct bio *bio, unsigned done, int error)
+{
+ struct hook *hook = bio->bi_private;
+ struct devinfo *info = hook->info;
+ trace(info, "save_read_endio on bio %p chunk %zi\n", bio, hook->start);
+ bio->bi_rw |= WRITE; // !!! need a set_bio_dir(bio) function
+ bio->bi_end_io = save_write_endio;
+ bio->bi_sector = hook->start << info->chunk_sector_shift;
+ bio->bi_size = (hook->limit - hook->start) << info->chunkshift;
+ bio->bi_next = NULL;
+ bio->bi_idx = 0;
+ queue_to_backdev(hook);
+ return 0;
+}
+
+
+static int thru_write_endio(struct bio *bio, unsigned done, int error)
+{
+ struct hook *hook = bio->bi_private;
+ struct devinfo *info = hook->info;
+ struct bio *cloned = hook->cloned;
+ trace(info, "thru_write_endio on bio %p chunk %zi\n", bio, hook->start);
+ free_bio_pages(bio);
+ bio_put(bio);
+
+ list_del(&hook->member);
+ unhook_endio(cloned, done, error);
+ spin_lock(&info->lock);
+ info->inflight--;
+ spin_unlock(&info->lock);
+ kick_thru(info);
+ return 0;
+}
+
+static int thru_read_endio(struct bio *bio, unsigned done, int error)
+{
+ struct hook *hook = bio->bi_private;
+ struct devinfo *info = hook->info;
+ trace(info, "thru_read_endio on bio %p chunk %zi\n", bio, hook->start);
+ bio->bi_rw |= WRITE;
+ bio->bi_end_io = thru_write_endio;
+ bio->bi_sector = hook->start << info->chunk_sector_shift;
+ bio->bi_size = (hook->limit - hook->start) << info->chunkshift;
+ bio->bi_next = NULL;
+ bio->bi_idx = 0;
+ queue_to_backdev(hook);
+ return 0;
+}
+
+static int write_endio(struct bio *bio, unsigned done, int error)
+{
+ struct hook *hook = bio->bi_private;
+ struct devinfo *info = hook->info;
+ chunk_t dirtied;
+
+ trace(info, ">>> write_endio bio %p: chunk %i..%i\n", bio, hook->start, hook->limit - 1);
+ if ((info->flags & FLUSH_FLAG)) {
+ trace(info, ">>> write_endio writethrough\n");
+ bio->bi_end_io = thru_read_endio;
+ bio->bi_size = 0; /* will allocate pages in submit_list */
+ spin_lock(&info->lock);
+ info->dirty -= set_clean_locked(info, hook->start, hook->limit);
+ list_add_tail(&hook->queue, &info->thru_queue);
+ spin_unlock(&info->lock);
+ kick_thru(info);
+ return 0;
+ }
+ spin_lock(&info->lock); // irqsave!!!
+ info->dirty += dirtied = set_dirty_locked(info, hook->start, hook->limit);
+ spin_unlock(&info->lock);
+ //hexdump(info->state, (info->chunks + 3) >> 2);
+ unhook_endio(bio, done, error);
+ if (dirtied)
+ wake_up(&info->slow_wait);
+ return 0;
+}
+
+/* Daemons */
+
+/*
+ * Launch the read side of chunk transfer to or from backing store. The read
+ * endio resubmits the bio as a write to complete the transfer.
+ */
+static void transfer_chunk(struct hook *hook, struct block_device *dev, bio_end_io_t *endio)
+{
+ struct devinfo *info = hook->info;
+ struct page *page = alloc_pages(GFP_KERNEL|__GFP_NOFAIL, 0);
+ struct bio *bio = bio_alloc(GFP_KERNEL|__GFP_NOFAIL, 1);
+ trace(info, ">>> transfer chunk %i\n", hook->start);
+
+ spin_lock(&info->lock);
+ info->inflight++;
+ spin_unlock(&info->lock);
+ bio->bi_sector = hook->start << info->chunk_sector_shift;
+ bio->bi_size = 1 << info->chunkshift;
+ bio->bi_bdev = dev;
+ bio->bi_end_io = endio;
+ bio->bi_vcnt = 1; // !!! should support multiple pages
+ bio->bi_io_vec[0].bv_page = page;
+ bio->bi_io_vec[0].bv_len = PAGE_CACHE_SIZE;
+ bio->bi_private = hook;
+ hook->bio = bio;
+ generic_make_request(bio);
+}
+
+/*
+ * Do not know for sure whether an allocation has to be done to hook an IO
+ * until after taking spinlock but cannot do a slab allocation while
+ * holding a spinlock. Keep one item preallocated so it can be used
+ * while holding a spinlock.
+ */
+static void preallocate_hook_and_lock(struct devinfo *info)
+{
+ struct hook *hook;
+ spinlock_t *lock = &info->lock;
+
+ spin_lock(lock);
+ if (info->prehook)
+ return;
+
+ spin_unlock(lock);
+ hook = alloc_hook();
+ spin_lock(lock);
+ if (info->prehook)
+ kmem_cache_free(ramback_hooks, hook);
+ else
+ info->prehook = hook;
+}
+
+struct hook *consume_hook(struct devinfo *info)
+{
+ struct hook *hook = info->prehook;
+ info->prehook = NULL;
+ return hook;
+}
+
+/*
+ * Block device IO cannot be submitted from interrupt context in which
+ * endio functions normally run, so instead the endio pushes the bio onto
+ * a list to be submitted by a daemon. This submits and empties the list.
+ */
+static int submit_list(struct list_head *submits, spinlock_t *lock)
+{
+ //trace(info, "submit_list\n");
+ spin_lock(lock);
+ while (!list_empty(submits)) {
+ struct list_head *entry = submits->next;
+ struct hook *hook = list_entry(entry, struct hook, queue);
+ struct devinfo *info = hook->info;
+ trace(info, ">>> submit bio %p for chunk %zi, size = %i\n", hook->bio, hook->start, hook->limit - hook->start);
+ list_del(entry);
+ spin_unlock(lock);
+ if (!hook->bio)
+ return 1;
+ if (!hook->bio->bi_size) { /* allocate on behalf of endio */
+ int pages = hook->limit - hook->start, i; // !!! assumes chunk size = page size
+ struct bio *bio = hook->bio, *clone = bio_alloc(__GFP_NOFAIL, pages);
+ trace(info, ">>> clone and submit bio %p for chunk %zi, size = %i\n", hook->bio, hook->start, hook->limit - hook->start);
+ clone->bi_sector = hook->start << info->chunk_sector_shift;
+ clone->bi_size = (hook->limit - hook->start) << info->chunkshift;
+ clone->bi_end_io = bio->bi_end_io;
+ clone->bi_bdev = bio->bi_bdev;
+ clone->bi_vcnt = pages;
+ for (i = 0; i < pages; i++) {
+ struct page *page = alloc_pages(GFP_KERNEL|__GFP_NOFAIL, 0);
+ *(clone->bi_io_vec + i) = (struct bio_vec){ .bv_page = page, .bv_len = PAGE_CACHE_SIZE };
+ }
+ clone->bi_private = hook;
+ hook->cloned = bio;
+ hook->bio = clone;
+ }
+ generic_make_request(hook->bio);
+ spin_lock(lock);
+ }
+ spin_unlock(lock);
+ return 0;
+}
+
+/*
+ * Premature optimization: submissions to the backing device will back up and
+ * block, so there is one daemon that is supposed to avoid blocking by
+ * checking the block device congestion before submitting, and another that
+ * just blocks when the disk is busy, controlling the writeback rate.
+ */
+static int fast_daemon(void *data)
+{
+ struct devinfo *info = data;
+ struct list_head *submits = &info->fast_submits;
+ spinlock_t *lock = &info->lock;
+ trace(info, "fast daemon started\n");
+
+ while (1) {
+ if (submit_list(submits, lock))
+ return 0;
+ trace(info, "fast daemon sleeps\n");
+ wait_event(info->fast_wait, !list_empty(submits));
+ trace(info, "fast daemon wakes\n");
+// if (kthread_should_stop()) {
+// BUG_ON(!list_empty(submits));
+// return 0;
+// }
+ }
+}
+
+/*
+ * Responsible for initial ramdisk loading and dirty chunk writeback
+ */
+static int slow_daemon(void *data)
+{
+ struct devinfo *info = data;
+ struct list_head *submits = &info->slow_submits;
+ disable(trace_off(int die = 0);)
+ spinlock_t *lock = &info->lock;
+ chunk_t chunk = 0, chunks = info->chunks, i;
+ trace(info, "slow daemon started, %i chunks\n", chunks);
+ disable(msleep(1000);)
+
+ while (1) {
+ disable(BUG_ON(++die == 100);)
+ for (i = 0; i < chunks; i++, chunk++) {
+ if (chunk == chunks) {
+ info->populated = 1;
+ chunk = 0;
+ }
+ disable(show_loading(info);)
+ preallocate_hook_and_lock(info);
+ if (!list_empty(submits)) {
+ spin_unlock(lock);
+ if (submit_list(submits, lock))
+ return 0;
+ preallocate_hook_and_lock(info);
+ }
+ BUG_ON(info->flushing > info->dirty);
+ trace(info, "check chunk %i %i\n", chunk, is_dirty(info, chunk));
+ if (info->populated && !info->dirty) {
+ spin_unlock(lock);
+ break;
+ }
+ if (is_dirty(info, chunk)) {
+ struct hook *hook = consume_hook(info);
+ trace(info, ">>> dirty chunk %i\n", chunk);
+ BUG_ON(is_loading(info, chunk));
+ *hook = (struct hook){ .info = info, .start = chunk, .limit = chunk + 1 };
+ list_add_tail(&hook->member, &info->storing);
+ info->dirty -= set_clean_locked(info, hook->start, hook->limit);
+ spin_unlock(lock);
+ transfer_chunk(hook, info->ramdev, save_read_endio);
+ } else if (1 && !info->populated && is_empty(info, chunk) && !is_loading(info, chunk)) {
+ struct hook *hook = consume_hook(info);
+ trace(info, ">>> empty chunk %i\n", chunk);
+ *hook = (struct hook){ .info = info, .start = chunk, .limit = chunk + 1 };
+ list_add_tail(&hook->member, &info->loading);
+ spin_unlock(lock);
+ transfer_chunk(hook, info->backdev, load_read_endio);
+ }
+ if (info->inflight >= info->inflight_max) // !!! racy
+ wait_event(info->slow_wait, info->inflight <= info->inflight_max / 2);
+ }
+ trace(info, "slow daemon sleeps %i %i %i\n", info->inflight, info->dirty, !list_empty(submits));
+ wait_event(info->slow_wait, info->dirty || !list_empty(submits));
+ trace(info, "slow daemon wakes %i %i %i\n", info->inflight, info->dirty, !list_empty(submits));
+ BUG_ON(info->inflight > chunks);
+// if (kthread_should_stop()) {
+// BUG_ON(!list_empty(submits));
+// return 0;
+// }
+ }
+}
+
+/* Device mapper methods */
+
+/*
+ * Map a bio transfer to the ramdisk. All chunks covered by a read transfer
+ * or zero to two partial chunks covered by a write transfer may need to be
+ * populated before the transfer proceeds. If loading is needed then the
+ * bio goes on the loading list and will be submitted when all covered chunks
+ * have been populated. For now be lazy and populate the inner chunks of
+ * a write even though this is not required.
+ */
+static int ramback_map(struct dm_target *target, struct bio *bio, union map_info *context)
+{
+ struct devinfo *info = target->private;
+ unsigned shift = info->chunk_sector_shift;
+ unsigned sectors_per_chunk = 1 << shift;
+ unsigned sectors = bio->bi_size >> SECTOR_SHIFT;
+ chunk_t start = bio->bi_sector >> shift;
+ chunk_t limit = (bio->bi_sector + sectors + sectors_per_chunk - 1) >> shift;
+ chunk_t chunk;
+ struct hook *hook = NULL;
+ trace(info, ">>> ramback_map bio %p: %Li %i\n", bio, (long long)bio->bi_sector, sectors);
+ bio->bi_bdev = info->ramdev;
+ /*
+ * will need a hook for any write or sometimes for a read if still
+ * loading, but will not know for sure until after taking locks
+ */
+ if (bio_data_dir(bio) == WRITE) {
+ // !!! block until under dirty threshold
+ hook = alloc_hook();
+ *hook = (struct hook){
+ .old_endio = bio->bi_end_io,
+ .old_private = bio->bi_private,
+ .start = start, .limit = limit,
+ .info = info, .bio = bio };
+ bio->bi_end_io = write_endio;
+ bio->bi_private = hook;
+ }
+ if (info->populated)
+ goto simple;
+ preallocate_hook_and_lock(info);
+ for (chunk = start; chunk < limit; chunk++)
+ if (1 && is_empty(info, chunk) && !is_loading(info, chunk))
+ goto populate;
+ spin_unlock(&info->lock);
+ goto simple;
+populate:
+ if (!hook) { /* must be a read */
+ hook = consume_hook(info);
+ *hook = (struct hook){
+ .start = start, .limit = limit,
+ .info = info, .bio = bio };
+ }
+ list_add_tail(&hook->member, &info->loading);
+ spin_unlock(&info->lock);
+
+ for (chunk = start; chunk < limit; chunk++)
+ if (is_empty(info, chunk)) { // !!! locking ???
+ struct hook *hook = alloc_hook();
+ *hook = (struct hook){
+ .start = chunk, .limit = chunk + 1,
+ .info = info };
+ list_add_tail(&hook->member, &info->loading);
+ transfer_chunk(hook, info->backdev, load_read_endio);
+ }
+ // !!! verify at least one populate in flight so bio will progress
+ return 0;
+simple:
+ generic_make_request(bio);
+ return 0;
+}
+
+static void ramback_destroy(struct dm_target *target)
+{
+ struct devinfo *info = target->private;
+ if (!info)
+ return;
+ trace(info, ">>> destroy ramback\n");
+ ramback_proc_remove(target);
+ if (info->fast_daemon) {
+ queue_to_ramdev(&(struct hook){ .info = info });
+ kthread_stop(info->fast_daemon);
+ }
+ trace(info, "fast daemon stopped\n");
+ if (info->slow_daemon) {
+ queue_to_backdev(&(struct hook){ .info = info });
+ kthread_stop(info->slow_daemon);
+ }
+ trace(info, "slow daemon stopped\n");
+ if (info->ramdev)
+ dm_put_device(target, info->dm_ramdev);
+ if (info->backdev)
+ dm_put_device(target, info->dm_backdev);
+ if (info->state)
+ vfree(info->state);
+ kfree(info);
+}
+
+static int open_device(struct dm_target *target, char *name, struct dm_dev **result)
+{
+ int mode = dm_table_get_mode(target->table);
+ return dm_get_device(target, name, 0, 0, mode, result);
+}
+
+struct hash_cell {
+ struct list_head name_list;
+ struct list_head uuid_list;
+
+ char *name;
+ char *uuid;
+ struct mapped_device *md;
+ struct dm_table *new_map;
+};
+
+static int ramback_create(struct dm_target *target, unsigned argc, char **argv)
+{
+ struct devinfo *info;
+ int chunkshift = 12, chunk_sector_shift, statebytes, err;
+ chunk_t chunks;
+ char *error;
+
+ error = "ramback usage: ramdev backdev [chunkshift]";
+ err = -EINVAL;
+ if (argc < 2)
+ goto fail;
+
+ if (argc > 2)
+ chunkshift = simple_strtol(argv[2], NULL, 0);
+
+ chunk_sector_shift = chunkshift - SECTOR_SHIFT;
+ chunks = (target->len + (1 << chunk_sector_shift) - 1) >> chunk_sector_shift;
+ statebytes = (chunks + 3) >> 2;
+ err = -ENOMEM;
+ error = "get kernel memory failed";
+ if (!(info = kmalloc(sizeof(struct devinfo), GFP_KERNEL)))
+ goto fail;
+ *info = (struct devinfo){
+ .inflight_max = 50,
+ //.flags = FLUSH_FLAG, // test it
+ .chunkshift = chunkshift, .chunks = chunks,
+ .chunk_sector_shift = chunk_sector_shift,
+ .fast_submits = LIST_HEAD_INIT(info->fast_submits),
+ .slow_submits = LIST_HEAD_INIT(info->slow_submits),
+ .thru_queue = LIST_HEAD_INIT(info->thru_queue),
+ .loading = LIST_HEAD_INIT(info->loading),
+ .storing = LIST_HEAD_INIT(info->storing)};
+
+ init_waitqueue_head(&info->slow_wait);
+ init_waitqueue_head(&info->fast_wait);
+ target->private = info;
+
+ trace(info, ">>> create ramback, chunks = %i\n", chunks);
+ if (!(info->state = vmalloc(statebytes)))
+ goto fail;
+ memset(info->state, 0, statebytes);
+ error = "get ramdisk device failed";
+ if ((err = open_device(target, argv[0], &info->dm_ramdev)))
+ goto fail;
+ info->ramdev = info->dm_ramdev->bdev;
+
+ error = "get backing device failed";
+ if ((err = open_device(target, argv[1], &info->dm_backdev)))
+ goto fail;
+ info->backdev = info->dm_backdev->bdev;
+
+ error = "start daemon failed";
+ if (!(info->fast_daemon = kthread_run(fast_daemon, info, "ramback-fast")))
+ goto fail;
+
+ if (!(info->slow_daemon = kthread_run(slow_daemon, info, "ramback-slow")))
+ goto fail;
+
+ ramback_proc_create(target, target->private);
+ return 0;
+fail:
+ ramback_destroy(target);
+ target->error = error;
+ return err;
+}
+
+static int ramback_status(struct dm_target *target, status_type_t type, char *result, unsigned maxlen)
+{
+ switch (type) {
+ case STATUSTYPE_INFO:
+ snprintf(result, maxlen, "nothing here yet");
+ break;
+ case STATUSTYPE_TABLE:
+ snprintf(result, maxlen, "nothing here yet");
+ break;
+ }
+ return 0;
+}
+
+static struct target_type ramback = {
+ .name = "ramback",
+ .version = { 0, 0, 0 },
+ .module = THIS_MODULE,
+ .ctr = ramback_create,
+ .dtr = ramback_destroy,
+ .map = ramback_map,
+ .status = ramback_status,
+};
+
+static void ramback_error(char *action, int err)
+{
+ printk(KERN_ERR "ramback: %s failed (error %i)", action, err);
+}
+
+void ramback_exit(void)
+{
+ int err;
+ char *action;
+
+ action = "unregister target";
+ err = dm_unregister_target(&ramback);
+ if (ramback_hooks)
+ kmem_cache_destroy(ramback_hooks);
+ action = "remove proc entry";
+ remove_proc_entry("ramback", proc_root_driver);
+ if (!err)
+ return;
+ ramback_error(action, -err);
+}
+
+int __init ramback_init(void)
+{
+ int err = -ENOMEM;
+ char *action;
+
+ action = "create caches";
+ if (!(ramback_hooks = kmem_cache_create("ramback-hooks",
+ sizeof(struct hook), __alignof__(struct hook), 0, NULL)))
+ goto fail;
+ action = "register ramback driver";
+ if ((err = dm_register_target(&ramback)))
+ goto fail;
+ action = "create ramback proc entry";
+ if (!(ramback_proc_root = proc_mkdir("ramback", proc_root_driver)))
+ goto fail;
+ ramback_proc_root->owner = THIS_MODULE;
+ return 0;
+fail:
+ ramback_error(action, -err);
+ ramback_exit();
+ return err;
+}
+
+module_init(ramback_init);
+module_exit(ramback_exit);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Violin Memory Inc");
--- 2.6.23.12.base/drivers/md/Makefile 2008-02-14 02:01:41.000000000 -0800
+++ 2.6.23.12/drivers/md/Makefile 2008-03-04 21:32:09.000000000 -0800
@@ -39,6 +39,7 @@ obj-$(CONFIG_DM_MULTIPATH_RDAC) += dm-rd
obj-$(CONFIG_DM_SNAPSHOT) += dm-snapshot.o
obj-$(CONFIG_DM_MIRROR) += dm-mirror.o
obj-$(CONFIG_DM_ZERO) += dm-zero.o
+obj-y += dm-ramback.o
quiet_cmd_unroll = UNROLL $@
cmd_unroll = $(PERL) $(srctree)/$(src)/unroll.pl $(UNROLL) \
On Sun, 9 Mar 2008, Daniel Phillips wrote:
> Every little factor of 25 performance increase really helps.
>
> Ramback is a new virtual device with the ability to back a ramdisk
> by a real disk, obtaining the performance level of a ramdisk but with
> the data durability of a hard disk. To work this magic, ramback needs
> a little help from a UPS. In a typical test, ramback reduced a 25
> second file operation[1] to under one second including sync. Even
> greater gains are possible for seek-intensive applications.
>
> The difference between ramback and an ordinary ramdisk is: when the
> machine powers down the data does not vanish because it is continuously
> saved to backing store. When line power returns, the backing store
> repopulates the ramdisk while allowing application io to proceed
> concurrently. Once fully populated, a little green light winks on and
> file operations once again run at ramdisk speed.
>
> So now you can ask some hard questions: what if the power goes out
> completely or the host crashes or something else goes wrong while
> critical data is still in the ramdisk? Easy: use reliable components.
> Don't crash. Measure your UPS window. This is not much to ask in
> order to transform your mild mannered hard disk into a raging superdisk
> able to leap tall benchmarks at a single bound.
>
> If line power goes out while ramback is running, the UPS kicks in and a
> power management script switches the driver from writeback to
> writethrough mode. Ramback proceeds to save all remaining dirty data
> while forcing each new application write through to backing store
> immediately.
>
> If UPS power runs out while ramback still holds unflushed dirty data
> then things get ugly. Hopefully a fsck -f will be able to pull
> something useful out of the mess. (This is where you might want to be
> running Ext3.) The name of the game is to install sufficient UPS power
> to get your dirty ramdisk data onto stable storage this time, every
> time.
Are you using barriers or ordered disk writes with physical sync in the
right moments or something like that? I think this is needed to allow any
journaling filesystem to do its job right.
> The basic design premise of ramback is alluringly simple: each write to
> a ramdisk sets a per-chunk dirty bit. A kernel daemon continuously
> scans for and flushes dirty chunks to backing store.
Thanks,
GK
On Monday 10 March 2008 00:51, Grzegorz Kulewski wrote:
> > If UPS power runs out while ramback still holds unflushed dirty data
> > then things get ugly. Hopefully a fsck -f will be able to pull
> > something useful out of the mess. (This is where you might want to be
> > running Ext3.) The name of the game is to install sufficient UPS power
> > to get your dirty ramdisk data onto stable storage this time, every
> > time.
>
> Are you using barriers or ordered disk writes with physical sync in the
> right moments or something like that? I think this is needed to allow any
> journaling filesystem to do its job right.
Usual block device semantics are preserved so long as UPS power does
not run out before emergency writeback completes. It is not possible
to order writes to the backing store and still deliver ramdisk level
write latency to the application.
After the emergency writeback completes, ramback is supposed to
behave just like a physical disk (with respect to writes - reads will
still have ramdisk level latency). No special support is provided
for barriers. It is not clear that anything special is needed.
Daniel
> So now you can ask some hard questions: what if the power goes out
> completely or the host crashes or something else goes wrong while
> critical data is still in the ramdisk? Easy: use reliable components.
Nice fiction - stuff crashes eventually - not that this isn't useful. For
a long time simply loading a 2-3GB Ramdisk off hard disk has been a good
way to build things like compile engines where loss of state is not bad.
> If UPS power runs out while ramback still holds unflushed dirty data
> then things get ugly. Hopefully a fsck -f will be able to pull
> something useful out of the mess. (This is where you might want to be
> running Ext3.) The name of the game is to install sufficient UPS power
> to get your dirty ramdisk data onto stable storage this time, every
> time.
Ext3 is only going to help you if the ramdisk writeback respects barriers
and ordering rules ?
> * Previously saved data must be reloaded into the ramdisk on startup.
/bin/cp from initrd
> * Cannot transfer directly between ramdisk and backing store, so must
> first transfer into memory then relaunch to destination.
Why not - providing you clear the dirty bit before the write and you
check it again after ? And on the disk size as you are going to have to
suck all the content back in presumably a log structure is not a big
concern ?
> * Per chunk locking is not feasible for a terabyte scale ramdisk.
And we care 8) ?
> * Handle chunk size other than PAGE_SIZE.
If you are prepared to go bigger than the fs chunk size so lose the
ordering guarantees your chunk size really ought to be *big* IMHO
Alan
> Usual block device semantics are preserved so long as UPS power does
> not run out before emergency writeback completes. It is not possible
> to order writes to the backing store and still deliver ramdisk level
> write latency to the application.
Why - your chunks simply become a linked list in write barrier order.
Solve your bitmap sweep cost as well. As you are already making a copy
before going to backing store you don't have the internal consistency
problems of further writes during the I/O.
Yes you may need to throttle in the specific case of having too many
copies of pages sitting in the queue - but surely that would be the set of
pages that are written but not yet committed from a previous store
barrier ?
BTW: I'm also curious why you made it a block device. What does that
offer over say ramfs + dnotify and a userspace daemon or perhaps for big
files to work smoothly a ramfs variant that keeps dirty bitmaps on file
pages. That way write back would be file level and while you might lose
changesets that have not been fsync()'d your underlying disk fs would
always be coherent.
Alan
Daniel Phillips wrote:
> Every little factor of 25 performance increase really helps.
>
> Ramback is a new virtual device with the ability to back a ramdisk
> by a real disk, obtaining the performance level of a ramdisk but with
> the data durability of a hard disk. To work this magic, ramback needs
> a little help from a UPS.
So you apparently want three things:
a) ignoring fsync() and co on this device
b) disabling all write throttling on this device
c) never discarding cached data from this device
anything else i'm missing?
Alan already suggested the ramfs+writeback thread approach (possibly
with a little bit of help from the fs which could report just the dirty
regions), but i'm not sure even that is necessary.
(a) can be easily done (fixing the app, LD_PRELOAD or fs extension etc)
(b) couldn't the per-device write throttling be used to achieve this?
(c) shouldn't be impossible either, eg sticking PG_writeback comes to mind,
just the mm accounting needs to remain sane.
IOW can't this be done in a more generic way (and w/o a ramdisk in the
middle)?
> a little help from a UPS. In a typical test, ramback reduced a 25
> second file operation[1] to under one second including sync. Even
> greater gains are possible for seek-intensive applications.
apples to oranges. what are the numbers for a nonjournalled disk-backed
fs and _without_ the sync? (You're not committing to stable storage anyway
so the sync is useless and if you don't respect the ordering so is the
journal)
artur
Daniel Phillips wrote:
> Don't crash.
So that's what I've been doing wrong for all these years...
-- Chris
On Mon, 10 Mar 2008 09:22:13 +0000
Alan Cox <[email protected]> wrote:
> > So now you can ask some hard questions: what if the power goes out
> > completely or the host crashes or something else goes wrong while
> > critical data is still in the ramdisk? Easy: use reliable components.
>
> Nice fiction - stuff crashes eventually - not that this isn't useful. For
> a long time simply loading a 2-3GB Ramdisk off hard disk has been a good
> way to build things like compile engines where loss of state is not bad.
>
> > If UPS power runs out while ramback still holds unflushed dirty data
> > then things get ugly. Hopefully a fsck -f will be able to pull
> > something useful out of the mess. (This is where you might want to be
> > running Ext3.) The name of the game is to install sufficient UPS power
> > to get your dirty ramdisk data onto stable storage this time, every
> > time.
>
> Ext3 is only going to help you if the ramdisk writeback respects barriers
> and ordering rules ?
That could get ugly when ext3 has written to the same block multiple
times. To get some level of consistency, ramback would need to keep
around the different versions and flush them in order.
--
All rights reversed.
Hi Alan,
Nice to see so many redhatters taking an avid interest in storage :-)
On Monday 10 March 2008 02:22, Alan Cox wrote:
> > So now you can ask some hard questions: what if the power goes out
> > completely or the host crashes or something else goes wrong while
> > critical data is still in the ramdisk? Easy: use reliable components.
>
> Nice fiction - stuff crashes eventually - not that this isn't useful. For
> a long time simply loading a 2-3GB Ramdisk off hard disk has been a good
> way to build things like compile engines where loss of state is not bad.
Right, and now with ramback you will be able to preserve that state and
have the performance too. It is a wonderful world.
> > If UPS power runs out while ramback still holds unflushed dirty data
> > then things get ugly. Hopefully a fsck -f will be able to pull
> > something useful out of the mess. (This is where you might want to be
> > running Ext3.) The name of the game is to install sufficient UPS power
> > to get your dirty ramdisk data onto stable storage this time, every
> > time.
>
> Ext3 is only going to help you if the ramdisk writeback respects barriers
> and ordering rules ?
I was alluding to to e2fsck's amazing repair ability, not ext3's journal.
> > * Previously saved data must be reloaded into the ramdisk on startup.
>
> /bin/cp from initrd
But that does not satisfy the requirement you snipped:
* Applications need to be able to read and write ramback data during
initial loading.
> > * Cannot transfer directly between ramdisk and backing store, so must
> > first transfer into memory then relaunch to destination.
>
> Why not - providing you clear the dirty bit before the write and you
> check it again after ? And on the disk size as you are going to have to
More accurately: in general, cannot transfer directly. The ramdisk may
be external and not present a memory interface. Even an external
ramdisk with a memory interface (the Violin box has this) would require
extra programming to maintain cache consistency. Then there is the
issue of ramdisks on the way that exceed the 40 bit physical addressing
of current generation processors.
Even for the simple case where the ramdisk is just part of the kernel
unified cache, I would rather not go delving into that code when these
transfers are on the slow path anyway. Application IO does its normal
single copy_to/from_user thing. If somebody wants to fiddle with vm,
the place to attack is right there. The copy_to/from_user can be
eliminated (provided alignment requirements are met) using stupid page
table tricks. In spite of Linus claiming there is no performance win
to be had, I would like to see that put to the test.
> suck all the content back in presumably a log structure is not a big
> concern ?
Sorry, I failed to parse that.
> > * Per chunk locking is not feasible for a terabyte scale ramdisk.
>
> And we care 8) ?
"640K should be enough for anyone"
http://www.violin-memory.com/products/violin1010.html <- 504 GB ramdisk
> > * Handle chunk size other than PAGE_SIZE.
>
> If you are prepared to go bigger than the fs chunk size so lose the
> ordering guarantees your chunk size really ought to be *big* IMHO
The finer the granularity the faster the ramdisk syncs to backing
store. The only attraction of coarse granularity I know of is
shrinking the bitmap, which is currently not so big that it presents
a problem.
Your comment re fs chunk size reveals that I have failed to
communicate the most basic principle of the ramback design: the
backing store is not expected to represent a consistent filesystem
state during normal operation. Only the ramdisk needs to maintain a
consistent state, which I have taken care to ensure. You just need
to believe in your battery, Linux and the hardware it runs on. Which
of these do you mistrust?
Regards,
Daniel
On Monday 10 March 2008 02:37, Alan Cox wrote:
> > Usual block device semantics are preserved so long as UPS power does
> > not run out before emergency writeback completes. It is not possible
> > to order writes to the backing store and still deliver ramdisk level
> > write latency to the application.
>
> Why - your chunks simply become a linked list in write barrier order.
Good idea, it would be nice to offer that operating mode. But linear
sweeping is going to put the most data onto rotating media the fastest,
thus making the loss-of-line-power flush window as small as possible,
which is what the current incarnation of this driver optimizes for.
Note that half a TB worth of dirty ramdisk chunks will need 1 GB of
linked list storage, so this imposes a limit on total dirty data for
practical purposes.
> Solve your bitmap sweep cost as well.
What happens when the linked list gets too long? Fall back to linear
sweep? Or accept suboptimal write caching?
A linked list would work for linking together dirty bitmap pages, one
level up, thus 2**15 rarer. Even there I prefer the linear sweep. I
intend to implement a dirty map of the dirty map, at least because I
have not seen one of those before, but also because I think it will
perform well.
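Purely as an illustration of what such a two-level map might look like
(this is not code from the posted driver, and it uses one dirty bit per
chunk for simplicity where the driver actually keeps two state bits):
one summary bit per 4K page of the fine bitmap lets the sweep skip 2**15
chunks with a single test whenever nothing under that summary bit is
dirty. Locking against concurrent writers is elided.

#include <stdint.h>

#define CHUNKS_PER_PAGE (4096 * 8)	/* one fine-bitmap page covers 2**15 chunks */

struct dirtymap {
	uint64_t chunks;
	unsigned char *fine;		/* one bit per chunk */
	unsigned char *summary;		/* one bit per fine-bitmap page */
};

void mark_dirty(struct dirtymap *map, uint64_t chunk)
{
	uint64_t page = chunk / CHUNKS_PER_PAGE;
	map->fine[chunk >> 3] |= 1 << (chunk & 7);
	map->summary[page >> 3] |= 1 << (page & 7);
}

/* Sweep only the regions whose summary bit says something might be dirty. */
void sweep(struct dirtymap *map, void (*flush)(uint64_t chunk))
{
	uint64_t pages = (map->chunks + CHUNKS_PER_PAGE - 1) / CHUNKS_PER_PAGE;
	for (uint64_t page = 0; page < pages; page++) {
		if (!(map->summary[page >> 3] & (1 << (page & 7))))
			continue;	/* skip 2**15 chunks in one test */
		map->summary[page >> 3] &= ~(1 << (page & 7));
		uint64_t first = page * CHUNKS_PER_PAGE;
		uint64_t limit = first + CHUNKS_PER_PAGE;
		if (limit > map->chunks)
			limit = map->chunks;
		for (uint64_t chunk = first; chunk < limit; chunk++)
			if (map->fine[chunk >> 3] & (1 << (chunk & 7))) {
				map->fine[chunk >> 3] &= ~(1 << (chunk & 7));
				flush(chunk);
			}
	}
}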
> As you are already making a copy
> before going to backing store you don't have the internal consistency
> problems of further writes during the I/O.
Indeed. That is the entire reason I did it that way. In fact ramback
used to write the ramdisk and backing store from the same application
source, so the writethrough code was significantly shorter. But not
correct...
> Yes you may need to throttle in the specific case of having too many
> copies of pages sitting in the queue - but surely that would be the set of
> pages that are written but not yet committed from a previous store
> barrier ?
The only time ramback cares about barriers is when it switches to
writethrough mode. It would be nice to have a mode where barriers are
respected at the backing store level, but there is no way you will get
the same write performance. The central idea here is that ramback
relies on a UPS to achieve the ultimate in disk performance. I agree
that other modes would be very nice, but not necessary for this thing
to be actually useful. I suspect that early users will be looking for
all the performance they can get.
> BTW: I'm also curious why you made it a block device. What does that
> offer over say ramfs + dnotify and a userspace daemon or perhaps for big
> files to work smoothly a ramfs variant that keeps dirty bitmaps on file
> pages.
As a block device it is very flexible, and as a block device it is
fairly simple. As a block device, the only interesting userspace setup
is the hookup to power management scripts. Dnotify... probably you
meant inotify, and even then it sounds daunting, but maybe somebody
can prove me wrong.
> That way write back would be file level and while you might lose
> changesets that have not been fsync()'d your underlying disk fs would
> always be coherent.
That would be a nice hack, why not take a run at it?
Regards,
Daniel
On Monday 10 March 2008 12:01, Rik van Riel wrote:
> > Ext3 is only going to help you if the ramdisk writeback respects barriers
> > and ordering rules ?
>
> That could get ugly when ext3 has written to the same block multiple
> times. To get some level of consistency, ramback would need to keep
> around the different versions and flush them in order.
Ah, keep snapshots like ddsnap? Interesting idea. But complex, and
ramback will stay perfectly consistent so long as you don't pull the
plug on your UPS. I seem to recall that EMC has been peddling SAN
storage with similar restrictions for quite some time now.
Regards,
Daniel
On Sun, Mar 09, 2008 at 10:46:16PM -0800, Daniel Phillips wrote:
> Set ramback to flush mode:
>
> echo 1 >/proc/driver/ramback/<devname>
/proc is so 1990's. As your code has nothing to do with processes,
please don't add new files in /proc/. sysfs is there for you to do
stuff like this.
> Show ramback status:
>
> cat /proc/driver/ramback/<devname>
>
> Progress monitor:
>
> watch -n1 cat /proc/driver/ramback/<devname>
Use debugfs for stuff like debug info like this.
thanks,
greg k-h
On Monday 10 March 2008 22:06, Greg KH wrote:
> On Sun, Mar 09, 2008 at 10:46:16PM -0800, Daniel Phillips wrote:
> > Set ramback to flush mode:
> >
> > echo 1 >/proc/driver/ramback/<devname>
>
> /proc is so 1990's. As your code has nothing to do with processes,
> please don't add new files in /proc/. sysfs is there for you to do
> stuff like this.
Demonstrate some advantage and I will think about it.
Daniel
On Mon, 10 Mar 2008, Daniel Phillips wrote:
> On Monday 10 March 2008 22:06, Greg KH wrote:
>> On Sun, Mar 09, 2008 at 10:46:16PM -0800, Daniel Phillips wrote:
>>> Set ramback to flush mode:
>>>
>>> echo 1 >/proc/driver/ramback/<devname>
>>
>> /proc is so 1990's. As your code has nothing to do with processes,
>> please don't add new files in /proc/. sysfs is there for you to do
>> stuff like this.
>
> Demonstrate some advantage and I will think about it.
use of /proc is discouraged, if you insist on sticking with it in the face
of opposition you will seriously hurt the chance of your patches being
accepted.
David Lang
On Mon, Mar 10, 2008 at 09:22:14PM -0800, Daniel Phillips wrote:
> On Monday 10 March 2008 22:06, Greg KH wrote:
> > On Sun, Mar 09, 2008 at 10:46:16PM -0800, Daniel Phillips wrote:
> > > Set ramback to flush mode:
> > >
> > > echo 1 >/proc/driver/ramback/<devname>
> >
> > /proc is so 1990's. As your code has nothing to do with processes,
> > please don't add new files in /proc/. sysfs is there for you to do
> > stuff like this.
>
> Demonstrate some advantage and I will think about it.
Again, as your code has nothing to do with "processes", please do not
add new files to /proc.
As you are a filesystem, why not /sys/fs/ ?
It ends up with smaller code than procfs stuff as well, a good and nice
advantage.
thanks,
greg k-h
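For comparison, a sketch of what the flush-mode knob could look like as
a sysfs device attribute rather than a /proc file. It assumes the driver
exposes a struct device per backed ramdisk; the flush_mode variable and
the attribute name are invented, and the store hook is only a stub.

/* Sketch only: the writeback/writethrough toggle as a sysfs attribute. */
#include <linux/device.h>
#include <linux/kernel.h>

static int flush_mode;		/* 0 = writeback, 1 = flush/writethrough */

static ssize_t flush_show(struct device *dev, struct device_attribute *attr,
			  char *buf)
{
	return sprintf(buf, "%d\n", flush_mode);
}

static ssize_t flush_store(struct device *dev, struct device_attribute *attr,
			   const char *buf, size_t count)
{
	flush_mode = (buf[0] == '1');
	/* here the real driver would wake its flush daemon */
	return count;
}

static DEVICE_ATTR(flush, 0644, flush_show, flush_store);

/* during device setup: device_create_file(dev, &dev_attr_flush); */

The power management script would then write to the flush attribute
wherever the device lands in sysfs, instead of to
/proc/driver/ramback/<devname>.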
On 2008-03-10T09:37:37, Alan Cox <[email protected]> wrote:
> Why - your chunks simply become a linked list in write barrier order.
> Solve your bitmap sweep cost as well. As you are already making a copy
> before going to backing store you don't have the internal consistency
> problems of further writes during the I/O.
You get duplicated blocks though. But yes, I agree - write-backs to the
disk must be ordered, otherwise it's going to be too unreliable in practice.
An in-memory buffer for a log-structured block device.
> Yes you may need to throttle in the specific case of having too many
> copies of pages sitting in the queue - but surely that would be the set of
> pages that are written but not yet committed from a previous store
> barrier ?
You could switch from a journal like the above to a bitmap when this
overrun occurs. (Typical problem in replication.) SteelEye holds a
patent on that though, as far as I know.
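To make the journal-or-bitmap idea concrete, here is a purely
illustrative sketch in plain C with invented sizes, not ramback code:
dirty chunks are remembered in write order until the log overflows,
after which ordering is abandoned and the flush daemon falls back to a
bitmap sweep. Handling of a chunk rewritten after a barrier, and
clearing of bits after a successful flush, are deliberately left out.

#include <stdbool.h>
#include <stdio.h>

#define CHUNKS    (1u << 20)	/* chunks in the ramdisk (made-up figure) */
#define LOG_LIMIT (1u << 16)	/* ordered entries before falling back */

static struct {
	unsigned log[LOG_LIMIT];	/* dirty chunk numbers, oldest first */
	unsigned head, tail;		/* flush position, append position */
	bool overflowed;		/* ordering lost: bitmap mode */
	unsigned sweep;			/* bitmap sweep cursor */
	unsigned char bitmap[CHUNKS / 8];
} dirty;

static void mark_dirty(unsigned chunk)
{
	if (dirty.bitmap[chunk >> 3] & (1 << (chunk & 7)))
		return;				/* already recorded */
	dirty.bitmap[chunk >> 3] |= 1 << (chunk & 7);
	if (dirty.overflowed)
		return;
	if (dirty.tail < LOG_LIMIT)
		dirty.log[dirty.tail++] = chunk;
	else
		dirty.overflowed = true;	/* too much dirty data */
}

/* Flush daemon: honor write order while we still have it, else sweep. */
static bool next_chunk_to_flush(unsigned *chunk)
{
	if (!dirty.overflowed) {
		if (dirty.head == dirty.tail)
			return false;		/* nothing dirty */
		*chunk = dirty.log[dirty.head++];
		return true;
	}
	while (dirty.sweep < CHUNKS) {
		unsigned c = dirty.sweep++;
		if (dirty.bitmap[c >> 3] & (1 << (c & 7))) {
			*chunk = c;
			return true;
		}
	}
	return false;
}

int main(void)
{
	unsigned chunk;

	mark_dirty(42);
	mark_dirty(7);
	mark_dirty(42);			/* rewrite of an already dirty chunk */
	while (next_chunk_to_flush(&chunk))
		printf("flush chunk %u\n", chunk);
	return 0;
}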
Regards,
Lars
--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
Hi Lars,
On Monday 10 March 2008 14:03, Lars Marowsky-Bree wrote:
> On 2008-03-10T09:37:37, Alan Cox <[email protected]> wrote:
> > Why - your chunks simply become a linked list in write barrier order.
> > Solve your bitmap sweep cost as well. As you are already making a copy
> > before going to backing store you don't have the internal consistency
> > problems of further writes during the I/O.
>
> You get duplicated blocks though. But yes, I agree - write-backs to the
> disk must be ordered, otherwise it's going to be too unreliable in practice.
I disagree with your claim of "too unreliable". If the UPS power does
not fail before flushing completes, it is perfectly reliable. Perhaps
you need a belt to go with your suspenders?
As I wrote earlier, you cannot have optimal writeback speed and ordering
at the same time. I can see eventually implementing some kind of ordered
writeback mode where completion is signalled to the application before
writeback completes. You then get to choose between fastest flush and
most paranoid ordering. I guess everybody will choose fastest flush,
but I will be happy to accept your patch to see which they actually
choose.
> > Yes you may need to throttle in the specific case of having too many
> > copies of pages sitting in the queue - but surely that would be the set of
> > pages that are written but not yet committed from a previous store
> > barrier ?
>
> You could switch from a journal like the above to a bitmap when this
> overrun occurs. (Typical problem in replication.) SteelEye holds a
> patent on that though, as far as I know.
If you think this is like replication then you have the wrong idea
about what is going on. This is a cache consistency algorithm, not
a replication algorithm.
Regards,
Daniel
On 2008-03-11T03:14:40, Daniel Phillips <[email protected]> wrote:
> > You get duplicated blocks though. But yes, I agree - write-backs to the
> > disk must be ordered, otherwise it's going to be too unreliable in practice.
> I disagree with your claim of "too unreliable". If the UPS power does
> not fail before flushing completes, it is perfectly reliable. Perhaps
> you need a belt to go with your suspenders?
Daniel, I'm not saying you don't have a good thing here. Just that for
backing large filesystems, the risk of having to run a full fsck and
finding inconsistent metadata is pretty serious.
If I always assume a reliable shutdown - UPS protected, no crashes, etc
- you're right, but at least my real world has other failure scenarios
as well. In fact, the most common cause of disorderly shutdowns is
kernel crashes, not power failures, in my experience.
So "perfectly reliable if UPS power does not fail" seems a bit over the
top.
> As I wrote earlier, you cannot have optimal writeback speed and ordering
> at the same time.
No disagreement here. The question would be how large the performance
difference is.
> I can see eventually implementing some kind of ordered
> writeback mode where completion is signalled to the application before
> writeback completes. You then get to choose between fastest flush and
> most paranoid ordering. I guess everybody will choose fastest flush,
> but I will be happy to accept your patch to see which they actually
> choose.
I was trying to prod you into writing the ordered flushing. Maybe
claiming it is too hard will do the trick? ;-)
Seriously, I guess it depends on the workload you want to host.
> > > Yes you may need to throttle in the specific case of having too many
> > > copies of pages sitting in the queue - but surely that would be the set of
> > > pages that are written but not yet committed from a previous store
> > > barrier ?
> > You could switch from a journal like the above to a bitmap when this
> > overrun occurs. (Typical problem in replication.) SteelEye holds a
> > patent on that though, as far as I know.
> If you think this is like replication then you have the wrong idea
> about what is going on. This is a cache consistency algorithm, not
> a replication algorithm.
I see the differences, but I also see the similarities. What you're
doing can also be thought of as replicating from an instant IO store
(local memory) to a high latency, low bandwidth copy (the disk)
asynchronously.
Both obviously need to preserve consistency, the question is whether to
achieve transactional (ordered) consistency or not.
Regards,
Lars
--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
On Tuesday 11 March 2008 04:23, Lars Marowsky-Bree wrote:
> If I always assume a reliable shutdown - UPS protected, no crashes, etc
> - you're right, but at least my real world has other failure scenarios
> as well. In fact, the most common cause of disorderly shutdowns is
> kernel crashes, not power failures, in my experience.
What are you doing to your kernel?
My desktop at home, which runs my MTA, web site etc, and is subject to
regular oom abuse by firefox:
$ uptime
04:36:49 up 204 days, 9:38, 8 users, load average: 0.82, 0.43, 0.20
This machine has been up ever since I got a UPS for it; before that it
only ever went down due to a blackout or my wife blowing a fuse with
the vacuum cleaner. Honestly, I have never seen a machine running
Linux 2.6 crash due to a software flaw, except when I caused it
myself. I suspect the Linux kernel has a better MTBF than a hard
disk.
> So "perfectly reliable if UPS power does not fail" seems a bit over the
> top.
It works for EMC :-)
> I was trying to prod you into writing the ordered flushing. Maybe
> claiming it is too hard will do the trick? ;-)
>
> Seriously, I guess it depends on the workload you want to host.
I agree this would be a nice option to have.
> > If you think this is like replication then you have the wrong idea
> > about what is going on. This is a cache consistency algorithm, not
> > a replication algorithm.
>
> I see the differences, but I also see the similarities. What you're
> doing can also be thought of as replicating from an instant IO store
> (local memory) to a high latency, low bandwidth copy (the disk)
> asynchronously.
>
> Both obviously need to preserve consistency, the question is whether to
> achieve transactional (ordered) consistency or not.
In fact, replicating was one of the strategies I considered for this.
But since it is a lot more work and will not perform as well as a
simple sweep, I opted for the simple thing. Which turned out to
be pretty complex anyway. You have to close all the same nasty
races but with a considerably more complex base algorithm. I think
that better wait for version 2.0.
By the way, I could use a hand debugging this thing.
Regards,
Daniel
Daniel Phillips wrote:
>
> Right, and now with ramback you will be able to preserve that state and
> have the performance too. It is a wonderful world.
>
>>> If UPS power runs out while ramback still holds unflushed dirty data
>>> then things get ugly. Hopefully a fsck -f will be able to pull
>>> something useful out of the mess. (This is where you might want to be
>>> running Ext3.) The name of the game is to install sufficient UPS power
>>> to get your dirty ramdisk data onto stable storage this time, every
>>> time.
>> Ext3 is only going to help you if the ramdisk writeback respects barriers
>> and ordering rules ?
> extra programming to maintain cache consistency. Then there is the
> issue of ramdisks on the way that exceed the 40 bit physical addressing
> of current generation processors.
>>> * Per chunk locking is not feasible for a terabyte scale ramdisk.
> backing store is not expected to represent a consistent filesystem
> state during normal operation. Only the ramdisk needs to maintain a
> consistent state, which I have taken care to ensure. You just need
> to believe in your battery, Linux and the hardware it runs on. Which
> of these do you mistrust?
Expecting the hw to never fail is unreasonable - it will. It's just a
question of what happens when (not if) it fails.
and it's not about the backing store being inconsistent during normal
operation - it's about what you are left with after an unclean shutdown.
With your scheme the only time you can trust the on-disk data is when
the device is off; when it fails for some reason (batteries do fail,
kernel bugs do happen, DOS, overheating etc etc) you can no longer
trust any of the data, and no - fsck doesn't help when you have a mix
of old data overwritten by new stuff in basically random order. I can't
see any scenario where it would make sense to trust the corrupted on-disk
fs instead of restoring from backup (or regenerating). So is it just
about avoiding repopulating the fs in the (likely) case of normal,
clean shutdown? This could be a reasonable application of ramback (OTOH
how often will this (shutdown) happen in practice...). IOW you get
a ramdisk-based (ie fast) device that is capable of surviving power loss,
but that's about it.
Now, if you add snapshots to the backing store it suddenly becomes much
more interesting -- you no longer need to put so much trust in all the
hw. Should the device fail for whatever reason then you just rollback to
the last good snapshot upon restart. No corrupted fs, no fsck; you lose
some newly written data (that you couldn't recover w/o a snapshot anyway),
but can trust the rest of it (assuming you trust the fs and storage hw,
but that's no different than w/o ramback).
artur
Artur Skawina wrote:
> Now, if you add snapshots to the backing store it suddenly becomes much
> more interesting -- you no longer need to put so much trust in all the
> hw. Should the device fail for whatever reason then you just rollback to
> the last good snapshot upon restart. No corrupted fs, no fsck; you lose
> some newly written data (that you couldn't recover w/o a snapshot anyway),
> but can trust the rest of it (assuming you trust the fs and storage hw,
> but that's no different than w/o ramback).
Or you could keep two devices as backing store, use one and switch to the
other when the fs is consistent. This could be as simple as noticing zero
dirty data in the ramdisk or, if something is constantly writing to it,
reacting periodically to some barrier (needs cow/doublebuffering in order
to not throttle the writer, but you already do this). It means the ramdisk
can be at most 1/2 the size of the stable storage and costs a bit more i/o
(resyncing after the switch to the other device), but it gives you two
copies of the data: one stable and one that can be used to recover newer
data should you need to.
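A small sketch of the two-device idea just described, purely
illustrative (the device names and the dirty-count hook are invented):
flushing targets one device at a time and only flips at a point where
the ramdisk holds no dirty data, so the other device always holds a
complete, consistent image.

#include <stdio.h>

enum { BACKING_A, BACKING_B };

static int active = BACKING_A;		/* device currently receiving flushes */
static int last_good = BACKING_B;	/* device holding a consistent image */

/* Would be called by the flush daemon whenever it checks the dirty count. */
static void maybe_switch_backing(unsigned dirty_chunks)
{
	if (dirty_chunks)
		return;			/* active copy not yet consistent */
	/* The active device now mirrors the ramdisk exactly, so it becomes
	 * the known-good copy; future flushes go to the other device, which
	 * first needs a resync before its dirty tracking starts clean. */
	last_good = active;
	active = (active == BACKING_A) ? BACKING_B : BACKING_A;
}

int main(void)
{
	maybe_switch_backing(12345);	/* still dirty: no switch */
	printf("active=%d last_good=%d\n", active, last_good);
	maybe_switch_backing(0);	/* clean point reached: flip devices */
	printf("active=%d last_good=%d\n", active, last_good);
	return 0;
}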
artur
Daniel Phillips wrote:
> On Tuesday 11 March 2008 04:23, Lars Marowsky-Bree wrote:
>
>>If I always assume a reliable shutdown - UPS protected, no crashes, etc
>>- you're right, but at least my real world has other failure scenarios
>>as well. In fact, the most common cause of disorderly shutdowns is
>>kernel crashes, not power failures, in my experience.
>
>
> What are you doing to your kernel?
<snip>
> Honestly, I have never seen a machine running
> Linux 2.6 crash due to a software flaw, except when I caused it
> myself. I suspect the Linux kernel has a better MTBF than a hard
> disk.
I have experienced many 2.6 crashes due to software flaws. Hung
processes leading to watchdog timeouts, bad kernel pointers, kernel
deadlock, etc.
When designing for reliable embedded systems it's not enough to handwave
away the possibility of software flaws.
Chris
On Tuesday 11 March 2008 10:26, Chris Friesen wrote:
> I have experienced many 2.6 crashes due to software flaws. Hung
> processes leading to watchdog timeouts, bad kernel pointers, kernel
> deadlock, etc.
>
> When designing for reliable embedded systems it's not enough to handwave
> away the possibility of software flaws.
Indeed. You fix them. Which 2.6 kernel version failed for you, what
made it fail, and does the latest version still fail?
If Linux is not reliable then we are doomed and I do not care about
whether my ramback fails because I will just slit my wrists anyway.
How about you?
Daniel
On Tue, Mar 11, 2008 at 11:56:50AM -0800, Daniel Phillips wrote:
> On Tuesday 11 March 2008 10:26, Chris Friesen wrote:
> > I have experienced many 2.6 crashes due to software flaws. Hung
> > processes leading to watchdog timeouts, bad kernel pointers, kernel
> > deadlock, etc.
> >
> > When designing for reliable embedded systems it's not enough to handwave
> > away the possibility of software flaws.
>
> Indeed. You fix them. Which 2.6 kernel version failed for you, what
> made it fail, and does the latest version still fail?
Daniel, you're not objective. Simply look at LKML reports. People even
report failures with the stable branch, and that's quite expected when
more than 5000 patches are merged in two weeks. The question could even
be returned to you: what kernel are you using to keep 204 days of uptime
doing all that you describe? Maybe 2.6.16.x would be fine, but even then,
a lot of security issues have been fixed over the last 204 days (most of
which would require a planned reboot), and also a number of normal bugs,
some of which may cause unexpected downtime.
> If Linux is not reliable then we are doomed and I do not care about
> whether my ramback fails because I will just slit my wrists anyway.
No, I've looked at the violin-memory appliance, and I understand better
your goal. Trying to get a secure and reliable generic kernel is a lost
game. However, building a very specific kernel for an appliance is often
quite achievable, because you know exactly the hardware, usage patterns,
etc... which help you stabilize it. I think people who don't agree with
you are simply thinking about a generic file server. While I would not
like my company's NFS exports to rely on such a technology for the same
concerns as exposed here, I would love to have such a beast for logs
analysis, build farms, or various computations which require more than
an average PC's RAM, and would benefit from the data to remain consistent
across planned downtime.
If you consider that the risk of a crash is 1/year and that you have to
work one day to rebuild everything in case of a crash, it is certainly
worth using this technology for many things. But if you consider that
your data cannot suffer a loss even at a rate of 1/year, then you have
to use something else.
BTW, I would say that IMHO nothing here makes RAID impossible to use :-)
Just wire 2 of these beasts to a central server with 10 Gbps NICs and
you have a nice server :-)
Regards,
Willy
On 2008-03-11T03:50:18, Daniel Phillips <[email protected]> wrote:
> > as well. In fact, the most common cause of disorderly shutdowns is
> > kernel crashes, not power failures, in my experience.
> What are you doing to your kernel?
I guess I'm being really vicious to them: I expose them to customers and
the real world.
My own servers also have uptimes of >400 days sometimes, and I wonder
what customers do to the poor things.
And yes, I'm not saying I don't see your point for specialised
deployments (filesystems which are easy to rebuild from scratch), but
transactional integrity is a requirement I'd rank really high on the
desirable list of features if I was you.
> > So "perfectly reliable if UPS power does not fail" seems a bit over the
> > top.
> It works for EMC :-)
Where they control the hardware and run a rather specialized OS as well,
not a general purpose system like Linux on "commodity" hardware ;-)
> In fact, replicating was one of the strategies I considered for this.
> But since it is a lot more work and will not perform as well as a
> simple sweep, I opted for the simple thing. Which turned out to
> be pretty complex anyway. You have to close all the same nasty
> races but with a considerably more complex base algorithm. I think
> that better wait for version 2.0.
Ok, I see.
> By the way, I could use a hand debugging this thing.
I'm afraid with those properties it doesn't really meet my needs :-(
And wouldn't a simpler way to achieve something similar be to use
the plain Linux fs caching/buffers, just disabling forced write out
maybe via a mount option? This strikes me as similar to the effect I get
from remounting NFS (a)sync. Make the fs ignore fsync et al.
It would have the advantage of using all memory available for caching
and not otherwise requested, too. (And, of course, the downside of
making it hard to reserve cache space for a given fs explicitly, at
least now. But I'm sure the control group / container folks would love
that feature. ;-)
Regards,
Lars
--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
On Tuesday 11 March 2008 14:56, Lars Marowsky-Bree wrote:
> And yes, I'm not saying I don't see your point for specialised
> deployments (filesystems which are easy to rebuild from scratch), but
> transactional integrity is a requirement I'd rank really high on the
> desirable list of features if I was you.
It is ranked high, nonetheless I perceive this spate of sky-is-falling
comments as low level FUD. Which of the following do you think is the
least reliable component in your transactional system:
1) Linux
2) The computer
3) The hard disk
4) The battery
5) The fan
Correct answer: the fan. The rest are roughly a tie (though of course
you will find variations) and depend on how much money you spend on
each of them. I know I do not have to explain this to you, but the way
you calculate reliability for a complete system is to multiply the
reliability of each component. The number of nines that drop out of
this calculation is your reliability.
At the moment, the version of Linux I run is looking like a 1.0, and so is
the UPS, which has already got me through about 4 blackouts and half a dozen
vacuum cleaner events. Though obviously neither is 1.0, both are darn
close. The hard disk on the other hand... I have a box full of broken
ones here, how about you?
I have never had a PC go bad on me, ever. Had a couple of fans die,
but these days I only buy PCs that run fine without a fan.
So your proposition is, I can add nines to this system by introducing
atomic update of the backing store. Fine, I agree with you. However
if I already sit at six or seven nines then should I be putting my
effort there, or where?
Also no need to explain: when you introduce two way redundancy, you
square the failure probability, roughly doubling your nines. So have
two independent power supplies on two independent UPSes. Sleep easy,
plus you gain the ability to do scheduled battery maintenance, so
reliability improves by even more than that squaring suggests.
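To put rough numbers on the multiply-and-square arithmetic above, here
is a throwaway C calculation; the component reliabilities are invented
purely for illustration.

/* Back-of-envelope reliability arithmetic.  Build: cc rel.c -lm */
#include <math.h>
#include <stdio.h>

static double nines(double availability)
{
	return -log10(1.0 - availability);
}

int main(void)
{
	/* series components: linux, computer, disk, battery (invented figures) */
	double parts[] = { 0.9999, 0.9999, 0.999, 0.999 };
	double serial = 1.0, psu = 0.999, redundant;
	unsigned i;

	for (i = 0; i < sizeof parts / sizeof *parts; i++)
		serial *= parts[i];
	printf("series system: %.6f (%.1f nines)\n", serial, nines(serial));

	/* two-way redundancy: the system fails only if both supplies fail */
	redundant = 1.0 - (1.0 - psu) * (1.0 - psu);
	printf("one supply:    %.6f (%.1f nines)\n", psu, nines(psu));
	printf("two supplies:  %.6f (%.1f nines)\n", redundant, nines(redundant));
	return 0;
}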
No matter how much you fiddle with atomic update of backing store, one
disgruntled sysop going postal can still destroy your data with the
help of a sledgehammer. You need to get this reliability thing in
perspective.
So how about you draft a Suse engineer to get working on the atomic
backing store update, ETA six months? In the mean time, we can
configure a transactional system using this driver (once stabilized)
to as many nines as you want. Offsite replication using
ddsnap+zumastor? You got it. 5 second latency between replication
cycles? No problem. Incidentally, Ramback totally solves the ddsnap
copy-before-write seek bottleneck, turning replication into a really
elegant fallback solution.
By the way, if you want to fly to the moon you will need a rocket.
Streams of liquid hydrogen coursing through gigantic pipes sitting
right next to violently burning roman candles are less reliable than
bicycle pedals, but only one of these arrangements will get you to the
moon on time. In other words, if you need the speed this is the only
game in town, so you better just take care to buy reliable rocket
parts.
> > > So "perfectly reliable if UPS power does not fail" seems a bit over the
> > > top.
> > It works for EMC :-)
>
> Where they control the hardware and run a rather specialized OS as well,
> not a general purpose system like Linux on "commodity" hardware ;-)
Heh. Maybe you better ask Ric about that commodity hardware thing. I
am sure you are aware that servers with dual power supplies are now
commodity items.
> I'm afraid with those properties it doesn't really meet my needs :-(
See you around then, and thanks for all the effort. Not.
> And, wouldn't a simpler way to achieve something similar not be to use
> the plain Linux fs caching/buffers, just disabling forced write out
> maybe via a mount option? This strikes me as similar to the effect I get
> from remounting NFS (a)sync. Make the fs ignore fsync et al.
Maybe move it up into the VM/VFS after it is completely worked out at
the block layer. If it can't work as a block device it can't work
period.
Bear in mind that our VM is currently stuck together with baling
wire and chewing gum, reliable only because of the man-machine-Morton
effect and Linus's annual Christmas race chase. Careful in there.
> It would have the advantage of using all memory available for caching
> and not otherwise requested, too.
Good point, a ram disk does not work like that. Oh wait, it does.
Bis spaeter,
Daniel
By the way are you aware that all you have to do is:
echo 1 >/proc/driver/ramback/<name>
and ramback already runs in the mode you speak of?
Daniel
On Tuesday 11 March 2008 13:53, Willy Tarreau wrote:
> On Tue, Mar 11, 2008 at 11:56:50AM -0800, Daniel Phillips wrote:
> > On Tuesday 11 March 2008 10:26, Chris Friesen wrote:
> > > I have experienced many 2.6 crashes due to software flaws. Hung
> > > processes leading to watchdog timeouts, bad kernel pointers, kernel
> > > deadlock, etc.
> > >
> > > When designing for reliable embedded systems it's not enough to handwave
> > > away the possibility of software flaws.
> >
> > Indeed. You fix them. Which 2.6 kernel version failed for you, what
> > made it fail, and does the latest version still fail?
>
> Daniel, you're not objective. Simply look at LKML reports. People even
> report failures with the stable branch, and that's quite expected when
> more than 5000 patches are merged in two weeks. The question could even
> be returned to you: what kernel are you using to keep 204 days of uptime
> doing all that you describe? Maybe 2.6.16.x would be fine, but even then,
> a lot of security issues have been fixed over the last 204 days (most of
> which would require a planned reboot), and also a number of normal bugs,
> some of which may cause unexpected downtime.
So we have a flock of people arguing that you can't trust Linux. Well
maybe there are situations where you can't, but what can you trust?
Disk firmware? Bios? Big maybes everywhere. In my experience, Linux
is very reliable. I think Linus, Andrew and others care an awful lot
about that and go to considerable lengths to make it true. Got a list
of Linux kernel flaws that bring down a system? Tell me and I will not
use that version to run a transaction processing system, or I will fix
them or get them fixed.
But please do not tell me that Linux is too unreliable to run a
transaction processing system. If Linux can't do it, then what can?
By the way, the huge ramdisk that Violin ships runs Linux inside, to
manage the raided, hotswappable memory modules. (Even cooler: they
run Linux on a soft processor implemented on a big FPGA.) Does anybody
think that they did not test to make sure Linux does not compromise
their MTBF in any way?
In practice, for the week I was able to test the box remotely and the
10 days I had it in my hands, the thing was solid as a rock. Good
hardware engineering and a nice kernel I say.
> > If Linux is not reliable then we are doomed and I do not care about
> > whether my ramback fails because I will just slit my wrists anyway.
>
> No, I've looked at the violin-memory appliance, and I understand better
> your goal. Trying to get a secure and reliable generic kernel is a lost
> game. However, building a very specific kernel for an appliance is often
> quite achievable, because you know exactly the hardware, usage patterns,
> etc... which help you stabilize it.
Sure. Leaving out dodgy stuff like hald, and other bits I could mention,
is probably a good idea. Scary thing is, things like hald are actually
being run on servers, but that is another issue entirely.
It wasn't too long ago that NFS client was in the dodgy category, with
oops, lockups, whathaveyou. It is pretty solid now, but it takes a
while for the bad experiences to fade from memory. On the other hand,
knfsd has never been the slightest bit of a problem. Helpful
suggestion: don't run NFS client on your transaction processing unit.
It may well be solid, but who needs to find out experimentally? Might
as well toss gamin, dbus and udev while you are at it, for a further
marginal reliability increase. Oh, and alsa, no offense to the great
work there, but it just does not belong on a server. Definitely do
not boot into X (I know I should not have to say that, but...)
> I think people who don't agree with
> you are simply thinking about a generic file server. While I would not
> like my company's NFS exports to rely on such a technology for the same
> concerns as exposed here, I would love to have such a beast for logs
> analysis, build farms, or various computations which require more than
> an average PC's RAM, and would benefit from the data to remain consistent
> across planned downtime.
I guess I am actually going to run evaluations on some mission critical
systems using the arrangement described. I wish I could be more specific
about it, but I know of critical systems pushing massive data that in
fact rely on batteries just as I have described. For completeness, I
will verify that pulling the UPS plug actually corrupts the data and
report my findings. Not by pulling the plug of course, but by asking
the vendors.
> If you consider that the risk of a crash is 1/year and that you have to
> work one day to rebuild everything in case of a crash, it is certainly
> worth using this technology for many things. But if you consider that
> your data cannot suffer a loss even at a rate of 1/year, then you have
> to use something else.
I consider 1/year way too high a failure rate for anything that gets
onto a server I own, and then there must necessarily be systems in
place to limit the damage. For me, that means replication, or perhaps
synchronously mirroring the whole stack which is technology I do not
trust yet on Linux, so we don't do that. Yet.
So here is the tradeoff: do you take the huge performance boost you
get by plugging in the battery, and put the necessary backup systems
in place or do you accept a lower performing system that offers higher
theoretical reliability? It depends on your application. My
immediate application happens to be hacking kernels and taking diffs
which tends to suck rather badly on Linux. Ramback will fix that, and
it will be in place on my workstation right here, I will give my
report. (Bummer, I need to reboot because I don't feel like backporting
to 2.6.20, too bad about that 205 day uptime, but I have to close the
vmsplice hole anyway.) So I will have, say, a 3 GB source code
partition running under ramback and it will act just like spinning
media because of my UPS, except 25 times faster.
Of course the reason I feel brave about this is, everything useful
on that partition gets uploaded to the internet sooner rather than
later. Nonetheless, having to reinstall everything would cost me
hours, so I will certainly not do it if I think there is any
reasonable likelihood I might have to.
> BTW, I would say that IMHO nothing here makes RAID impossible to use :-)
> Just wire 2 of these beasts to a central server with 10 Gbps NICs and
> you have a nice server :-)
Right. See ddraid. It is in the pipeline, but everything takes time.
We also need to reroll NBD complete with deadlock fixes before I feel
good about that.
Regards,
Daniel
[email protected] wrote on 10/03/2008 06:46:16:
> Every little factor of 25 performance increase really helps.
>
> Ramback is a new virtual device with the ability to back a ramdisk
> by a real disk, obtaining the performance level of a ramdisk but with
> the data durability of a hard disk. To work this magic, ramback needs
> a little help from a UPS. In a typical test, ramback reduced a 25
> second file operation[1] to under one second including sync. Even
> greater gains are possible for seek-intensive applications.
What about doing a similar thing as a device mapper target? Have a look at
dm-cache; I know that development of that has stopped, but that doesn't mean
it couldn't be resurrected. It has the advantage that it is generic (any
two block devices will do) and you don't need to populate the "cache" on
start-up - it happens automatically through cache misses.
Another use could be a flash based disk accelerator which may be pretty
popular nowadays.
Tvrtko
Daniel Phillips <[email protected]> writes:
> It is ranked high, nonetheless I perceive this spate of sky-is-falling
> comments as low level FUD. Which of the following do you think is the
> least reliable component in your transactional system:
>
> 1) Linux
> 2) The computer
> 3) The hard disk
> 4) The battery
> 5) The fan
Everyone who has write cache turned on for their hard drives is
running in a mode similar to ramback anyway (except for when the file
system is set to force writes to the platter, but that is rare).
Admittedly, software crashes rarely cause the write cache to be lost,
but hardware failures do, practically every time.
Sure, for some things you don't turn write cache on, and for those you
probably don't use ramback. For the rest, ramback looks very enticing.
/Benny
> > Nice fiction - stuff crashes eventually - not that this isn't useful. For
> > a long time simply loading a 2-3GB Ramdisk off hard disk has been a good
> > way to build things like compile engines where loss of state is not bad.
>
> Right, and now with ramback you will be able to preserve that state and
> have the performance too. It is a wonderful world.
Actually no - ramback would be useless for this. You might crash and end
up with untrustworthy on-disk state - not worth the risk.
> > Ext3 is only going to help you if the ramdisk writeback respects barriers
> > and ordering rules ?
>
> I was alluding to e2fsck's amazing repair ability, not ext3's journal.
Oh you mean "pray hard". e2fsck works well with typical disk style
failures, it is not robust against random chunks vanishing. I know this
as I've worked on and debugged a case where a raid card rebooted silently
and threw out the write back cache.
> > /bin/cp from initrd
>
> But that does not satisfy the requirement you snipped:
True but its a lot simpler.
> More accurately: in general, cannot transfer directly. The ramdisk may
DMA ?
> > suck all the content back in presumably a log structure is not a big
> > concern ?
>
> Sorry, I failed to parse that.
I was suggesting that you want log structure for the writeback disk so
that you keep coherency and can recover it, an issue you seem intent on
ignoring in the interest of speed over any kind of practical usability.
>
> > > * Per chunk locking is not feasible for a terabyte scale ramdisk.
> >
> > And we care 8) ?
>
> "640K should be enough for anyone"
>
> http://www.violin-memory.com/products/violin1010.html <- 504 GB ramdisk
Ok so almost nobody cares
> The finer the granularity the faster the ramdisk syncs to backing
> store. The only attraction of coarse granularity I know of is
False. Disks are ops/second devices not bits/second.
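Alan's point can be put in rough numbers with a toy model, assuming
about 10 ms per random access and 60 MB/s streaming transfer (both
invented round figures): effective flush bandwidth is roughly
chunk_size / (access_time + chunk_size / stream_rate), so very small
chunks spend nearly all of the disk's time seeking.

/* Toy model of flush throughput versus chunk size; disk figures invented. */
#include <stdio.h>

int main(void)
{
	const double seek = 0.010;	/* seconds per random access */
	const double stream = 60e6;	/* bytes per second, sequential */
	double sizes[] = { 4e3, 64e3, 1e6, 16e6 };
	unsigned i;

	for (i = 0; i < sizeof sizes / sizeof *sizes; i++) {
		double chunk = sizes[i];
		double bw = chunk / (seek + chunk / stream);
		printf("%8.0f byte chunks: %5.1f MB/s effective\n",
		       chunk, bw / 1e6);
	}
	return 0;
}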
> Your comment re fs chunk size reveals that I have failed to
> communicate the most basic principle of the ramback design: the
> backing store is not expected to represent a consistent filesystem
No I get that. You've ignored the fact I'm suggesting that design choice
is dumb.
> state during normal operation. Only the ramdisk needs to maintain a
> consistent state, which I have taken care to ensure. You just need
> to believe in your battery, Linux and the hardware it runs on. Which
> of these do you mistrust?
In a big critical environment - all three.
Alan
> Everyone who has write cache turned on for their hard drives is
> running in a mode similar to ramback anyway (except for when the file
> system is set to force writes to the platter, but that is rare).
> Admittedly, software crashes rarely cause the write cache to be lost,
> but hardware failures do, practically every time.
On the contrary - the hard disk cache is managed by the barrier logic in
the kernel, and the ordering even on failures is fairly predictable.
On 3/12/08, Daniel Phillips <[email protected]> wrote:
> On Tuesday 11 March 2008 13:53, Willy Tarreau wrote:
> > BTW, I would say that IMHO nothing here makes RAID impossible to use :-)
> > Just wire 2 of these beasts to a central server with 10 Gbps NICs and
> > you have a nice server :-)
>
>
> Right. See ddraid. It is in the pipeline, but everything takes time.
> We also need to reroll NBD complete with deadlock fixes before I feel
> good about that.
You've alluded to NBD needing deadlock fixes quite a few times in the
past. I've even had some discussions with you on where you see NBD
lacking (userspace nbd-server doesn't lock memory or set PF_MEMALLOC,
etc). But I've lost track of what changes you have in mind for NBD.
Are you talking about a complete re-write or do you have specific
patches that will salvage the existing NBD client and/or server? Has
this work already been done and you just need to dust it off?
As an aside, using a kernel with the new per bdi dirty page accounting
I've not been able to hit any deadlock scenarios with NBD. Am I not
trying hard enough? Or are they now mythical? If real, do you have a
reproducible scenario that will cause NBD to deadlock?
I'm not interested in swap over NBD (e.g. network memory reserves?)
because in practice I've found that the VM doesn't allow non-swap NBD
use-cases to actually need that "netvm" sophistication... any other
workload that deadlocks NBD would be interesting.
thanks,
Mike
On Wednesday 12 March 2008 05:01, [email protected] wrote:
> [email protected] wrote on 10/03/2008 06:46:16:
>
> > Every little factor of 25 performance increase really helps.
> >
> > Ramback is a new virtual device with the ability to back a ramdisk
> > by a real disk, obtaining the performance level of a ramdisk but with
> > the data durability of a hard disk. To work this magic, ramback needs
> > a little help from a UPS. In a typical test, ramback reduced a 25
> > second file operation[1] to under one second including sync. Even
> > greater gains are possible for seek-intensive applications.
>
> What about doing a similar thing as a device mapper target? Have a look at
> dm-cache; I know that development of that has stopped, but that doesn't mean
> it couldn't be resurrected. It has the advantage that it is generic (any
> two block devices will do) and you don't need to populate the "cache" on
> start-up - it happens automatically through cache misses.
It is a device mapper target (though there is no real advantage in that
other than having a handy plug-in api). It does handle any two block
devices, and it does populate on cache miss. But it also has daemon-driven
population, since it never makes sense to leave the backing disk idle and
then incur read latency later because of that.
Regards,
Daniel
On Wednesday 12 March 2008 06:11, Alan Cox wrote:
> > > Ext3 is only going to help you if the ramdisk writeback respects barriers
> > > and ordering rules ?
> >
> > I was alluding to e2fsck's amazing repair ability, not ext3's journal.
>
> Oh you mean "pray hard". e2fsck works well with typical disk style
> failures, it is not robust against random chunks vanishing. I know this
> as I've worked on and debugged a case where a raid card rebooted silently
> and threw out the write back cache.
So then you know that people already rely on batteries in critical storage
applications. So I do not understand why all the FUD from you.
Particularly about Ext2/Ext3, which does recover well from random damage.
My experience.
> > Your comment re fs chunk size reveals that I have failed to
> > communicate the most basic principle of the ramback design: the
> > backing store is not expected to represent a consistent filesystem
>
> No I get that. You've ignored the fact I'm suggesting that design choice
> is dumb.
You seem to be calling Linux unreliable.
Daniel
Daniel Phillips wrote:
> On Wednesday 12 March 2008 06:11, Alan Cox wrote:
>>No I get that. You've ignored the fact I'm suggesting that design choice
>>is dumb.
>
>
> You seem to be calling Linux unreliable.
It's more reliable than many others, but it's not perfect.
Besides, there are many failure modes beyond the control of the kernel.
Hardware errors can lock up the bus and prevent I/O, RAM modules can
go bad, technicians can yank out cards without waiting for the ready
light.
For certain classes of devices it's necessary to plan for these sorts of
things, and a model where the on-disk structures may be inconsistent by
design is not going to be very attractive.
Chris
On Wednesday 12 March 2008 11:11, Chris Friesen wrote:
> Daniel Phillips wrote:
> > You seem to be calling Linux unreliable.
>
> It's more reliable than many others, but it's not perfect.
>
> Besides, there are many failure modes beyond the control of the kernel.
> Hardware errors can lock up the bus and prevent I/O, RAM modules can
> go bad, technicians can yank out cards without waiting for the ready
> light.
...disks can break, batteries on raid controllers can fail, etc, etc...
So you design for the number of nines you need, taking all factors
into account, and you design for the performance you need. These are
cut and dried calculations. FUD has no place here.
> For certain classes of devices it's necessary to plan for these sorts of
> things, and a model where the on-disk structures may be inconsistent by
> design is not going to be very attractive.
You are preaching to the converted. Systems consisting of:
linux + disks + batteries + ram + network + redundancy
can be as reliable as you need. Respectfully, I would like to return
to the software engineering problem. This driver solves a problem for
certain people. Not niche people to be forgotten about. If it does
not solve your problem then please just write a driver that does,
meanwhile this one needs some finishing work. Let's get the proverbial
thousand eyeballs working. Has anybody besides me compiled this yet?
Daniel
Daniel Phillips wrote:
> Particularly about Ext2/Ext3, which does recover well from random damage.
> My experience.
By "recover well", you must mean "loses massive swabs of data, leaving
the system unbootable and with enormous numbers of user files missing."
My experience.
Expecting fsck to cover for missed writes is stupid.
Daniel Phillips wrote:
> So you design for the number of nines you need, taking all factors
> into account, and you design for the performance you need. These are
> cut and dried calculations. FUD has no place here.
>
There's no FUD here. The problem is that you didn't say that you've
designed this for only a few nines. If you delete fsck from your
rationale, simply saying that you rely on UPS to give you time to flush
buffers, you have a much better story. Certainly, once you've flushed
buffers and degraded to write-through mode, you're obviously as reliable
as ext2/3.
Your idea seems predicated on throwing large amounts of RAM at the
problem. What I want to know is this: Is it really 25 times faster than
ext3 with an equally huge buffer cache?
On Wednesday 12 March 2008 22:39, David Newall wrote:
> Daniel Phillips wrote:
> > Particularly about Ext2/Ext3, which does recover well from random damage.
> > My experience.
>
> By "recover well", you must mean "loses massive swabs of data, leaving
> the system unbootable and with enormous numbers of user files missing."
> My experience.
>
> Expecting fsck to cover for missed writes is stupid.
Whatever it can get off the disk it gets. It does a good job. If you
don't think so, then don't tell me, tell Ted.
Daniel
On Wednesday 12 March 2008 22:45, David Newall wrote:
> Daniel Phillips wrote:
> > So you design for the number of nines you need, taking all factors
> > into account, and you design for the performance you need. These are
> > cut and dried calculations. FUD has no place here.
>
> There's no FUD here. The problem is that you didn't say that you've
> designed this for only a few nines.
Right. 6 or 7.
> If you delete fsck from your
> rationale, simply saying that you rely on UPS to give you time to flush
> buffers, you have a much better story. Certainly, once you've flushed
> buffers and degraded to write-through mode, you're obviously as reliable
> as ext2/3.
Fsck was never a part of my rationale. Only reliability of components
was and is. Then people jumped in saying Linux is too unreliable to
use in a, hmm, storage system. Or transaction processing system. Or
whatever.
Balderdash, I say.
> Your idea seems predicated on throwing large amounts of RAM at the
> problem. What I want to know is this: Is it really 25 times faster than
> ext3 with an equally huge buffer cache?
Yes.
Regards,
Daniel
Daniel Phillips wrote:
>> Your idea seems predicated on throwing large amounts of RAM at the
>> problem. What I want to know is this: Is it really 25 times faster than
>> ext3 with an equally huge buffer cache?
>>
>
> Yes.
Well, that sounds convincing. Not. You know this how?
On Wed, 12 Mar 2008, Daniel Phillips wrote:
> On Wednesday 12 March 2008 22:45, David Newall wrote:
>
>> Your idea seems predicated on throwing large amounts of RAM at the
>> problem. What I want to know is this: Is it really 25 times faster than
>> ext3 with an equally huge buffer cache?
>
> Yes.
this I don't understand. what makes your approach 25x faster?
looking at the comparison of a 500G filesystem with 500G of ram allocated
for a buffer cache.
yes, initially it will be a bit slower (until the files get into the
buffer cache), and if fsync is disabled all writes will go to the buffer
cache (until writeout hits)
I may be able to see room for a few percent difference, but not 2x, let
alone 25x.
David Lang
On Wednesday 12 March 2008 23:30, David Newall wrote:
> Daniel Phillips wrote:
> >> Your idea seems predicated on throwing large amounts of RAM at the
> >> problem. What I want to know is this: Is it really 25 times faster than
> >> ext3 with an equally huge buffer cache?
> >
> > Yes.
>
> Well, that sounds convincing. Not. You know this how?
By measuring it. time untar -xf linux-2.2.26.tar; time sync
Daniel
Daniel Phillips wrote:
> On Wednesday 12 March 2008 23:30, David Newall wrote:
>
>> Daniel Phillips wrote:
>>
>>>> Your idea seems predicated on throwing large amounts of RAM at the
>>>> problem. What I want to know is this: Is it really 25 times faster than
>>>> ext3 with an equally huge buffer cache?
>>>>
>>> Yes.
>>>
>> Well, that sounds convincing. Not. You know this how?
>>
>
> By measuring it. time untar -xf linux-2.2.26.tar; time sync
>
No numbers. No specifications. And by doing a sync, you explicitly
excluded what I was asking, namely a big buffer cache. You've certainly
convinced me; you don't know if your idea is worth a brass razoo.
Come back when you've got some hard data.
On Wednesday 12 March 2008 23:32, [email protected] wrote:
> looking at the comparison of a 500G filesystem with 500G of ram allocated
> for a buffer cache.
>
> yes, initially it will be a bit slower (until the files get into the
> buffer cache), and if fsync is disabled all writes will go to the buffer
> cache (until writeout hits)
>
> I may be able to see room for a few percent difference, but not 2x, let
> alone 25x.
My test ran 25 times faster because it was write intensive and included
sync. It did not however include seeks, which can cause an even bigger
performance gap.
The truth is, my system has _more_ cache available for file buffering
than I used for the ramdisk, and almost every file operation I do
(typically dozens of tree diffs, hundreds of compiles per day) goes
_way_ faster on the ram disk. Really, really a lot faster. Because
frankly, Linux is not very good at using its file cache these days.
Somebody ought to fix that. (I am busy fixing other things.)
In other, _real world_ NFS file serving tests, we have seen 20 - 200
times speedup in serving snapshotted volumes via NFS, using ddsnap
for snapshots and replication. While it is true that ddsnap will
eventually be optimized to improve performance on spinning media,
I seriously doubt it will ever get closer than a factor of 20 or so,
with a typical read/write mix.
But that is just the pragmatic reality of machines everybody has these
days, let us not get too wrapped up in that. Think about the Violin
box. How are you going to put 504 gigabytes of data in buffer cache?
Tell me how a transaction processing system is going to run with
latency measured in microseconds, backed by hard disk, ever?
Really guys, ramdisks are fast. Admit it, they are really really fast.
So I provide a way to make them persistent also. For free, I might
add.
Why am I reminded of old arguments like "if men were meant to fly, God
would have given them wings"? Please just give me your microsecond
scale transaction processing solution and I will be impressed and
grateful. Until then... here is mine. Service with a smile.
Daniel
On Thursday 13 March 2008 00:05, David Newall wrote:
> No numbers. No specifications. And by doing a sync, you explicitly
> excluded what I was asking, namely a big buffer cache. You've certainly
> convinced me; you don't know if your idea is worth a brass razoo.
>
> Come back when you've got some hard data.
...or download the code and try it yourself.
On Wed, 12 Mar 2008, Daniel Phillips wrote:
> On Wednesday 12 March 2008 23:32, [email protected] wrote:
>> looking at the comparison of a 500G filesystem with 500G of ram allocated
>> for a buffer cache.
>>
>> yes, initially it will be a bit slower (until the files get into the
>> buffer cache), and if fsync is disabled all writes will go to the buffer
>> cache (until writeout hits)
>>
>> I may be able to see room for a few percent difference, but not 2x, let
>> alone 25x.
>
> My test ran 25 times faster because it was write intensive and included
> sync. It did not however include seeks, which can cause an even bigger
> performance gap.
if you are not measuring the time to get from ram to disk (which you are
not doing in your ramback device) syncs are meaningless.
seeks should only be a factor in the process of populating the buffer
cache. both systems need to read the data from disk to the cache, they can
either fault the data in as it's accessed, or run a process to read it all
in as a batch.
> The truth is, my system has _more_ cache available for file buffering
> than I used for the ramdisk, and almost every file operation I do
> (typically dozens of tree diffs, hundreds of compiles per day) goes
> _way_ faster on the ram disk. Really, really a lot faster. Because
> frankly, Linux is not very good at using its file cache these days.
> Somebody ought to fix that. (I am busy fixing other things.)
so you are saying that when the buffer cache stores the data from your ram
disk it will slow down. that sounds like it equalizes the performance and
is a problem that needs to be solved for ramdisks as well.
> In other, _real world_ NFS file serving tests, we have seen 20 - 200
> times speedup in serving snapshotted volumes via NFS, using ddsnap
> for snapshots and replication. While it is true that ddsnap will
> eventually be optimized to improved performance on spinning media,
> I seriously doubt it will ever get closer than a factor of 20 or so,
> with a typical read/write mix.
NFS is a very different beast.
> But that is just the pragmatic reality of machines everybody has these
> days, let us not get too wrapped up in that. Think about the Violin
> box. How are you going to put 504 gigabytes of data in buffer cache?
> Tell me how a transaction processing system is going to run with
> latency measured in microseconds, backed by hard disk, ever?
it all depends on how you define the term 'backed by hard disk'. if you
don't write to the hard disk and just dirty pages in ram you can easily
hit that sort of latency. I don't understand why you say it's so hard to
put 504G of data into the buffer cache, you just read it and it's in the
cache.
> Really guys, ramdisks are fast. Admit it, they are really really fast.
nobody is disputing this.
> So I provide a way to make them persistent also. For free, I might
> add.
except that you are redefining the terms 'persistent' and 'free' to mean
something different than what everyone else understands them to be.
> Why am I reminded of old arguments like "if men were meant to fly, God
> would have given them wings"? Please just give me your microsecond
> scale transaction processing solution and I will be impressed and
> grateful. Until then... here is mine. Service with a smile.
if you don't have to worry about unclean shutdowns then your system is not
needed. all you need to do is to create a ramdisk that you populate with
dd at boot time and save to disk with dd at shutdown. problem solved in a
couple lines of shell scripts and no kernel changes needed.
if you want the data to be safe in the face of unclean shutdowns and
crashes, then you need to figure out how to make the image on disk
consistent, and at this point you have basically said that you don't think
that it's a problem. so we're back to what you can do today with a couple
lines of scripting.
David Lang
On Thursday 13 March 2008 00:55, [email protected] wrote:
> if you are not measuring the time to get from ram to disk (which you are
> not doing in your ramback device) syncs are meaningless.
There was a time when punchcards ruled and everybody was nervous about
storing their data on magnetic media. I remember it well, you may not.
But you are repeating that bit of history, there is a proverb in there
somewhere.
> > Why am I reminded of old arguments like "if men were meant to fly, God
> > would have given them wings"? Please just give me your microsecond
> > scale transaction processing solution and I will be impressed and
> > grateful. Until then... here is mine. Service with a smile.
>
> if you don't have to worry about unclean shutdowns then your system is not
> needed. all you need to do is to create a ramdisk that you populate with
> dd at boot time and save to disk with dd at shutdown. problem solved in a
> couple lines of shell scripts and no kernel changes needed.
>
> if you want the data to be safe in the face of unclean shutdowns and
> crashes, then you need to figure out how to make the image on disk
> consistent, and at this point you have basically said that you don't think
> that it's a problem. so we're back to what you can do today with a couple
> lines of scripting.
Feel free. You use your script, and somebody with a reliable UPS or
two can use my driver, once it is stabilized of course. Just don't be
in business against them if being a few milliseconds slower on the
uptake means money lost.
Daniel
On Thu, 13 Mar 2008, Daniel Phillips wrote:
> On Thursday 13 March 2008 00:55, [email protected] wrote:
>> if you are not measuring the time to get from ram to disk (which you are
>> not doing in your ramback device) syncs are meaningless.
>
> There was a time when punchcards ruled and everybody was nervous about
> storing their data on magnetic media. I remember it well, you may not.
> But you are repeating that bit of history, there is a proverb in there
> somewhere.
>
>>> Why am I reminded of old arguments like "if men were meant to fly, God
>>> would have given them wings"? Please just give me your microsecond
>>> scale transaction processing solution and I will be impressed and
>>> grateful. Until then... here is mine. Service with a smile.
>>
>> if you don't have to worry about unclean shutdowns then your system is not
>> needed. all you need to do is to create a ramdisk that you populate with
>> dd at boot time and save to disk with dd at shutdown. problem solved in a
>> couple lines of shell scripts and no kernel changes needed.
>>
>> if you want the data to be safe in the face of unclean shutdowns and
>> crashes, then you need to figure out how to make the image on disk
>> consistent, and at this point you have basically said that you don't think
>> that it's a problem. so we're back to what you can do today with a couple
>> lines of scripting.
>
> Feel free. You use your script, and somebody with a reliable UPS or
> two can use my driver, once it is stabilized of course. Just don't be
> in business against them if being a few milliseconds slower on the
> uptake means money lost.
so now you are saying that you are faster than a ramdisk?????
either you are completely out of touch or you misunderstood what I was
saying.
if you have a reliable UPS and are willing to rely on it to save your data
take the identical hardware to what you are planning to use, but instead
of using your driver just create a ramdisk and load it on boot and save
the contents on shutdown.
in this case you are doing zero disk I/O during normal operation, you only
touch the disk during startup and shutdown.
with your proposal the system will be copying chunks of data from the
ramdisk to the hard disk at random times, and you are claiming that doing
so makes you faster than a ramdisk????
I'll say it again. if you trust your UPS and don't care about unclean
shutdowns (say for example that you trust that linux is just never going
to crash) there's no need to write parts of the ramdisk to the hard disk
during normal operation, you can wait until you are told that you are
going to shutdown to do the data save.
now there's no driver needed, just a couple lines of init scripts.
David Lang
On 11.03.2008 15:02, Daniel Phillips wrote:
> By the way, if you want to fly to the moon you will need a rocket.
> Streams of liquid hydrogen coursing through gigantic pipes sitting
> right next to violently burning roman candles are less reliable than
> bicycle pedals, but only one of these arrangements will get you to the
> moon on time. In other words, if you need the speed this is the only
> game in town, so you better just take care to buy reliable rocket
> parts.
Or to quote a little SciFi, in this case Captain Hunt from Andromeda:
Slipstream: it's not the best way to travel faster than light, it's just
the only way.[1]
At least that's what first crossed my mind after reading the above. ;-)
[1]: http://en.wikipedia.org/wiki/Slipstream_%28science_fiction%29#Andromeda
Bis denn
--
Real Programmers consider "what you see is what you get" to be just as
bad a concept in Text Editors as it is in women. No, the Real Programmer
wants a "you asked for it, you got it" text editor -- complicated,
cryptic, powerful, unforgiving, dangerous.
On Thursday 13 March 2008 01:39, [email protected] wrote:
> > Feel free. You use your script, and somebody with a reliable UPS or
> > two can use my driver, once it is stabilized of course. Just don't be
> > in business against them if being a few milliseconds slower on the
> > uptake means money lost.
>
> so now you are saying that you are faster then a ramdisk?????
No, I am saying that my driver is faster than any script you can write.
Your script will not be able to give access to data while the ramdisk
is being populated, nor will it be able to save efficiently exactly
what is dirty in the ramdisk. (Explained in my original post if you
would like to take a look.)
> if you have a reliable UPS and are willing to rely on it to save your data
> take the identical hardware to what you are planning to use, but instead
> of using your driver just create a ramdisk and load it on boot and save
> the contents on shutdown.
Aha! You are getting close. Really, that is all ramback does. It
just handles some very difficult related issues efficiently, in such a
way as to minimize any denial of service from complete loss of UPS
power. This is all just about using power management in a new way that
gets higher performance. But your battery power has to be reliable.
Just make it so. It is not difficult these days, or even particularly
expensive.
I calculated somewhere along the line that it would take something like
17 minutes to populate the big Violin ramdisk initially, and 17 minutes
to save it during a loss of line power event, during which UPS power
must not run out before ramback achieves disk sync or you will get
file corruption. (This rule was mentioned in my original post.)
All well and good, you can in fact do that with a pretty simple script.
But in the initial 17 minutes your application may not read or write
the ramdisk data and in the closing 17 minutes it may not write. That
knocks your system down to 4 nines, given one planned shutdown per year.
Not good, not good at all.
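A rough check of that arithmetic, using round numbers rather than exact
figures: two 17-minute windows per planned shutdown, one shutdown per
year, is about 34 minutes of unavailability out of roughly 525600
minutes in a year:

  echo 'scale=7; 1 - 34/525600' | bc
  # prints .9999354, about 99.994% availability, i.e. four nines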
See, ramback is entirely about _not_ getting knocked down to 4 nines.
It wants to stay above 6, given system components that satisfy that goal,
comprising:
* Linux
* Processor, memory, motherboard etc
* Dual power supplies with independent UPS backup
* Ramback driver
My proposition is, you can go out and purchase hardware right now that
delivers 6 nines (30 seconds downtime/year) and yes, it will cost you,
but if that worries you then set up two (much) cheaper ones and set
them up as a failover cluster. (Helps that the Violin box can connect
via PCI-e to two servers at the same time.)
I say you can do this reliably. It boils down to your power supplies.
You say it can't be done. Who is right?
Daniel
On Thursday 13 March 2008 00:55, [email protected] wrote:
> I don't understand why you say it's so hard to
> put 504G of data into the buffer cache, you just read it and it's in the
> cache.
Sorry, I missed that the first time. See, it is 504G, not 504M.
Daniel
Lars Marowsky-Bree wrote:
> On 2008-03-11T03:50:18, Daniel Phillips <[email protected]> wrote:
>
>>> as well. In fact, the most common reason for unorderly shutdowns are
>>> kernel crashes, not power failures in my experience.
>> What are you doing to your kernel?
>
> I guess I'm being really vicious to them: I expose it to customers and
> the real world.
>
> My own servers also have uptimes of >400 days sometimes, and I wonder
> what customers do to the poor things.
>
> And yes, I'm not saying I don't see your point for specialised
> deployments (filesystems which are easy to rebuild from scratch), but
> transactional integrity is a requirement I'd rank really high on the
> desirable list of features if I was you.
>
>>> So "perfectly reliable if UPS power does not fail" seems a bit over the
>>> top.
>> It works for EMC :-)
>
> Where they control the hardware and run a rather specialized OS as well,
> not a general purpose system like Linux on "commodity" hardware ;-)
Actually, in Centera we use generic hardware with a fairly normal kernel
which has strategic backports from upstream (libata, nic drivers, etc).
No UPS in the picture. Data integrity is protected by working with the
application team to ensure they understand when data is safely on the
disk platter and working with IO & FS people to try and make sure we
don't lie to them (too much) about that promise.
The Centera boxes are tested with power failure & error injection and by
all of our customers in all the ways that customers do ;-)
ric
On Wed, 12 Mar 2008 22:14:16 -0800
Daniel Phillips <[email protected]> wrote:
> On Wednesday 12 March 2008 22:39, David Newall wrote:
> > Daniel Phillips wrote:
> > > Particularly about Ext2/Ext3, which does recover well from random damage.
> > > My experience.
> >
> > By "recover well", you must mean "loses massive swabs of data, leaving
> > the system unbootable and with enormous numbers of user files missing."
> > My experience.
> >
> > Expecting fsck to cover for missed writes is stupid.
>
> Whatever it can get off the disk it gets. It does a good job. If you
> don't think so, then don't tell me, tell Ted.
He knows. Ext3 cannot recover well from massive loss of intermediate
writes. It isn't a normal failure mode and there isn't sufficient fs
metadata robustness for this. A log structured backing store would deal
with that but all you apparently want to do is scream FUD at anyone who
doesn't agree with you.
Alan
Alan Cox <[email protected]> writes:
> On the contrary - the hard disk cache is managed by the barrier logic in
> the kernel, and the ordering even on failures is fairly predictable.
Doesn't that require explicitly setting barrier=1 for ext3? Are there
any distributions which set that by default?
/Benny
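For reference, write barriers on ext3 are a per-mount option, so enabling
them explicitly looks roughly like this (the device and mount point here
are placeholders, not from the thread):

  mount -t ext3 -o barrier=1 /dev/sdXN /mnt/data
  # or persistently, via an /etc/fstab line such as:
  # /dev/sdXN  /mnt/data  ext3  defaults,barrier=1  0  2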
On Thu, 13 Mar 2008, Daniel Phillips wrote:
> On Thursday 13 March 2008 01:39, [email protected] wrote:
>>> Feel free. You use your script, and somebody with a reliable UPS or
>>> two can use my driver, once it is stabilized of course. Just don't be
>>> in business against them if being a few milliseconds slower on the
>>> uptake means money lost.
>>
>> so now you are saying that you are faster than a ramdisk?????
>
> No, I am saying that my driver is faster than any script you can write.
> Your script will not be able to give access to data while the ramdisk
> is being populated, nor will it be able to save efficiently exactly
> what is dirty in the ramdisk. (Explained in my original post if you
> would like to take a look.)
that is not something that makes a difference of a few milliseconds
>> if you have a reliable UPS and are willing to rely on it to save your data
>> take the identical hardware to what you are planning to use, but instead
>> of using your driver just create a ramdisk and load it on boot and save
>> the contents on shutdown.
>
> Aha! You are getting close. Really, that is all ramback does. It
> just handles some very difficult related issues efficiently, in such a
> way as to minimize any denial of service from complete loss of UPS
> power. This is all just about using power management in a new way that
> gets higher performance. But your battery power has to be reliable.
> Just make it so. It is not difficult these days, or even particularly
> expensive.
>
> I calculated somewhere along the line that it would take something like
> 17 minutes to populate the big Violin ramdisk initially, and 17 minutes
> to save it during a loss of line power event, during which UPS power
> must not run out before ramback achieves disk sync or you will get
> file corruption. (This rule was mentioned in my original post.)
>
> All well and good, you can in fact do that with a pretty simple script.
> But in the initial 17 minutes your application may not read or write
> the ramdisk data and in the closing 17 minutes it may not write. That
> knocks your system down to 4 nines, given one planned shutdown per year.
> Not good, not good at all.
>
> See, ramback is entirely about _not_ getting knocked down to 4 nines.
> It wants to stay above 6, given system components that satisfy that goal,
> comprising:
>
> * Linux
> * Processor, memory, motherboard etc
> * Dual power supplies with independent UPS backup
> * Ramback driver
>
> My proposition is, you can go out and purchase hardware right now that
> delivers 6 nines (30 seconds downtime/year) and yes, it will cost you,
> but if that worries you then set up two (much) cheaper ones and set
> them up as a failover cluster. (Helps that the Violin box can connect
> via PCI-e to two servers at the same time.)
you just use a redundant system and you no longer care how long it takes
to shut down or start up a system. if you're running a datacenter that cares
about uptime to the point of counting 9's or buying a Violin box you are
already going to need redundant machines so that you can perform upgrades.
> I say you can do this reliably. It boils down to your power supplies.
> You say it can't be done. Who is right?
it also takes the faith that you will never have any unplanned shutdowns,
since your system will lose massive amounts of data if they happen.
nobody who worries about 9's will buy into that argument. you achieve 9's
by figuring that things don't always work, and as a result you figure out
how to engineer around the failures so that when they happen you stay up.
manufacturers have been trying to promise that their boxes are so reliable
that they won't go down for decades, and they haven't succeeded yet.
David Lang
On Thursday 13 March 2008 06:27, Ric Wheeler wrote:
> >>> So "perfectly reliable if UPS power does not fail" seems a bit over the
> >>> top.
> >> It works for EMC :-)
> >
> > Where they control the hardware and run a rather specialized OS as well,
> > not a general purpose system like Linux on "commodity" hardware ;-)
>
> Actually, in Centera we use generic hardware with a fairly normal kernel
> which has strategic backports from upstream (libata, nic drivers, etc).
>
> No UPS in the picture. Data integrity is protected by working with the
> application team to ensure they understand when data is safely on the
> disk platter and working with IO & FS people to try and make sure we
> don't lie to them (too much) about that promise.
>
> The Centera boxes are tested with power failure & error injection and by
> all of our customers in all the ways that customers do ;-)
Hi Ric,
Right, so Linux has gotten to the point where it competes with purpose-
built embedded software in reliability. Not quite there, but close
enough for mission-critical.
I was not thinking of Centera when I mentioned the UPS though...
Daniel
On Thursday 13 March 2008 06:22, Alan Cox wrote:
> ...Ext3 cannot recover well from massive loss of intermediate
> writes. It isn't a normal failure mode and there isn't sufficient fs
> metadata robustness for this. A log structured backing store would deal
> with that but all you apparently want to do is scream FUD at anyone who
> doesn't agree with you.
Scream is an exaggeration, and FUD only applies to somebody who
consistently overlooks the primary proposition in this design: that the
battery backed power supply, computer hardware and Linux are reliable
enough to entrust your data to them. I say this is practical, you say
it is impossible, I say FUD.
All you are proposing is that nobody can entrust their data to any
hardware. Good point. There is no absolute reliability, only degrees
of it.
Many raid controllers now have battery backed writeback cache, which
is exactly the same reliability proposition as ramback, on a smaller
scale. Do you refuse to entrust your corporate data to such
controllers?
Daniel
Daniel Phillips wrote:
> On Thursday 13 March 2008 06:27, Ric Wheeler wrote:
>>>>> So "perfectly reliable if UPS power does not fail" seems a bit over the
>>>>> top.
>>>> It works for EMC :-)
>>> Where they control the hardware and run a rather specialized OS as well,
>>> not a general purpose system like Linux on "commodity" hardware ;-)
>> Actually, in Centera we use generic hardware with a fairly normal kernel
>> which has strategic backports from upstream (libata, nic drivers, etc).
>>
>> No UPS in the picture. Data integrity is protected by working with the
>> application team to ensure they understand when data is safely on the
>> disk platter and working with IO & FS people to try and make sure we
>> don't lie to them (too much) about that promise.
>>
>> The Centera boxes are tested with power failure & error injection and by
>> all of our customers in all the ways that customers do ;-)
>
> Hi Ric,
>
> Right, so Linux has gotten to the point where it competes with purpose-
> built embedded software in reliability. Not quite there, but close
> enough for mission-critical.
This is our case, but we have been working for quite a while to enhance
the reliability of the io stack & file systems. It also helps to be very
careful to select hardware components with mature, open source &
natively integrated drivers ;-)
>
> I was not thinking of Centera when I mentioned the UPS though...
>
> Daniel
No problem, we certainly have many boxes with built in ups hardware ;-)
ric
On Thursday 13 March 2008 09:25, [email protected] wrote:
> On Thu, 13 Mar 2008, Daniel Phillips wrote:
> > My proposition is, you can go out and purchase hardware right now that
> > delivers 6 nines (30 seconds downtime/year) and yes, it will cost you,
> > but if that worries you then set up two (much) cheaper ones and set
> > them up as a failover cluster. (Helps that the Violin box can connect
> > via PCI-e to two servers at the same time.)
>
> you just use a redundant system and you no longer care how long it takes
> to shutdown or startup a system. if you're running a datacenter that cares
> about uptime to the point of counting 9's or buying a Violin box you are
> already going to need redundant machines so that you can perform upgrades.
The period where you cannot access the data is downtime. If your script
just does a cp from a disk array to the ram device you cannot just read
from the backing store in that period because you will need to fail over
to the ramdisk at some point, and you cannot just read from the ramdisk
because it is not populated yet. My point is, you cannot implement
ramback as a two line script and expect to achieve anything resembling
continuous data availability.
I interpret your point about the script as, Ramback is trivial and easy
to implement. That is kind of true and kind of untrue, because of the
additional requirements mentioned in my original post.
> > I say you can do this reliably. It boils down to your power supplies.
> > You say it can't be done. Who is right?
>
> it also takes the faith that you will never have any unplanned shutdowns,
Never is not the right word, but indeed that is why I wrote the story
about the rocket ship. If you want the performance that ramback
delivers then you cover the risk of hardware failure by other,
standard means.
> since your system will lose massive amounts of data if they happen.
Why would you assume the data is not mirrored or replicated with a
short cycle, or no other suitable fallback plan is in place?
> nobody who worries about 9's will buy into that argument. you achieve 9's
> by figuring that things don't always work, and as a result you figure out
> how to engineer around the failures so that when they happen you stay up.
> manufacturers have been trying to promise that their boxes are so reliable
> that they won't go down for decades, and they haven't succeeded yet.
All true. Now what about the punchcard versus magnetic media story?
There was a time when magnetic domains were considered less reliable
than holes in paper cards; ironically, we now think the opposite. So
some people will have a hard time with the idea that a battery is
reliable enough to get your important cached data on to hard disk when
necessary, or that Linux is reliable enough to trust data to it, or
whatever. They will get over it. Battery backed data will become a
normal part of your life as progress marches on.
Daniel
On Thursday 13 March 2008 10:16, [email protected] wrote:
> can it support making the "fast" block device of the pair smaller in
> capacity, while the overall virtual device is still as big as the backing
> store?
It can't. Since that problem is a strict superset of the one-to-one
problem that ramback solves, I thought it would make sense to work out
the bugs in ramback first before tackling the harder problem.
Daniel
On Thursday 13 March 2008 12:12, Ric Wheeler wrote:
> > Right, so Linux has gotten to the point where it competes with purpose-
> > built embedded software in reliability. Not quite there, but close
> > enough for mission-critical.
>
> This is our case, but we have been working for quite a while to enhance
> the reliability of the io stack & file systems. It also helps to be very
> careful to select hardware components with mature, open source &
> natively integrated drivers ;-)
A word to the wise indeed. Well I would never suggest that we can rest
on our laurels as far as Linux reliability is concerned, only that it is
already very reliable or you certainly would not ship products based on
it.
Daniel
Daniel Phillips wrote:
> The period where you cannot access the data is downtime. If your script
> just does a cp from a disk array to the ram device you cannot just read
> from the backing store in that period because you will need to fail over
> to the ramdisk at some point, and you cannot just read from the ramdisk
> because it is not populated yet.
Wouldn't a raid-1 set comprising disk + ramdisk do that with no downtime?
On Thursday 13 March 2008 12:50, David Newall wrote:
> Daniel Phillips wrote:
> > The period where you cannot access the data is downtime. If your script
> > just does a cp from a disk array to the ram device you cannot just read
> > from the backing store in that period because you will need to fail over
> > to the ramdisk at some point, and you cannot just read from the ramdisk
> > because it is not populated yet.
>
> Wouldn't a raid-1 set comprising disk + ramdisk do that with no downtime?
In raid1, write completion has to wait for write completion on all
mirror members, so writes run at disk speed. Reads run at ramdisk
speed, so your proposal sounds useful, but ramback aims for high
write performance as well.
Daniel
On Thu, 13 Mar 2008 11:14:39 -0800
Daniel Phillips <[email protected]> wrote:
> Scream is an exaggeration, and FUD only applies to somebody who
> consistently overlooks the primary proposition in this design: that the
> battery backed power supply, computer hardware and Linux are reliable
> enough to entrust your data to them.
That's a reasonable enough assumption, to anyone who has never dealt
with software before, or whose data is just not important.
People who have dealt with computers for longer will know that anything
can fail at any time, and usually does unexpectedly and at bad moments.
Some defensive programming to deal with random failures could make your
project appealing to a lot more people than it would appeal to in its
current state.
--
All Rights Reversed
On Wed, 12 Mar 2008 00:17:56 -0800
Daniel Phillips <[email protected]> wrote:
> So we have a flock of people arguing that you can't trust Linux. Well
> maybe there are situations where you can't, but what can you trust?
> Disk firmware? Bios? Big maybes everywhere.
The traditional and proven method of constructing a reliable system is
to assume that no component can be fully trusted. This is especially
true for new code.
By being paranoid about everything, failures in one component are
usually contained well enough that one failure is not catastrophic.
In order for ramback to appeal to the people who are paranoid
about data integrity (probably a vast majority of users), you will
need some guarantees about flush order, etc...
--
All Rights Reversed
On Thursday 13 March 2008 13:34, Rik van Riel wrote:
> On Wed, 12 Mar 2008 00:17:56 -0800
> Daniel Phillips <[email protected]> wrote:
>
> > So we have a flock of people arguing that you can't trust Linux. Well
> > maybe there are situations where you can't, but what can you trust?
> > Disk firmware? Bios? Big maybes everywhere.
>
> The traditional and proven method of constructing a reliable system is
> to assume that no component can be fully trusted. This is especially
> true for new code.
>
> By being paranoid about everything, failures in one component are
> usually contained well enough that one failure is not catastrophic.
>
> In order for ramback to appeal to the people who are paranoid
> about data integrity (probably a vast majority of users), you will
> need some guarantees about flush order, etc...
I disagree. Never mind that it already does provide such guarantees,
just echo 1 >/proc/driver/ramback/name. But if you want the full
performance you need to satisfy your paranoia at a higher level in
the traditional way: by running two in parallel or whatever.
Daniel
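As a sketch of how that knob might be driven from a UPS event hook (the
/proc path is the one quoted above; the event names and the assumption
that 1 selects writethrough and 0 selects writeback are mine, not
confirmed anywhere in the thread):

  #!/bin/sh
  # hypothetical power-management hook for a ramback device
  CTL=/proc/driver/ramback/name    # substitute the real device name
  case "$1" in
      onbattery) echo 1 > "$CTL" ;;   # assumed: flush and go writethrough
      online)    echo 0 > "$CTL" ;;   # assumed: resume writeback
  esac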
On Thursday 13 March 2008 13:27, Rik van Riel wrote:
> On Thu, 13 Mar 2008 11:14:39 -0800
> Daniel Phillips <[email protected]> wrote:
>
> > Scream is an exaggeration, and FUD only applies to somebody who
> > consistently overlooks the primary proposition in this design: that the
> > battery backed power supply, computer hardware and Linux are reliable
> > enough to entrust your data to them.
>
> That's a reasonable enough assumption, to anyone who has never dealt
> with software before, or whose data is just not important.
>
> People who have dealt with computers for longer will know that anything
> can fail at any time, and usually does unexpectedly and at bad moments.
>
> Some defensive programming to deal with random failures could make your
> project appealing to a lot more people than it would appeal to in its
> current state.
In its current state it has bugs and so should appeal only to
programmers who like to work with cutting edge stuff.
So long as you keep insisting it has to have some kind of slow
transactional sync to disk in order to be reliable enough for
enterprise use, I have to leave you in my FUD filter. Did you
read Ric's post where he mentions the UPS in some EMC products?
Ask yourself, what is the UPS for? Then ask yourself if EMC
makes billions of dollars selling those things to enterprise
clients.
Daniel
I'd like to see some science. I'd like to know how much faster it
really is, and for that proper testing needs to be done. Since Daniel's
scheme uses the same amount of RAM as disk, an appropriate test would be
to pin (at least) that amount of RAM to buffer cache, and then to fill
the cache with the contents of the disk (i.e. cat /dev/disk >
/dev/null.) This sets the stage for tests, which tests should not
include the sync operation. I'd like to see actual numbers against such
a setup versus Daniel's scheme. Since buffer cache is shared by all
disks, obviously the test must not access any other drive.
One thing I will admit: RAM disks are fast. What I don't know is how
much work there is to access blocks that are already in the buffer
cache. In principle I suppose it should be a little slower, but not
much. I'd like to know, though. I'd do the test myself if I had a
machine with enough RAM, but I don't. Daniel (apparently) does...
On Thursday 13 March 2008 22:22, David Newall wrote:
> I'd like to see some science. I'd like to know how much faster it
> really is, and for that proper testing needs to be done. Since Daniel's
> scheme uses the same amount of RAM as disk, an appropriate test would be
> to pin (at least) that amount of RAM to buffer cache, and then to fill
> the cache with the contents of the disk (i.e. cat /dev/disk >
> /dev/null.) This sets the stage for tests, which tests should not
> include the sync operation. I'd like to see actual numbers against such
> a setup versus Daniel's scheme. Since buffer cache is shared by all
> disks, obviously the test must not access any other drive.
There is a correctable flaw in your experiment: loading the disk into
buffer cache does not make the cached data available to the page
cache. Maybe it should (good summer project there for somebody) but
for now you need to tar the filesystem to /dev/null or similar. Note
that, because of poor cross-directory readahead, traversing a disk
like that will not be as fast as reading it linearly. On the other
hand, you will not have to read any free space into cache, which
ramback does because it does not know what is free space (or care,
really...)
Anyway, your investigative attitude is worth gold :-)
> One thing I will admit: RAM disks are fast. What I don't know is how
> much work there is to access blocks that are already in the buffer
> cache. In principle I suppose it should be a little slower, but not
> much. I'd like to know, though. I'd do the test myself if I had a
> machine with enough RAM, but I don't. Daniel (apparently) does...
You are probably OK. I used a 150 MB ramdisk, of which I used only
100 MB. That is why I used a 2.2 kernel tree for my tests.
Daniel
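A sketch of the comparison being proposed here, with paths and the
filesystem under test chosen by me for illustration (the 2.2.26 tarball
is the one used for the timing quoted earlier in the thread):

  # warm the page cache through the filesystem, not the raw device
  tar cf - /mnt/testfs > /dev/null
  # time the workload against the warm cache, without sync
  time tar xf linux-2.2.26.tar -C /mnt/testfs
  # compare against the same untar on a ramback-backed filesystem,
  # this time including the sync, since ramback claims durability
  time sh -c 'tar xf linux-2.2.26.tar -C /mnt/ramback && sync'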
Hi!
> > Everyone who has write cache turned on for their hard drives is
> > running in a mode similar to ramback anyway (except for when the file
> > system is set to force writes to the platter, but that is rare).
> > Admittedly, software crashes rarely cause the write cache to be lost,
> > but hardware failures do, practically every time.
>
> On the contrary - the hard disk cache is managed by the barrier logic in
> the kernel, and the ordering even on failures is fairly predictable.
Well, not all modern hdds even support barriers... It would be really
nice to have _safe_ settings by default here...
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Pavel Machek wrote:
> Hi!
>
>>> Everyone who has write cache turned on for their hard drives is
>>> running in a mode similar to ramback anyway (except for when the file
>>> system is set to force writes to the platter, but that is rare).
>>> Admittedly, software crashes rarely cause the write cache to be lost,
>>> but hardware failures do, practically every time.
>> On the contrary - the hard disk cache is managed by the barrier logic in
>> the kernel, and the ordering even on failures is fairly predictable.
>
> Well, not all modern hdds even support barriers... It would be really
> nice to have _safe_ settings by default here...
>
The only really safe default is to disable the write cache by default or
possibly dynamically disable the write cache when barriers are not
supported by a drive. Both have a severe performance impact and I am not
sure that for most casual users it is a good trade.
ric
Ric Wheeler <[email protected]> writes:
> The only really safe default is to disable the write cache by default
> or possibly dynamically disable the write cache when barriers are not
> supported by a drive. Both have a severe performance impact and I am
> not sure that for most casual users it is a good trade.
So people ARE running their disks in a mode similar to Ramback.
/Benny
Benny Amorsen wrote:
> Ric Wheeler <[email protected]> writes:
>
>> The only really safe default is to disable the write cache by default
>> or possibly dynamically disable the write cache when barriers are not
>> supported by a drive. Both have a severe performance impact and I am
>> not sure that for most casual users it is a good trade.
>
> So people ARE running their disks in a mode similar to Ramback.
>
>
> /Benny
>
We have been looking at write performance with RAM disk, battery backed
Clariion array & slow laptop drives in another thread on fs-devel, but
the rough numbers should be interesting.
If you are not doing an fsync() at the end of writing a file, you are
writing to the page cache (as long as it fits in DRAM) so you are
basically getting thousands of small files/sec.
We did a test which showed the following for synchronous (fsync())
writers of small files with a SLES10/SP1 kernel but the results still
hold for upstream kernels (at least for the order of magnitude).
Ramdisk-backed testing showed over 4600 small 4k files/sec with 1
thread.
Midrange array (looks like a ramdisk behind a fibre channel port) hit
around 778 files/sec with 1 thread.
With a local disk, write cache enabled and barriers on, you are getting
around 47 4k files/sec.
The tests were run on ext3, different file systems perform differently
but all fall in the same order of magnitude of performance with the same
class of storage behind it ;-)
ric
On Fri, Mar 14, 2008 at 12:41:31PM +0100, Benny Amorsen wrote:
> Ric Wheeler <[email protected]> writes:
>
> > The only really safe default is to disable the write cache by default
> > or possibly dynamically disable the write cache when barriers are not
> > supported by a drive. Both have a severe performance impact and I am
> > not sure that for most casual users it is a good trade.
>
> So people ARE running their disks in a mode similar to Ramback.
Similar, but not as aggressive. Remember, the size of the write cache
on the hard drive is relatively small (small number of megabytes), and
the drive generally is relatively aggressive about getting the data
out to the platters; it's probably not going to sit on unwritten data
for minutes or hours at a time, let alone days. Of
course, unless you use write barriers or some kind of explicit write
ordering, it's going to write stuff out in an order which is
convenient to the hard drive, not necessarily an order convenient to
the filesystem.
Also, if the system crashes, you don't lose the data in the hard drive's
write cache, whereas the data in Ramback is likely gone. And Ramback
is apparently keeping potentially several gigabytes dirty in memory
and *not* writing it out very aggressively. So the exposure is one of
degree.
In practice, it's interesting that we've had so few people reporting
massive data loss despite the lack of the use of write barriers.
Sure, in absolutely critical situations, it's not a good thing; but if
I had a mail server, where I really wanted to make sure I didn't lose
any e-mail, having a small UPS which could keep the server going for
just a few minutes so it could do a controlled shutdown on a power
failure is probably a better engineering solution from a
cost/benefit/performance point of view, compared to turning on write
barriers and taking up to two orders of magnitude worth of performance
hit.
- Ted
>>>>> "Daniel" == Daniel Phillips <[email protected]> writes:
Daniel> On Thursday 13 March 2008 13:27, Rik van Riel wrote:
>> On Thu, 13 Mar 2008 11:14:39 -0800
>> Daniel Phillips <[email protected]> wrote:
>>
>> > Scream is an exaggeration, and FUD only applies to somebody who
>> > consistently overlooks the primary proposition in this design: that the
>> > battery backed power supply, computer hardware and Linux are reliable
>> > enough to entrust your data to them.
>>
>> That's a reasonable enough assumption, to anyone who has never dealt
>> with software before, or whose data is just not important.
>>
>> People who have dealt with computers for longer will know that anything
>> can fail at any time, and usually does unexpectedly and at bad moments.
>>
>> Some defensive programming to deal with random failures could make your
>> project appealing to a lot more people than it would appeal to in its
>> current state.
Daniel> In its current state it has bugs and so should appeal only to
Daniel> programmers who like to work with cutting edge stuff.
As a professional SysAdmin and long-time lurker, I feel I can chime in here
a bit. No one is arguing that your code isn't neat, or that it doesn't have
a feature which would be nice to have. They are arguing that your failure mode
(when, not if, it fails for some reason) is horrible.
Who remembers the Prestoserve NFS accelerator cards? You could
buy this PCI (or was it TurboChannel back then?) card for your DEC
Alphas. It came with 4MB of battery backed RAM so that NFS writes
could be ack'd before being written to disk. We had just completed
moving all the user home directories to this system that week, say
around 4GB of data? Remember, this was around '94 sometime at a
University. We were also using AdvFS on DEC OSF/1, probably v1.2,
maybe v1.3.
Anyway, I came into work thursday night to pickup something I had
forgotten before I took a three day weekend. The operator on duty
asked me to look at the server since it had crashed and wasn't coming
up properly.
I ended up staying there until 9am the next morning working on it.
Turned out to be both user and hardware error. We had forgotten to
remove the piece of plastic to enable the battery on the card, but the
circuits on the card lied and said the battery voltage was fine no matter
what state the battery was really in.
So the system crashed. 4MB of data from the filesystem went bye-bye.
Can you say oops? What a total pain to diagnose. But even on a log
structured filesystem, having 4MB of data just get wiped out was
enough to destroy the whole filesystem.
We ended up rolling back to the original server and junking the week
of changes that users had made, and restoring chunks for users as they
requested it. Luckily, it was early in the semester and not a lot of
stuff had gotten done yet.
Now do you see why people are a bit hesitant about this software and its
usage model? While you might get great performance numbers, just one
crash and the need to restore data from tape (or I'll even give you
that you're doing D2D backups) or other media will destroy your uptime
and overall performance.
Daniel> So long as you keep insisting it has to have some kind of slow
Daniel> transactional sync to disk in order to be reliable enough for
Daniel> enterprise use, I have to leave you in my FUD filter. Did you
Daniel> read Ric's post where he mentions the UPS in some EMC
Daniel> products? Ask yourself, what is the UPS for? Then ask
Daniel> yourself if EMC makes billions of dollars selling those things
Daniel> to enterprise clients.
You cannot compare the design of Ramback to an EMC solution because
you depend on Joe User's random UPS being properly sized, configured,
maintained, etc.
EMC does all that integration work themselves. And they size the UPS
to *only* support their needs, and they *know* those needs down to a
T, so they can make a more certain statement of reliability. But I
bet even they have been burned.
Think belt and suspenders. Be paranoid.
Now, if you could wire up RamBack to work with an addon PCI(*) NVRAM
board of some sort, then I'd possibly be interested in running it,
because then I only have to depend on the battery on the NVRAM board
working right, and that's a simpler set of constraints to confirm.
Again, as a professional SysAdmin, I could *never* justify using
RamBack in my business because the potential downsides do NOT justify
the speedup.
Sure, saving my engineers' time by getting them faster disks, more
disk space, bigger RAM and more CPUs is justifiable in a heartbeat.
But I also pay for NetApp NFS fileservers and their reliability and
resiliency when they *do* crash. And crash they do, even though
they are not a general use OS.
Quoting some number of '9's of reliability is all fine and dandy, but my
users will be happy if my file server crashes once a month but comes
back up working in two minutes. With Ramback, if the system crashes,
how long will it take for the system to come back up and be usable?
So reliability isn't just about the components, it's about the
service. And a bunch of really short outages doesn't kill me like one
huge outage does, even if I've been getting killer filesystem
performance before and after the outage.
I admit, my work is compute bound, but when users have jobs running
for days and sometimes weeks... downtime, especially downtime with
horrible consequences, isn't acceptable. But downtime that lets them get back
up and working quickly is more acceptable.
This is the point people are trying to make Daniel, that the
consequences of a single RamBack system failure aren't trivial.
Thanks,
John
Theodore Tso wrote:
> On Fri, Mar 14, 2008 at 12:41:31PM +0100, Benny Amorsen wrote:
>> Ric Wheeler <[email protected]> writes:
>>
>>> The only really safe default is to disable the write cache by default
>>> or possibly dynamically disable the write cache when barriers are not
>>> supported by a drive. Both have a severe performance impact and I am
>>> not sure that for most casual users it is a good trade.
>> So people ARE running their disks in a mode similar to Ramback.
>
> Similar, but not as aggressive. Remember, the size of the write cache
> on the hard drive is relatively small (small number of megabytes), and
> the drive generally is relatively aggressive about getting the data
> out to the platters; it's probably not going to sit on unwritten data
> for minutes or hours at a time, let alone days. Of
> course, unless you use write barriers or some kind of explicit write
> ordering, it's going to write stuff out in an order which is
> convenient to the hard drive, not necessarily an order convenient to
> the filesystem.
You get 8-16MB per disk with most drives today. Different firmware will
do different things about how aggressively they push the data out to
platter.
> Also, if the system crashes, you don't lose the data in the hard drive's
> write cache, whereas the data in Ramback is likely gone. And Ramback
> is apparently keeping potentially several gigabytes dirty in memory
> and *not* writing it out very aggressively. So the exposure is one of
> degree.
>
> In practice, it's interesting that we've had so few people reporting
> massive data loss despite the lack of the use of write barriers.
> Sure, in absolutely critical situations, it's not a good thing; but if
> I had a mail server, where I really wanted to make sure I didn't lose
> any e-mail, having a small UPS which could keep the server going for
> just a few minutes so it could do a controlled shutdown on a power
> failure is probably a better engineering solution from a
> cost/benefit/performance point of view, compared to turning on write
> barriers and taking up to two orders of magnitude worth of performance
> hit.
>
> - Ted
Most people don't see power outages too often - maybe once a year? When
you travel with a laptop, you are always effectively on a UPS, so that
will also tend to mask this issue.
The ingest rate at the time of a power hit makes a huge difference as
well - basically, pulling the power cord when a box is idle is normally
not harmful. Try that when you are really pounding on the disks and you
will see corruptions a plenty without barriers ;-)
One note - the barrier hit for apps that use fsync() is just half an
order of magnitude (say 35 files/sec instead of 120 files/sec). If you
don't fsync() each file, the impact is lower still.
Still expensive, but might be reasonable for home users on a box with
family photos, etc.
ric
On Fri, Mar 14, 2008 at 11:47:04AM -0400, Ric Wheeler wrote:
> The ingest rate at the time of a power hit makes a huge difference as well
> - basically, pulling the power cord when a box is idle is normally not
> harmful. Try that when you are really pounding on the disks and you will
> see corruptions a plenty without barriers ;-)
Oh, no question. But the fact that it mostly works when the box is
idle means the hard drive firmware is reasonably aggressive about
pushing data from the write cache out to the platters when it can.
> One note - the barrier hit for apps that use fsync() is just half an order
> of magnitude (say 35 files/sec instead of 120 files/sec). If you don't
> fsync() each file, the impact is lower still.
>
> Still expensive, but might be reasonable for home users on a box with
> family photos, etc.
It depends on the workload, obviously. I thought I remembered someone
on this thread talking about a benchmark where they went from ~2000 to
~20 ops/sec once they added fsync(). I'm sure that was an extreme
benchmarking workload that isn't at all representative of real-life
usage, where you're usually doing something other than modifying the
metadata of many tiny files over and over again. :-)
It's also the case that a home user's fileserver is generally
quiescent, which is probably why we aren't hearing lots of stories
about home NAS servers (which I bet probably don't enable write
barriers) trashing vast amounts of user data.....
- Ted
Theodore Tso wrote:
> On Fri, Mar 14, 2008 at 11:47:04AM -0400, Ric Wheeler wrote:
>> The ingest rate at the time of a power hit makes a huge difference as well
>> - basically, pulling the power cord when a box is idle is normally not
>> harmful. Try that when you are really pounding on the disks and you will
>> see corruptions a plenty without barriers ;-)
>
> Oh, no question. But the fact that it mostly works when the box is
> idle means the hard drive firmware is reasonably aggressive about
> pushing data from the write cache out to the platters when it can.
>
>> One note - the barrier hit for apps that use fsync() is just half an order
>> of magnitude (say 35 files/sec instead of 120 files/sec). If you don't
>> fsync() each file, the impact is lower still.
>>
>> Still expensive, but might be reasonable for home users on a box with
>> family photos, etc.
>
> It depends on the workload, obviously. I thought I remembered someone
> on this thread talking about a benchmark where they went from ~2000 to
> ~20 ops/sec once they added fsync(). I'm sure that was an extreme
> benchmarking workload that isn't at all representative of real-life
> usage, where you're usually doing something other than modifying the
> metadata of many tiny files over and over again. :-)
I think those were the numbers comparing a ramdisk, s-ata drive and a
clariion all doing barriers ;-)
I just reran some quick tests on a home box with a s-ata drive writing
50k files single threaded:
barriers off & fsync: 133 files/sec
barriers off & no fsync: 2306 files/sec
barriers on & no fsync: 2312 files/sec
barriers on & fsync: 22 files/sec
So no slowdown without fsync & a 5x slowdown when you fsync every write.
Doing the fsync is the only way to make (mostly) sure that all data is
on the platter, but you can write the files in a batch and then go back and
reopen/fsync/close all files afterwards.
That helps a lot:
barriers on & bulk fsync (in order written) : 218 files/sec
barriers on & bulk fsync (reverse order written) : 340 files/sec
All of this was measured with my fs_mark tool (it is also on
sourceforge) with variations of the following:
fs_mark -d /home/ric/test -n 10000 -D 20 -N 500
(-S 0 no fsync, -S 1 fsync per file as written, -S 2 bulk reverse order,
-S 3 bulk in order).
> It's also the case that a home user's fileserver is generally
> quiescent, which is probably why we aren't hearing lots of stories
> about home NAS servers (which I bet probably don't enable write
> barriers) trashing vast amounts of user data.....
>
> - Ted
A lot of NAS boxes and storage boxes in general disable all write cache
on drives just to be safe. It would be interesting to benchmark the nfs
server with and without barriers from a client ;-)
ric
Daniel Phillips <[email protected]> writes:
> On Thursday 13 March 2008 12:50, David Newall wrote:
>> Daniel Phillips wrote:
>> > The period where you cannot access the data is downtime. If your script
>> > just does a cp from a disk array to the ram device you cannot just read
>> > from the backing store in that period because you will need to fail over
>> > to the ramdisk at some point, and you cannot just read from the ramdisk
>> > because it is not populated yet.
>>
>> Wouldn't a raid-1 set comprising disk + ramdisk do that with no downtime?
>
> In raid1, write completion has to wait for write completion on all
> mirror members, so writes run at disk speed. Reads run at ramdisk
> speed, so your proposal sounds useful, but ramback aims for high
> write performance as well.
Ramback could be an interesting building block. Consider using a
couple of systems exporting Ramback devices via Evgeniy's distributed
storage target (or something similar). In this case, you can have as
many Ramback devices as you want in your mirror set to meet your
availability requirements. Perhaps people are looking at this too
much as an entire solution as opposed to a piece of a bigger puzzle.
I think the idea has merit.
Cheers,
Jeff
On Fri, 14 Mar 2008, Theodore Tso wrote:
> It depends on the workload, obviously. I thought I remembered someone
> on this thread talking about a benchmark where they went from ~2000 to
> ~20 ops/sec once they added fsync(). I'm sure that was an extreme
> benchmarking workload that isn't at all representative of real-life
> usage, where you're usually doing something other than modifying the
> metadata of many tiny files over and over again. :-)
I've seen this sort of thing with syslog, normal syslog at ~100 logs/sec,
syslog without fsync >10,000 logs/sec.
if this is your situation then battery backed cache on your controller is
the answer as it gives you almost full speed with the safety of fsync.
David Lang
Hi!
> The ingest rate at the time of a power hit makes a huge
> difference as well - basically, pulling the power cord
> when a box is idle is normally not harmful. Try that
> when you are really pounding on the disks and you will
> see corruptions a plenty without barriers ;-)
I tried that, and could not get a corruption. cp -a on big kernel
trees, on sata disk with writeback cache and no barriers... and I
could not cause fs corruption. ext3.
I'd like to demo danger of writeback cache. What should I do?
On Fri, Mar 14, 2008 at 08:03:57PM +0100, Pavel Machek wrote:
>
> > The ingest rate at the time of a power hit makes a huge
> > difference as well - basically, pulling the power cord
> > when a box is idle is normally not harmful. Try that
> > when you are really pounding on the disks and you will
> > see corruptions a plenty without barriers ;-)
>
> I tried that, and could not get a corruption. cp -a on big kernel
> trees, on sata disk with writeback cache and no barriers... and I
> could not cause fs corruption. ext3.
>
> I'd like to demo danger of writeback cache. What should I do?
Ext3's journal probably hides a huge number of problems. I'd try
something with a lot more parallel modifications to metadata. Say
postmark with a large number of threads. It would be interesting
actually to get some controlled results of exactly how busy a
filesystem has to be before you get filesystem corruption (which I
would check explicitly by running "e2fsck -f" after pulling the
plug on the drive).
- Ted
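One concrete way to run that kind of test, reusing the fs_mark invocation
Ric gives earlier in the thread as the metadata-heavy load (the thread
count, device name and mount point here are my own assumptions):

  # heavy parallel small-file load, no fsync, while power is cut
  fs_mark -d /mnt/test -n 10000 -D 20 -N 500 -S 0 -t 8 &
  # ... pull the plug while this runs, then after reboot:
  e2fsck -f /dev/sdXN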
On Wed 2008-03-12 22:50:55, Daniel Phillips wrote:
> On Wednesday 12 March 2008 23:30, David Newall wrote:
> > Daniel Phillips wrote:
> > >> Your idea seems predicated on throwing large amounts of RAM at the
> > >> problem. What I want to know is this: Is it really 25 times faster than
> > >> ext3 with an equally huge buffer cache?
> > >
> > > Yes.
> >
> > Well, that sounds convincing. Not. You know this how?
>
> By measuring it. time tar -xf linux-2.2.26.tar; time sync
That's cheating. Your ramback ignores sync.
Just time it against ext3 _without_ doing the sync. That's still more
reliable than what you have.
Heck, comment out sync and fsync from your kernel. You'll likely be 10
times normal speed, and still more reliable than ramback.
Pavel
Hi!
> > if you have a reliable UPS and are willing to rely on it to save your data
> > take the identical hardware to what you are planning to use, but instead
> > of using your driver just create a ramdisk and load it on boot and save
> > the contents on shutdown.
>
> Aha! You are getting close. Really, that is all ramback does. It
> just handles some very difficult related issues efficiently, in such a
> way as to minimize any denial of service from complete loss of UPS
> power. This is all just about using power management in a new way that
> gets higher performance. But your battery power has to be reliable.
> Just make it so. It is not difficult these days, or even particularly
> expensive.
>
> I calculated somewhere along the line that it would take something like
> 17 minutes to populate the big Violin ramdisk initially, and 17 minutes
> to save it during a loss of line power event, during which UPS power
> must not run out before ramback achieves disk sync or you will get
> file corruption. (This rule was mentioned in my original post.)
>
> All well and good, you can in fact do that with a pretty simple script.
> But in the initial 17 minutes your application may not read or write
> the ramdisk data and in the closing 17 minutes it may not write. That
> knocks your system down to 4 nines, given one planned shutdown per year.
> Not good, not good at all.
Hmm, what happens if applications keep dirtying so much data you miss
your 17-minute deadline?
Anyway...
ext2
+ lots of memory
+ tweaked settings of kflushd (only write data older than 10 years)
+ just not using sync/fsync except during shutdown
+ find / | xargs cat
...is ramback, right? Should have same performance, and you can still
read/write during that 17+17 minutes.
Ok, find | xargs might be slower... but we probably want to fix that
anyway....
It has a big advantage: if you only tell kflushd to hold up writes for
an hour, you lose a little in performance and gain a lot in
reliability...
(If ext2+tweaks is slower than ramback, we have a bug to fix, I'm
afraid).
Pavel
On Saturday 15 March 2008 06:32, Pavel Machek wrote:
> On Wed 2008-03-12 22:50:55, Daniel Phillips wrote:
> > On Wednesday 12 March 2008 23:30, David Newall wrote:
> > > Daniel Phillips wrote:
> > > >> Your idea seems predicated on throwing large amounts of RAM at the
> > > >> problem. What I want to know is this: Is it really 25 times faster than
> > > >> ext3 with an equally huge buffer cache?
> > > >
> > > > Yes.
> > >
> > > Well, that sounds convincing. Not. You know this how?
> >
> > By measuring it. time tar -xf linux-2.2.26.tar; time sync
>
> That's cheating. Your ramback ignores sync.
>
> Just time it against ext3 _without_ doing the sync. That's still more
> reliable than what you have.
No, that allows ext3 to cheat, because ext3 does not supply any means
of flushing its cached data to disk in response to loss of line power,
and then continuing on in a "safe" mode until line power comes back.
Fix that and you will have a replacement for ramback, arguably a more
efficient one for this specialized application (it will not work for an
external ramdisk). Until you do that, ramback is the only game in town
to get these transaction speeds together with data durability.
I have mentioned a number of times, that you _already_ rely on an
equivalent scheme to ramback if you are using a battery-backed raid
controller. Somehow, posters to this thread keep glossing over that
and going back to the sky-is-falling argument.
Daniel
On Thu 2008-03-13 12:03:03, Daniel Phillips wrote:
> On Thursday 13 March 2008 12:50, David Newall wrote:
> > Daniel Phillips wrote:
> > > The period where you cannot access the data is downtime. If your script
> > > just does a cp from a disk array to the ram device you cannot just read
> > > from the backing store in that period because you will need to fail over
> > > to the ramdisk at some point, and you cannot just read from the ramdisk
> > > because it is not populated yet.
> >
> > Wouldn't a raid-1 set comprising disk + ramdisk do that with no downtime?
>
> In raid1, write completion has to wait for write completion on all
> mirror members, so writes run at disk speed. Reads run at ramdisk
> speed, so your proposal sounds useful, but ramback aims for high
> write performance as well.
raid1 + kflushd tweak?
special raid1 mode that signals completion when it hits _one_ of the
drives, and does sync when the slower drive is idle?
Pavel
On Sat, Mar 15, 2008 at 4:26 PM, Pavel Machek <[email protected]> wrote:
> On Thu 2008-03-13 12:03:03, Daniel Phillips wrote:
> > On Thursday 13 March 2008 12:50, David Newall wrote:
> > > Daniel Phillips wrote:
> > > > The period where you cannot access the data is downtime. If your script
> > > > just does a cp from a disk array to the ram device you cannot just read
> > > > from the backing store in that period because you will need to fail over
> > > > to the ramdisk at some point, and you cannot just read from the ramdisk
> > > > because it is not populated yet.
> > >
> > > Wouldn't a raid-1 set comprising disk + ramdisk do that with no downtime?
> >
> > In raid1, write completion has to wait for write completion on all
> > mirror members, so writes run at disk speed. Reads run at ramdisk
> > speed, so your proposal sounds useful, but ramback aims for high
> > write performance as well.
>
> raid1 + kflushd tweak?
>
> special raid1 mode that signals completion when it hits _one_ of the
> drives, and does sync when the slower drive is idle?
raid1 already supports marking member(s) as write-mostly. Any
write-mostly member can also make use of write-behind mode (provided
you have a write intent bitmap).
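For anyone who wants to experiment with that combination, the setup is
roughly as follows (device names and the write-behind depth are
placeholders; write-behind requires a write-intent bitmap):

  mdadm --create /dev/md0 --level=1 --raid-devices=2 --bitmap=internal \
        --write-behind=4096 /dev/ram0 --write-mostly /dev/sdXN
  # reads are served from the ramdisk member; writes to the write-mostly
  # disk member may lag behind by up to 4096 outstanding requests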
Hi Pavel,
On Saturday 15 March 2008 13:18, Pavel Machek wrote:
> Hmm, what happens if applications keep dirtying so much data you miss
> your 17minute deadline?
Ramback is supposed to prevent that by allowing only a limited amount
of application IO during flush mode. Currently this is accomplished
by making each application write wait synchronously on the one before
it, until flushing completes. This allows only a small amount of
application traffic, something like 5% bandwidth. This solution is
admittedly crude, and over time it will be improved to look more like
a realtime scheduler, because this is in fact a realtime scheduling
problem.
Once flushing completes, application writes are still serialized and
thus slow, which is a stronger condition than necessary to maintain
transactional integrity for the filesystem. Eventually this will be
optimized.
For now, the maximum flush is only a few hundred MB on my workstation,
which leaves a huge safety margin even with my $100 UPS. And the risk,
however small, of having to run a lossy e2fsck because the battery got
old and the power did run out, is mitigated by the fact that ramback
runs on my kernel hacking partition, and everything unique there just
gets uploaded to the internet regularly anyway. This serves as my
replication algorithm. Note: I strongly recommend that any critical
data entrusted to ramback be replicated to mitigate the risk of system
failure, however small.
> Anyway...
> ext2
> + lots of memory
> + tweaked settings of kflushd (only write data older than 10 years)
> + just not using sync/fsync except during shutdown
> + find / | xargs cat
>
> ...is ramback, right? Should have same performance, and you can still
> read/write during that 17+17 minutes.
No, you are missing some essential pieces. Ramback has two operating
modes:
1) writeback (when UPS-backed line power is available)
2) writethrough (when running on UPS power)
Plus, it has the daemon-driven flushing for UPS mode, and daemon-driven
one-pass populating for startup mode. That is all ramback is, but you
do not quite get there with your solution above.
Also, ramback works with generic block devices, opening up a wide range
of applications that your proposal does not.
> Ok, find | xargs might be slower... but we probably want to fix that
> anyway....
We sure do. Readahead sucks enormously in Linux.
> It has big advantage: if you only tell kflushd to hold up writes for
> an hour, you loose a little in performance and gain a lot in
> reliability...
>
> (If ext2+tweaks is slower than ramback, we have a bug to fix, I'm
> afraid).
I hope that my work inspires other people like you to go in and work
on some of the VM/VFS/BIO brokenness that helps make ramback such a
big win. In the meantime, it is useful to be clear on just what we
have here, and why some people care about it a lot.
Daniel
On Thu, Mar 13, 2008 at 11:14:39AM -0800, Daniel Phillips wrote:
> On Thursday 13 March 2008 06:22, Alan Cox wrote:
> > ...Ext3 cannot recover well from massive loss of intermediate
> > writes. It isn't a normal failure mode and there isn't sufficient fs
> > metadata robustness for this. A log structured backing store would deal
> > with that but all you apparently want to do is scream FUD at anyone who
> > doesn't agree with you.
>
> Scream is an exaggeration, and FUD only applies to somebody who
> consistently overlooks the primary proposition in this design: that the
> battery backed power supply, computer hardware and Linux are reliable
> enough to entrust your data to them. I say this is practical, you say
> it is impossible, I say FUD.
>
> All you are proposing is that nobody can entrust their data to any
> hardware. Good point. There is no absolute reliability, only degrees
> of it.
>
> Many raid controllers now have battery backed writeback cache, which
> is exactly the same reliability proposition as ramback, on a smaller
> scale. Do you refuse to entrust your corporate data to such
> controllers?
RAID controllers do not have half a terabyte of RAM. Also, you are always
invited to choose between speed (write back) and reliability (write through).
Also, please note that the problem here is not related to the number of
nines of availability. This number only counts the ratio between uptime
and downtime. We're more facing a problem of MTBF, where the consequences
of a failure are hard to predict.
What I'm thinking about is that considering the fact that storage
technologies are moving towards SSD (and I think 2008 will be the
year of SSD), you should implement ordered writes (I've not said
write through) since there's no seek time on those devices. Thus
you will have the speed of RAM with the reliability of a properly
synced FS. If your system crashes once a week, it will not be a
problem anymore.
Willy
On Saturday 15 March 2008 13:26, Pavel Machek wrote:
> On Thu 2008-03-13 12:03:03, Daniel Phillips wrote:
> > On Thursday 13 March 2008 12:50, David Newall wrote:
> > > Daniel Phillips wrote:
> > > > The period where you cannot access the data is downtime. If your script
> > > > just does a cp from a disk array to the ram device you cannot just read
> > > > from the backing store in that period because you will need to fail over
> > > > to the ramdisk at some point, and you cannot just read from the ramdisk
> > > > because it is not populated yet.
> > >
> > > Wouldn't a raid-1 set comprising disk + ramdisk do that with no downtime?
> >
> > In raid1, write completion has to wait for write completion on all
> > mirror members, so writes run at disk speed. Reads run at ramdisk
> > speed, so your proposal sounds useful, but ramback aims for high
> > write performance as well.
>
> raid1 + kflushd tweak?
> special raid1 mode that signals completion when it hits _one_ of the
> drives, and does sync when the slower drive is idle?
Feel free :-)
This is very close to how ramback already works. One subtlety is that
ramback does not write twice from the same application data source,
which could allow the data on the backing device to differ from the
ramdisk if the user changes it during the write. I don't know how
important it is to protect against this bug actually, but there you
have it. Ramback can easily be changed to write twice from the same
source just like a raid1 (in fact it originally was that way) which
would make it even more like raid1.
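To make the difference concrete, here is a rough sketch (helper names
invented for illustration, not the actual driver calls):

#include <stddef.h>

#define CHUNK_SIZE 4096
enum member { RAMDISK, BACKING };

/* stand-ins for the real member io paths */
static void write_member(enum member m, const void *buf, size_t len, long pos)
{ (void)m; (void)buf; (void)len; (void)pos; }
static void read_member(enum member m, void *buf, size_t len, long pos)
{ (void)m; (void)buf; (void)len; (void)pos; }

/* raid1 style: both copies come from the application buffer, so a racy
   rewrite of that buffer during the io can leave the members different */
static void write_like_raid1(const void *app_buf, size_t len, long pos)
{
        write_member(RAMDISK, app_buf, len, pos);
        write_member(BACKING, app_buf, len, pos);
}

/* ramback style: the backing store is fed from the ramdisk contents,
   never a second time from the application buffer (len <= CHUNK_SIZE) */
static void write_like_ramback(const void *app_buf, size_t len, long pos)
{
        char copy[CHUNK_SIZE];

        write_member(RAMDISK, app_buf, len, pos);
        read_member(RAMDISK, copy, len, pos);
        write_member(BACKING, copy, len, pos);
}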
Adding ramback-like functionality to raid1 would be a nice
contribution. I would fully support that but I do not have time to do
it myself.
Daniel
On Saturday 15 March 2008 13:59, Willy Tarreau wrote:
> On Thu, Mar 13, 2008 at 11:14:39AM -0800, Daniel Phillips wrote:
> > On Thursday 13 March 2008 06:22, Alan Cox wrote:
> > > ...Ext3 cannot recover well from massive loss of intermediate
> > > writes. It isn't a normal failure mode and there isn't sufficient fs
> > > metadata robustness for this. A log structured backing store would deal
> > > with that but all you apparently want to do is scream FUD at anyone who
> > > doesn't agree with you.
> >
> > Scream is an exaggeration, and FUD only applies to somebody who
> > consistently overlooks the primary proposition in this design: that the
> > battery backed power supply, computer hardware and Linux are reliable
> > enough to entrust your data to them. I say this is practical, you say
> > it is impossible, I say FUD.
> >
> > All you are proposing is that nobody can entrust their data to any
> > hardware. Good point. There is no absolute reliability, only degrees
> > of it.
> >
> > Many raid controllers now have battery backed writeback cache, which
> > is exactly the same reliability proposition as ramback, on a smaller
> > scale. Do you refuse to entrust your corporate data to such
> > controllers?
>
> RAID controllers do not have half a terabyte of RAM.
And? Either you have battery backed ram with critical data in it or
you do not. Exactly how much makes little difference to the question.
> Also, you are always
> invited to choose between speed (write back) and reliability (write through).
As is the case with ramback. Just echo 1 >/proc/driver/ramback/<name>.
> Also, please note that the problem here is not related to the number of
> nines of availability. This number only counts the ratio between uptime
> and downtime. We're more facing a problem of MTBF, where the consequences
> of a failure are hard to predict.
That is why I keep recommending that a ramback setup be replicated or
mirrored, which people in this thread keep glossing over. When
replicated or mirrored, you still get the microsecond-level transaction
times, and you get the safety too.
Then there is a big class of applications where the data on the ramdisk
can be reconstructed, it is just a pain and reduces uptime. These are
potential ramback users, and in fact I will be one of those, using it
on my kernel hacking partition.
> What I'm thinking about is that considering the fact that storage
> technologies are moving towards SSD (and I think 2008 will be the
> year of SSD), you should implement ordered writes (I've not said
> write through) since there's no seek time on those devices. Thus
> you will have the speed of RAM with the reliability of a properly
> synced FS. If your system crashes once a week, it will not be a
> problem anymore.
There will be a whole bunch of patches from me that are SSD oriented,
over time. The fact is, enterprise scale ramdisks are here now, while
enterprise scale flash is not. Getting close, but not here. And flash
does not approach the write performance of RAM, not now and probably
not ever.
Daniel
> RAID controllers do not have half a terabyte of RAM. Also, you are always
> invited to choose between speed (write back) and reliability (write through).
The write back ones are also battery backed properly, and will switch
to write through (flushing out the cache) on the first sniff of a low
battery signal.
The decent ones (the kind used in serious business) also let you swap the
battery backed RAM module to another card in the event of a failure of a
card so you can complete recovery.
> > RAID controllers do not have half a terabyte of RAM.
>
> And? Either you have battery backed ram with critical data in it or
> you do not. Exactly how much makes little difference to the question.
It makes a lot of difference, and in addition raid controllers (good
ones) respect barrier ordering in their RAM cache so they'll take tags or
similar interfaces and honour them.
> That is why I keep recommending that a ramback setup be replicated or
> mirrored, which people in this thread keep glossing over. When
> replicated or mirrored, you still get the microsecond-level transaction
> times, and you get the safety too.
Either you keep a mirror in sync and get normal data rates or you keep
the mirror out of sync and then you need to sort your writeback process
out to preserve ordering.
If you want ramback to be taken seriously then that is the interesting
problem to solve and clearly has multiple solutions if you would start to
take an objective look at your work.
On Saturday 15 March 2008 13:56, Alan Cox wrote:
> > RAID controllers do not have half a terabyte of RAM. Also, you are always
> > invited to choose between speed (write back) and reliability (write through).
>
> The write back ones are also battery backed properly, and will switch
> to write through (flushing out the cache) on the first sniff of a low
> battery signal.
In other words, exactly how ramback works.
> The decent ones (the kind used in serious business) also let you swap the
> battery backed RAM module to another card in the event of a failure of a
> card so you can complete recovery.
Right, just like the Violin 1010, whose PCI-e cable can be hotplugged
into a different server. Or plugged into two servers at the same time,
because each 1010 has two PCI-e interfaces, so this can be done without
manual intervention.
See, we really are talking about the same thing. Except that ramback
does it bigger and faster.
Daniel
On Sat, 15 Mar 2008 13:25:48 -0800
Daniel Phillips <[email protected]> wrote:
> On Saturday 15 March 2008 13:56, Alan Cox wrote:
> > > RAID controllers do not have half a terabyte of RAM. Also, you are always
> > > invited to choose between speed (write back) and reliability (write through).
> >
> > The write back ones are also battery backed properly, and will switch
> > to write through (flushing out the cache) on the first sniff of a low
> > battery signal.
>
> In other words, exactly how ramback works.
No because you don't honour the ordering and tag boundaries as they do.
Alan
On Sat 2008-03-15 12:22:47, Daniel Phillips wrote:
> On Saturday 15 March 2008 06:32, Pavel Machek wrote:
> > On Wed 2008-03-12 22:50:55, Daniel Phillips wrote:
> > > On Wednesday 12 March 2008 23:30, David Newall wrote:
> > > > Daniel Phillips wrote:
> > > > >> Your idea seems predicated on throwing large amounts of RAM at the
> > > > >> problem. What I want to know is this: Is it really 25 times faster than
> > > > >> ext3 with an equally huge buffer cache?
> > > > >
> > > > > Yes.
> > > >
> > > > Well, that sounds convincing. Not. You know this how?
> > >
> > > By measuring it. time untar -xf linux-2.2.26.tar; time sync
> >
> > Thats cheating. Your ramback ignores sync.
> >
> > Just time it against ext3 _without_ doing the sync. That's still more
> > reliable than what you have.
>
> No, that allows ext3 to cheat, because ext3 does not supply any means
> of flushing its cached data to disk in response to loss of line power,
> and then continuing on in a "safe" mode until line power comes back.
Ok, it seems like "ignore sync/fsync unless on UPS power" is what you
really want? That should be easy enough to implement, either in
the kernel or as an LD_PRELOAD hack.
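A minimal sketch of the LD_PRELOAD variant could be as small as this
(the ON_UPS switch is just an illustrative name):

/* nosync.c: build with  gcc -shared -fPIC -o nosync.so nosync.c -ldl
   then run e.g.  LD_PRELOAD=./nosync.so tar xf linux-2.2.26.tar
   Set ON_UPS in the environment when running on battery. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>
#include <unistd.h>

int fsync(int fd)
{
        if (getenv("ON_UPS")) {                 /* on battery: really sync */
                int (*real_fsync)(int) = dlsym(RTLD_NEXT, "fsync");
                return real_fsync(fd);
        }
        return 0;                               /* on line power: lie */
}

void sync(void)
{
        if (getenv("ON_UPS")) {
                void (*real_sync)(void) = dlsym(RTLD_NEXT, "sync");
                real_sync();
        }
}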
So... untar with sync is fair benchmark against ramback on UPS power
and untar without sync is fair benchmark against ramback on AC power.
But you did untar with sync against ramback on AC power.
That's wrong.
Pavel
On Saturday 15 March 2008 14:33, Pavel Machek wrote:
> On Sat 2008-03-15 12:22:47, Daniel Phillips wrote:
> > On Saturday 15 March 2008 06:32, Pavel Machek wrote:
> > > On Wed 2008-03-12 22:50:55, Daniel Phillips wrote:
> > > > On Wednesday 12 March 2008 23:30, David Newall wrote:
> > > > > Daniel Phillips wrote:
> > > > > >> Your idea seems predicated on throwing large amounts of RAM at the
> > > > > >> problem. What I want to know is this: Is it really 25 times faster than
> > > > > >> ext3 with an equally huge buffer cache?
> > > > > >
> > > > > > Yes.
> > > > >
> > > > > Well, that sounds convincing. Not. You know this how?
> > > >
> > > > By measuring it. time untar -xf linux-2.2.26.tar; time sync
> > >
> > > Thats cheating. Your ramback ignores sync.
> > >
> > > Just time it against ext3 _without_ doing the sync. That's still more
> > > reliable than what you have.
> >
> > No, that allows ext3 to cheat, because ext3 does not supply any means
> > of flushing its cached data to disk in response to loss of line power,
> > and then continuing on in a "safe" mode until line power comes back.
>
> Ok, it seems like "ignore sync/fsync unless on UPS power" is what you
> really want? That should be easy enough to implement, either in
> the kernel or as an LD_PRELOAD hack.
Sure, let's try it and then we will have a race. I would be happy to
lose that race, but... let's just see who wins.
> So... untar with sync is fair benchmark against ramback on UPS power
> and untar without sync is fair benchmark against ramback on AC power.
>
> But you did untar with sync against ramback on AC power.
>
> That's wrong.
It is consistent and correct. You need to supply the missing features
that ramback supplies before you have a filesystem-level solution. I
really encourage you to try it, then we can compare the two approaches
with both of them fully working.
Daniel
On Saturday 15 March 2008 14:08, Alan Cox wrote:
> On Sat, 15 Mar 2008 13:25:48 -0800
> Daniel Phillips <[email protected]> wrote:
> > On Saturday 15 March 2008 13:56, Alan Cox wrote:
> > > > RAID controllers do not have half a terabyte of RAM. Also, you are always
> > > > invited to choose between speed (write back) and reliability (write through).
> > >
> > > The write back ones are also battery backed properly, and will switch
> > > to write through (flushing out the cache) on the first sniff of a low
> > > battery signal.
> >
> > In other words, exactly how ramback works.
>
> No because you don't honour the ordering and tag boundaries as they do.
Sophism. The statement was "battery backed properly" and "switch on
first sniff", which is an example of how ramback works.
Daniel
On Sat, Mar 15, 2008 at 01:17:13PM -0800, Daniel Phillips wrote:
> On Saturday 15 March 2008 13:59, Willy Tarreau wrote:
> > On Thu, Mar 13, 2008 at 11:14:39AM -0800, Daniel Phillips wrote:
> > > On Thursday 13 March 2008 06:22, Alan Cox wrote:
> > > > ...Ext3 cannot recover well from massive loss of intermediate
> > > > writes. It isn't a normal failure mode and there isn't sufficient fs
> > > > metadata robustness for this. A log structured backing store would deal
> > > > with that but all you apparently want to do is scream FUD at anyone who
> > > > doesn't agree with you.
> > >
> > > Scream is an exaggeration, and FUD only applies to somebody who
> > > consistently overlooks the primary proposition in this design: that the
> > > battery backed power supply, computer hardware and Linux are reliable
> > > enough to entrust your data to them. I say this is practical, you say
> > > it is impossible, I say FUD.
> > >
> > > All you are proposing is that nobody can entrust their data to any
> > > hardware. Good point. There is no absolute reliability, only degrees
> > > of it.
> > >
> > > Many raid controllers now have battery backed writeback cache, which
> > > is exactly the same reliability proposition as ramback, on a smaller
> > > scale. Do you refuse to entrust your corporate data to such
> > > controllers?
> >
> > RAID controllers do not have half a terabyte of RAM.
>
> And? Either you have battery backed ram with critical data in it or
> you do not. Exactly how much makes little difference to the question.
It completely changes the way it is powered and the time the data may
remain in RAM. The Smart 3200 I have right here simply has lithium
batteries directly connected to the static RAM chips. Very low risk of
power failure. The way you presented your work shows it relies on a UPS
to sustain the PC's power supply, which in turn keeps the PC alive,
which in turn tries not to reboot to keep its RAM consistent. There are
a lot of ways to get a failure here.
Don't get me wrong, I still think your project has a lot of uses. But
you have to admit that there are huge differences between using it in
an appliance with battery-backed RAM which is able to recover data after
a system crash, power outage or anything, and the average Joe's PC setup
as an NFS server for the company with a cheap UPS to try not to lose the
data should a power outage occur.
I think it could get major adoption with ordered writes.
> > Also, you are always
> > invited to choose between speed (write back) and reliability (write through).
>
> As is the case with ramback. Just echo 1 >/proc/driver/ramback/<name>.
>
> > Also, please note that the problem here is not related to the number of
> > nines of availability. This number only counts the ratio between uptime
> > and downtime. We're more facing a problem of MTBF, where the consequences
> > of a failure are hard to predict.
>
> That is why I keep recommending that a ramback setup be replicated or
> mirrored, which people in this thread keep glossing over. When
> replicated or mirrored, you still get the microsecond-level transaction
> times, and you get the safety too.
I agree, but in this case, you should present it this way. You have been
insisting too much on the average PC's reliability, the fact that no kernel
ever crashed for you, etc... So you are demonstrating that your product is
good provided that everything goes perfectly. All people who have experienced
software or hardware problems in the past (ie mostly everyone here) will not
trust your code because it relies on pre-requisites they know they do not
have.
> Then there is a big class of applications where the data on the ramdisk
> can be reconstructed, it is just a pain and reduces uptime. These are
> potential ramback users, and in fact I will be one of those, using it
> on my kernel hacking partition.
>
> > What I'm thinking about is that considering the fact that storage
> > technologies are moving towards SSD (and I think 2008 will be the
> > year of SSD), you should implement ordered writes (I've not said
> > write through) since there's no seek time on those devices. Thus
> > you will have the speed of RAM with the reliability of a properly
> > synced FS. If your system crashes once a week, it will not be a
> > problem anymore.
>
> There will be a whole bunch of patches from me that are SSD oriented,
> over time. The fact is, enterprise scale ramdisks are here now, while
> enterprise scale flash is not. Getting close, but not here. And flash
> does not approach the write performance of RAM, not now and probably
> not ever.
My goal is not to replace RAM with flash, but disk with flash. You are
against ordered writes for a performance reason. Use SSD instead of
hard drives and it will be as fast as sequential writes. Also, when
you say that enterprise scale flash is not there, I don't agree. You
can already afford hundreds of gigs of flash in 3.5" form factor. A
1.6 TB SSD has even been presented at CES2008, with sales announced
for Q3. So clearly this will replace your hard drives soon, very soon.
Even if it costs $5k, that's a very acceptable solution to replace a
disk in a RAM-speed appliance.
Willy
On Saturday 15 March 2008 14:03, Alan Cox wrote:
> > > RAID controllers do not have half a terabyte of RAM.
> >
> > And? Either you have battery backed ram with critical data in it or
> > you do not. Exactly how much makes little difference to the question.
>
> It makes a lot of difference,
It makes a difference of degree, not of kind.
> and in addition raid controllers (good
> ones) respect barrier ordering in their RAM cache so they'll take tags or
> similar interfaces and honour them.
Ramback should obviously respect barriers, and it does, though at
present only in the crude, default way of letting the block layer
handle it.
But interpreting a barrier to mean flush through to rotating media...
performance will drop to the millisecond per transaction zone, like a
normal disk. Not what ramback users want in normal operating mode.
Flush mode, yes.
Even raid controllers... so you agree that some of them just don't
respond conservatively to tagged commands, either because the engineers
don't know how to implement that (unlikely) or because they want to win
the performance benchmarks, and they do trust their battery?
"Some raid controllers" is just as good for my argument as "all raid
controllers". Nobody is telling you which raid controller to use in
your own personal system. I will pick the fast one and you can pick
the slow one that does not trust its own battery circuits.
> > That is why I keep recommending that a ramback setup be replicated or
> > mirrored, which people in this thread keep glossing over. When
> > replicated or mirrored, you still get the microsecond-level transaction
> > times, and you get the safety too.
>
> Either you keep a mirror in sync and get normal data rates or you keep
> the mirror out of sync and then you need to sort your writeback process
> out to preserve ordering.
>
> If you want ramback to be taken seriously then that is the interesting
> problem to solve and clearly has multiple solutions if you would start to
> take an objective look at your work.
Ramback already is taken seriously, just not by you. That is fine, you
apparently do not need or want the speed.
Anyway, please do not get the impression that I am ignoring your ideas.
There are some nice intermediate modes that ramback could, and in my
opinion should, implement to give users more options on how to trade
off performance against resilience. I just need to make it clear that
ramback, as conceived, already gives system builders the capability
they need to achieve microsecond level transaction throughput and data
safety at the same time... given a reliable battery, which is where we
started.
Daniel
On Saturday 15 March 2008 14:54, Willy Tarreau wrote:
> I think it could get major adoption with ordered writes.
It already has ordered write when it is in flush mode.
OK, I hear you. There will be an ordered write mode that uses barriers
to decide the ordering. It will greatly reduce the speed at which
ramback can flush dirty data because of the need to wait synchronously
on every barrier, of which there are many. And thus will widen out the
window during which UPS power must remain available if power goes out,
in order to get all acknowledged transactions on to stable media. The
advantage is, the stable media always has a point-in-time version of
the filesystem.
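Roughly, the flush daemon in that mode would have to drain barrier
generations strictly in order, something like this sketch (types and
helpers invented for illustration):

/* Sketch only: every application barrier bumps a generation counter and
   the flush daemon completely drains (and waits on) one generation
   before it may touch the next. */
struct dirty_chunk {
        unsigned long chunk;
        unsigned generation;    /* barrier generation the write belongs to */
};

/* stand-ins for the real asynchronous flush machinery */
static void launch_flush(unsigned long chunk) { (void)chunk; }
static void wait_for_flushes(void) { }

static void ordered_flush(struct dirty_chunk *dirty, int count,
                          unsigned newest_generation)
{
        for (unsigned g = 0; g <= newest_generation; g++) {
                for (int i = 0; i < count; i++)
                        if (dirty[i].generation == g)
                                launch_flush(dirty[i].chunk);
                wait_for_flushes();     /* synchronous wait at each barrier */
        }
}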
Don't expect this mode in the immediate future though, there are bugs
to fix in the current driver, which already meets the performance and
stability requirements of a broad range of users.
> > That is why I keep recommending that a ramback setup be replicated or
> > mirrored, which people in this thread keep glossing over. When
> > replicated or mirrored, you still get the microsecond-level transaction
> > times, and you get the safety too.
>
> I agree, but in this case, you should present it this way. You have been
> insisting too much on the average PC's reliability, the fact that no kernel
> ever crashed for you, etc... So you are demonstrating that your product is
> good provided that everything goes perfectly. All people who have experienced
> software or hardware problems in the past (ie mostly everyone here) will not
> trust your code because it relies on pre-requisites they know they do not
> have.
That would have been a miscommunication then. I see arguments coming
in that suggest embedded solutions, EMC for example, are inherently more
reliable than a Linux based solution. Well guess what? Some of those
embedded solutions already use Linux.
Also, peecees are much more reliable than people give them credit for,
especially if you harden up the obvious points of failure such as fans
and spinning disks. Once you have your system all hardened up, then
you _still_ better replicate your important data. Perhaps I should not
admit this, but I simply fail to do that on the machine from which I am
posting right now, which also runs my web server and mail system. That
is because I would have to reboot it to install ddsnap so I can replicate
properly, and because the thing is so darn reliable that I just have
not gotten around to it. I do copy off the important files from time
to time though, and do various other things to ameliorate the risk. If
it was enterprise data I would obviously do a lot more.
> > There will be a whole bunch of patches from me that are SSD oriented,
> > over time. The fact is, enterprise scale ramdisks are here now, while
> > enterprise scale flash is not. Getting close, but not here. And flash
> > does not approach the write performance of RAM, not now and probably
> > not ever.
>
> My goal is not to replace RAM with flash, but disk with flash.
My immediate goal is to replace disk with RAM.
> You are
> against ordered writes for a performance reason. Use SSD instead of
> hard drives and it will be as fast as sequential writes. Also, when
> you say that enterprise scale flash is not there, I don't agree. You
> can already afford hundreds of gigs of flash in 3,5" form factor. An
> 1.6 TB SSD has even been presented at CES2008, with sales announced
> for Q3. So clearly this will replace your hard drives soon, very soon.
> Even if it costs $5k, that's a very acceptable solution to replace a
> disk in a RAM-speed appliance.
Exactly what I mean: close but not there. Those gigantic RAM boxes are
shipping now, and the same company has got a 5 TB flash box coming down
the pipe, and sooner than Q3. But the RAM box will always outperform
the flash box. You just keep throwing writes at it until all available
flash is in erase mode, and the thing slows down. If that is not a
problem for you, then great, you can also save a lot of money by going
with flash. But if nothing less than the ultimate in performance will
do, RAM is the only way to go.
Daniel
On Sat, 15 Mar 2008, Daniel Phillips wrote:
> On Saturday 15 March 2008 14:54, Willy Tarreau wrote:
>> I think it could get major adoption with ordered writes.
>
> It already has ordered write when it is in flush mode.
>
> OK, I hear you. There will be an ordered write mode that uses barriers
> to decide the ordering. It will greatly reduce the speed at which
> ramback can flush dirty data because of the need to wait synchronously
> on every barrier, of which there are many. And thus will widen out the
> window during which UPS power must remain available if power goes out,
> in order to get all acknowledged transactions on to stable media. The
> advantage is, the stable media always has a point-in-time version of
> the filesystem.
it will mean that the window is larger, but it will also mean that if
something else goes wrong and that window is not available the data that
was written out will be usable (recent data will be lost, but older data
will still be available)
as for things that can go wrong
the UPS battery can go bad
you can have multiple power failures in a short time so your battery is not fully charged
capacitors in the UPS can go bad
capacitors in the power supply can go bad
capacitors on the motherboard can go bad
a kernel bug can crash the system
a bug in a device driver (say nvidia graphics driver) can crash the system
a card in the system can lock up the system bus
the system power supply can die
the system fans can die and cause the system to overheat
cooling in the room the system is in can fail and cause the system to overheat
airflow to the computer can get blocked and cause the system to overheat
some other component in the computer can short out and cause the system to lose power internally
I have had every single one of these things happen to me over the years.
Some on personal equipment, some on work equipment. At work I recently had
a series of disasters where capacitors in a 7 figure UPS blew up, and a
few days later during a power outage when we were running on generator, a
fuel company made a mistake while adding fuel to the generator and knocked
it out.
Even if you spend millions on equipment and professionals to set it up and
maintain it, you can still go down.
You may not care about it on your system (because you copy data elsewhere
and don't change it rapidly), but most people do. with your current
approach you are slightly better than a couple of shell scripts from an
availability point of view, you are no better in performance, but your
failure mode is complete disaster.
comparing you to 'cp drive ramdisk' at startup and 'rsync ramdisk drive'
periodically and at shutdown you are faster at startup, close enough at
shutdown as to be in the noise (either one could be faster, depending on
the exact conditions)
you have a failback mode that when the UPS tells you it has failed you
switch to write-through mode, that's some use (but only if you get
everything flushed first)
another off-the-shelf option is that you could use DRBD between the
ramdisk and the real drive, and when you lose power reconfigure to do
synchronous updates instead of write-behind updates. that would still be
far safer than ramback in its current mode.
> Don't expect this mode in the immediate future though, there are bugs
> to fix in the current driver, which already implements the required
> performance and stability requirements for a broad range of users.
and when those users ask why this functionality isn't in the kernel they
will read this thread and learn how many risks they are taking (in spite
of you promising them that they are perfectly safe)
anyone who has run any significant number of systems will not believe your
statement that hardware and software is reliable enough to be trusted like
this. by continuing to make this claim you are going to be ignored by
those people, and frankly, they will distrust any of your work as a result.
>>> That is why I keep recommending that a ramback setup be replicated or
>>> mirrored, which people in this thread keep glossing over. When
>>> replicated or mirrored, you still get the microsecond-level transaction
>>> times, and you get the safety too.
but a straight ramdisk can be replicated or mirrored. there's no need to
have ramback to do this.
>> I agree, but in this case, you should present it this way. You have been
>> insisting too much on the average PC's reliability, the fact that no kernel
>> ever crashed for you, etc... So you are demonstrating that your product is
>> good provided that everything goes perfectly. All people who have experienced
>> software or hardware problems in the past (ie mostly everyone here) will not
>> trust your code because it relies on pre-requisites they know they do not
>> have.
>
> That would have been a miscommunication then. I see arguments coming
> in that suggest embedded solutions, EMC for example, are inherently more
> reliable than a Linux based solution. Well guess what? Some of those
> embedded solutions already use Linux.
they aren't arguing that the embedded solutions are more safe because they
don't use linux. they are arguing that they are more safe because they
have different engineering than normal machines, and it's that engineering
that makes them safer, not the software.
the reason why battery backed ram on a raid card is safer than a UPS on a
general purpose machine is because the battery backed ram is static ram,
while the ram in your system is dynamic ram. static ram only needs power
to retain its memory, dynamic ram needs a processor running to access
the ram continuously to refresh it.
you see 'battery+ram' in both cases and argue that they are equally safe.
that just isn't the case.
the raid card can be pulled from one machine and put into another, in some
cases the ram can be pulled from one card and plugged into another. it can
sit on a shelf unplugged from anything but the battery for several days.
this means that unless something physically damages the ram and enough
drives to fail the raid array, the data is safe.
EMC, Netapp, and the other enterprise vendors have special purpose
hardware to implement this safety. how much special hardware they have
varies by company and equipment, but they all have some.
David Lang
On Sat, Mar 15, 2008 at 02:33:09PM -0800, Daniel Phillips wrote:
> On Saturday 15 March 2008 14:54, Willy Tarreau wrote:
> > I think it could get major adoption with ordered writes.
>
> It already has ordered write when it is in flush mode.
but IIRC it works in write-through then.
> OK, I hear you. There will be an ordered write mode that uses barriers
> to decide the ordering. It will greatly reduce the speed at which
> ramback can flush dirty data because of the need to wait synchronously
> on every barrier, of which there are many. And thus will widen out the
> window during which UPS power must remain available if power goes out,
> in order to get all acknowledged transactions on to stable media. The
> advantage is, the stable media always has a point-in-time version of
> the filesystem.
>
> Don't expect this mode in the immediate future though, there are bugs
> to fix in the current driver, which already implements the required
> performance and stability requirements for a broad range of users.
ah, good!
> > > That is why I keep recommending that a ramback setup be replicated or
> > > mirrored, which people in this thread keep glossing over. When
> > > replicated or mirrored, you still get the microsecond-level transaction
> > > times, and you get the safety too.
> >
> > I agree, but in this case, you should present it this way. You have been
> > insisting too much on the average PC's reliability, the fact that no kernel
> > ever crashed for you, etc... So you are demonstrating that your product is
> > good provided that everything goes perfectly. All people who have experienced
> > software or hardware problems in the past (ie mostly everyone here) will not
> > trust your code because it relies on pre-requisites they know they do not
> > have.
>
> That would have been a miscommunication then. I see arguments coming
> in that suggest embedded solutions, EMC for example, are inherently more
> reliable than a Linux based solution. Well guess what? Some of those
> embedded solutions already use Linux.
But their RAM does not depend on a lot of factors to remain valid and
usable, which is the problem with the common PC.
> Also, peecees are much more reliable than people give them credit for,
> especially if you harden up the obvious points of failure such as fans
> and spinning disks.
and PSU.
> Once you have your system all hardened up, then
> you _still_ better replicate your important data. Perhaps I should not
> admit this, but I simply fail to do that on the machine from which I am
> posting right now, which also runs my web server and mail system. That
> is because I would have to reboot it to install ddsnap so I can replicate
> properly, and because the thing is so darn reliable that I just have
> not gotten around to it. I do copy off the important files from time
> to time though, and do various other things to ameliorate the risk. If
> it was enterprise data I would obviously do a lot more.
Securing every component simply reduces the risk of a loss of service.
What is important with data is to know the consequences of loss of service.
If that only means that no one can work and that the last second of work is
lost, it's generally acceptable. If it means everything is lost to a corrupted
FS, obviously it's not.
> > > There will be a whole bunch of patches from me that are SSD oriented,
> > > over time. The fact is, enterprise scale ramdisks are here now, while
> > > enterprise scale flash is not. Getting close, but not here. And flash
> > > does not approach the write performance of RAM, not now and probably
> > > not ever.
> >
> > My goal is not to replace RAM with flash, but disk with flash.
>
> My immediate goal is to replace disk with RAM.
No, you're replacing disk activity with RAM activity. But you keep disk as
a backend, long-term storage.
> > You are
> > against ordered writes for a performance reason. Use SSD instead of
> > hard drives and it will be as fast as sequential writes. Also, when
> > you say that enterprise scale flash is not there, I don't agree. You
> > can already afford hundreds of gigs of flash in 3,5" form factor. An
> > 1.6 TB SSD has even been presented at CES2008, with sales announced
> > for Q3. So clearly this will replace your hard drives soon, very soon.
> > Even if it costs $5k, that's a very acceptable solution to replace a
> > disk in a RAM-speed appliance.
>
> Exactly what I mean: close but not there. Those gigantic RAM boxes are
> shipping now, and the same company has got a 5 TB flash box coming down
> the pipe, and sooner than Q3. But the RAM box will always outperform
> the flash box. You just keep throwing writes at it until all available
> flash is in erase mode, and the thing slows down. If that is not a
> problem for you, then great, you can also save a lot of money by going
> with flash. But if nothing less than the ultimate in performance will
> do, RAM is the only way to go.
Sorry if I was not clear. I was not speaking about replacing the RAM with
flash, but only the disks. You keep the RAM for the speed, and use flash
for permanent storage instead of disks. No seek time, average RW speed now
slightly better than disks, that combined with your ramdisk and ordered
write-back will give the best of both worlds: RAM speed and flash
reliability.
Willy
> > It makes a lot of difference,
>
> It makes a difference of degree, not of kind.
I think "I get my data back" is a difference in kind.
> But interpreting a barrier to mean flush through to rotating media...
> performance will drop to the millisecond per transaction zone, like a
That isn't anything to do with what was being proposed. *ORDERING* not
flush to media.
> Even raid controllers... so you agree that some of them just don't
> respond conservatively to tagged commands, either because the engineers
> don't know how to implement that (unlikely) or because they want to win
> the performance benchmarks, and they do trust their battery?
The ones that don't respect tagged ordering are the ultra cheap nasty
things you buy down the local computer store that come with a 2 page
manual in something vaguely like English. The stuff used for real work is
quite different.
> Ramback already is taken seriously, just not by you. That is fine, you
> apparently do not need or want the speed.
I want the speed and reliability. Without that ramback is a distraction
until someone solves the real problems.
> they need to achieve microsecond level transaction throughput and data
You have no guarantee of commit to stable storage so your use of the word
"transaction" is a bit farcical.
There are a whole variety of ways to get far better results than "whoops
bang there goes the file system". Log structured backing media is one,
even snapshots. That way you'd quantify that for the cost of more
rotating storage (which is cheap) you can only lose "x" minutes of data
and will lose everything from a defined consistent point. File based
backing store also has similar properties done right, but needs some
higher level care to track closure and dirty blocks on a per inode basis.
Alan
In article <[email protected]> you wrote:
>> RAID controllers do not have half a terabyte of RAM.
>
> And? Either you have battery backed ram with critical data in it or
> you do not. Exactly how much makes little difference to the question.
Besides, some SAN storage devices do have that amount of RAM. However it is
better protected than in your typical PC. With mirroring, it can be removed
(including the battery packs) - and there is a procedure to actually replay
the buffers once the new devices are in place.
But that's not an argument against or in favor of Ramback, it's just two
different things. You would be surprised how many databases run on write back
mode disks without fdsync() and nobody cares :)
Greetings
Bernd
[email protected] writes:
> dynamic ram needs a processor
> running to access the ram continuously to refresh it.
Actually modern DRAM can be put into "self refresh" mode which doesn't
need (nor allow) any external accesses. Not very practical in the typical
PC case, though I think suspend to RAM uses it. Could be used for a
battery-backed RAID/disk controller as well.
Obviously it changes nothing WRT ramback.
--
Krzysztof Halasa
On Saturday 15 March 2008 16:22, Willy Tarreau wrote:
> > That would have been a miscommunication then. I see arguments coming
> > in that suggest embedded solutions, EMC for example, are inherently more
> > reliable than a Linux based solution. Well guess what? Some of those
> > embedded solutions already use Linux.
>
> But their RAM does not depend on a lot of factors to remain valid and
> usable, which is the problem with the common PC.
For example?
Anecdote time. Remember there used to be "brand name" floppy disks and
generic floppy disks, and the brand name ones cost a lot more because
they were supposedly safer? Well, big secret, studies were done and
the no-name disks came out better. Why? Because selling at commodity
prices the generic makers could not afford returns. So they made them
well.
It is like that with PCs. Supposedly you get a lot more reliability
when you spend more money and buy all high end near-custom gear. In
fact, the cheap stuff just keeps on chugging, because those guys can't
afford to have it break.
So please don't underestimate the reliability of a PC.
There are bits of Linux that are undeniably dodgy. We get a lot of bug
reports about usb for example, keyboards just quitting and it's not the
keyboard's fault. Just say no to usb in a server, at least until some
fundamental cleanup happens there.
The worst bug I've seen in a server this year? A buggy bios in a Dell
server that would issue a keyboard error and sit and wait for somebody
to press F1 when there was no keyboard attached. That is embedded
software for you. Personally, I think we do way better than that in
Linux.
> > Also, peecees are much more reliable than people give them credit for,
> > especially if you harden up the obvious points of failure such as fans
> > and spinning disks.
>
> and PSU.
Yes. Dual power supplies are highly recommended for this application.
With dual power supplies you can carry out preemptive maintenance on
the UPS.
> Securing every component simply reduces the risk of a loss of service.
> What is important with data is to know the consequences of loss of service.
> If that only means that no one can work and that the last second of work is
> lost, it's generally acceptable. If it means everything is lost to a corrupted
> FS, obviously it's not.
So mirror two of them, I keep saying. If that is not good enough for
you, then make it three way, and replicate for good measure. The thing
is, none of that hurts the microsecond level performance, and it gets
you whatever data security you desire. Whereas anything that requires
waiting on disk transactions does hurt performance. Since my interest
currently lies in high performance, that is where my effort goes. And
do I need to say it: patches gratefully accepted.
For my immediate application... hacking the kernel in comfort... just
replicating will provide all the data safety I need.
> Sorry if I was not clear. I was not speaking about replacing the RAM with
> flash, but only the disks. You keep the RAM for the speed, and use flash
> for permanent storage instead of disks. No seek time, average RW speed now
> slightly better than disks, that combined with your ramdisk and ordered
> write-backs writes will have the best of both worlds : RAM speed and flash
> reliability.
Right. What we are talking about is filling in a missing level in the
cache hierarchy, something like:
L1 .3 ns
L2 3 ns
L3 30 ns
Ramdisk 2 us
Flash 20 us
Disk 3 ms
Approximate, numbers not necessarily too accurate, but you know what I
mean. Currently there is this gigantic performance cliff between L3
memory and disk. Something like the Violin ramdisk fills it in nicely.
And see, you still need that rotating media because it always will be
an order of magnitude cheaper than flash. Tape might still fit in
there too, though these days it seems increasingly doubtful.
Daniel
Daniel Phillips wrote:
> when you spend more money and buy all high end near-custom gear. In
> fact, the cheap stuff just keeps on chugging, because those guys can't
> afford to have it break.
>
> So please don't underestimate the reliability of a PC.
>
I strongly disagree. Cheap PC hardware is not even close to the quality
of a serious, branded machine. Often capacitors are missing from power
lines, and the ones that are installed fail sooner. Cooling fans are
lower quality and fail much sooner. Timing issues abound.
There's a reason why an IBM is a better machine than a "Black-n-Gold":
IBM value their name so when you have a problem, they have a problem.
Buy generic and when you get a problem they already have your money and
since they have no investment in their name, they have nothing more to
care about.
Daniel Phillips wrote:
>> Also, please note that the problem here is not related to the number of
>> nines of availability. This number only counts the ratio between uptime
>> and downtime. We're more facing a problem of MTBF, where the consequences
>> of a failure are hard to predict.
>>
>
> That is why I keep recommending that a ramback setup be replicated or
> mirrored, which people in this thread keep glossing over. When
> replicated or mirrored, you still get the microsecond-level transaction
> times, and you get the safety too.
Do you mean it should be replicated with a second ramback? That would
be pretty pointless, since all failure modes would affect both. It's
not like one ramback will survive a crash when the other doesn't.
On Sat, Mar 15, 2008 at 07:33:07PM -0800, Daniel Phillips wrote:
> On Saturday 15 March 2008 16:22, Willy Tarreau wrote:
> > > That would have been a miscommunication then. I see arguments coming
> > > in that suggest embedded solutions, EMC for example, are inherently more
> > > reliable than a Linux based solution. Well guess what? Some of those
> > > embedded solutions already use Linux.
> >
> > But their RAM does not depend on a lot of factors to remain valid and
> > usable, which is the problem with the common PC.
>
> For example?
What I mean is that in a PC, RAM contents are very fragile:
- weak batteries in your UPS => end of game
- a loose power cable between UPS and PC => end of game (BTW I have a customer
who had such a problem, both cables had disconnected because of their own
weight).
- kernel panic => end of game
- user error during planned maintenance => end of game
- flaky driver writing to wrong memory location => can't trust your data
In a normal PC, even if the RAM itself is a reliable component (ECC, ...)
a lot of such problems can happen and render it unusable. If you
have to reboot, your BIOS will clean it up for you. That's why people are
trying to explain to you that linux is not reliable enough to work like
this.
Now if you have all your RAM on a PCI-E board with a battery, which is
not initialized by the BIOS so that it survives reboots, it changes a LOT
of things, because all the problems mentioned above go away. Let me
repeat it: the problem is not that those components are too unreliable
to build a transactional system, it is that used in this manner, a very
simple failure of any of them is enough to lose/corrupt all of your data.
That is why people insist on ordered writes with regular flushes.
> Anecdote time. Remember there used to be "brand name" floppy disks and
> generic floppy disks, and the brand name ones cost a lot more because
> they were supposedly safer? Well, big secret, studies were done and
> the no-name disks came out better. Why? Because selling at commodity
> prices the generic makers could not afford returns. So they made them
> well.
That was not my experience when I was a student. We would buy very cheap
diskettes which were only sold in packs of 100. 20% of them were already defective,
and 20% of the remaining ones could not keep our data till the next morning!
I knew guys who finally stopped copying games due to those diskettes, so
we believed they were sold by game publishers :-)
> It is like that with PCs. Supposedly you get a lot more reliability
> when you spend more money and buy all high end near-custom gear. In
> fact, the cheap stuff just keeps on chugging, because those guys can't
> afford to have it break.
>
> So please don't underestimate the reliability of a PC.
If you have understood what I explained above, now you'll understand that
I'm not underestimating the reliability of my PC, just the fact that keeping
access to my RAM contents involves a lot of components, any of which will
definitely ruin my data in case of failure.
> There are bits of Linux that are undeniably dodgy. We get a lot of bug
> reports about usb for example, keyboards just quitting and it's not the
> keyboard's fault. Just say no to usb in a server, at least until some
> fundamental cleanup happens there.
unfortunately, new servers are often USB-only.
> The worst bug I've seen in a server this year? A buggy bios in a Dell
> server that would issue a keyboard error and sit and wait for somebody
> to press F1 when there was no keyboard attached.
I thought this stupidity disappeared about 5 years ago? I was about to
build PIC-based PS/2 "terminators" to plug into machines to avoid this
problem at that time.
> That is embedded software for you. Personally, I think we do way
> better than that in Linux.
>
> > > Also, peecees are much more reliable than people give them credit for,
> > > especially if you harden up the obvious points of failure such as fans
> > > and spinning disks.
> >
> > and PSU.
>
> Yes. Dual power supplies are highly recommended for this application.
> With dual power supplies you can carry out preemptive maintenance on
> the UPS.
>
> > Securing every component simply reduces the risk of a loss of service.
> > What is important with data is to know the consequences of loss of service.
> > If that only means that no one can work and that the last second of work is
> > lost, it's generally acceptable. If it means everything is lost to a corrupted
> > FS, obviously it's not.
>
> So mirror two of them, I keep saying. If that is not good enough for
> you, then make it three way, and replicate for good measure. The thing
> is, none of that hurts the microsecond level performance, and it gets
> you whatever data security you desire. Whereas anything that requires
> waiting on disk transactions does hurt performance. Since my interest
> currently lies in high performance, that is where my effort goes.
I never spoke about waiting for disk transactions. The RAM must be the
only source and target of user data. Disk is there for permanent storage
and should be written to in the background. YOU proposed the write-through
alternative with your "echo 1". But obviously this voids any advantage of
your work.
> And do I need to say it: patches gratefully accepted.
Hey thanks, but we're not on freshmeat: "here's version 0.1 of foobar,
right now it does nothing but given a massive amount of contributors it
will replace a datacenter in a matchbox".
> For my immediate application... hacking the kernel in comfort... just
> replicating will provide all the data safety I need.
Daniel, you must understand that it is not because it suits *your* needs
that your project will get broad adoption. Many people are showing you
what they don't like in it, and it's not even a design problem, it's just
the way data are synchronized. I think that if you spent your time on
your code instead of arguing by mail against each of us, you would have
already got ordered writes working.
> > Sorry if I was not clear. I was not speaking about replacing the RAM with
> > flash, but only the disks. You keep the RAM for the speed, and use flash
> > for permanent storage instead of disks. No seek time, average RW speed now
> > slightly better than disks, that combined with your ramdisk and ordered
> > write-backs writes will have the best of both worlds : RAM speed and flash
> > reliability.
>
> Right. What we are talking about is filling in a missing level in the
> cache hierarchy, something like:
>
> L1 .3 ns
> L2 3 ns
> L3 30 ns
> Ramdisk 2 us
> Flash 20 us
> Disk 3 ms
>
> Approximate, numbers not necessarily too accurate, but you know what I
> mean. Currently there is this gigantic performance cliff between L3
> memory and disk. Something like the Violin ramdisk fills it in nicely.
> And see, you still need that rotating media because it always will be
> an order of magnitude cheaper than flash.
"always" is far from being a certitude here. "still" is right though.
Prices are driven by customer demand. And building a 128 GB flash
requires a lot less efforts than a hard drive containing a lot of
fragile mechanics. However, I'm not sure that flash will be as much
resistant to environmental annoyances that we're happy to ignore
today, such as solar winds and cosmic rays. Future will tell.
> Tape might still fit in
> there too, though these days it seems increasingly doubtful.
Tapes are used for long-term archival. You can read a tape 20 years
after having written it. A disk... well, the interface to plug it into
does not exist anymore, and even the electronics processes have changed,
as well as voltage levels. Check your boxes for an old
MFM or RLL disk, and see where you can plug it. Maybe you'll find
an old ISA controller with a corrupted BIOS (too old) or at least one
which does not support machines faster than 25 MHz.
Tape vendors will still sell you the tape drive (at an amazing
price BTW).
Willy
On Sunday 16 March 2008, David Newall wrote:
> There's a reason why an IBM is a better machine than a "Black-n-Gold":
> IBM value their name so when you have a problem, they have a problem.
> Buy generic and when you get a problem they already have your money and
> since they have no investment in their name, they have nothing more to
> care about.
That's just nonsense in a consolidated market.
You change to IBM, then to Dell, then to HP
then again to IBM. Maybe you even try Sun.
That causes you more grief than any one of them.
I have seen people doing that in all industry branches
and even privately.
If you love brands, then your choice becomes very limited.
That's the real reason for them being much more expensive.
If you think in terms of machines and specs, then you have a much clearer
picture. After a while you even have your own measures for failure
rates of those components and can handle it. No matter which brand :-)
Best Regards
Ingo Oeser
> Anecdote time. Remember there used to be "brand name" floppy disks and
> generic floppy disks, and the brand name ones cost a lot more because
> they were supposedly safer? Well, big secret, studies were done and
> the no-name disks came out better. Why? Because selling at commodity
> prices the generic makers could not afford returns. So they made them
> well.
Which is not the case for PCs
>
> It is like that with PCs.
Nope.
> when you spend more money and buy all high end near-custom gear. In
> fact, the cheap stuff just keeps on chugging, because those guys can't
> afford to have it break.
They don't care if it breaks after 12 months, and for components and
addons they don't care if it breaks, they just blame the end user for
mis-installation or 'incompatibility'. There is a huge difference in
quality between high end server boards and cheap desktop PC systems.
> Right. What we are talking about is filling in a missing level in the
> cache hierarchy, something like:
Perhaps. But if your cache can destroy the contents of the layer below in
situations that do occur it isn't useful. If you can fix that then it
obviously has a lot of potential.
Alan
On Sun, Mar 16, 2008 at 01:14:49PM +0000, Alan Cox wrote:
> > when you spend more money and buy all high end near-custom gear. In
> > fact, the cheap stuff just keeps on chugging, because those guys can't
> > afford to have it break.
>
> They don't care if it breaks after 12 months, and for components and
> addons they don't care if it breaks, they just blame the end user for
> mis-installation or 'incompatibility'. There is a huge difference in
> quality between high end server boards and cheap desktop PC systems.
Actually, it's worse than that. Users have been trained that when a
computer bluescreens and loses all of their data, it's either (a)
just the way things are, or (b) it's Microsoft's fault. Worse yet,
thanks to things like PC benchmarks, hard drive manufacturers have in
the past been encouraged to do things like lie to the OS about when
things had hit the hard drive platter just to score higher numbers on
winbench.
All of this is why I have in the past summed it up as Ted's law
of PC class hardware, which is that PC class hardware is cr*p. :-)
- Ted
On Saturday 15 March 2008 22:42, David Newall wrote:
> Daniel Phillips wrote:
> > That is why I keep recommending that a ramback setup be replicated or
> > mirrored, which people in this thread keep glossing over. When
> > replicated or mirrored, you still get the microsecond-level transaction
> > times, and you get the safety too.
>
> Do you mean it should be replicated with a second ramback? That would
> be pretty pointless, since all failure modes would affect both. It's
> not like one ramback will survive a crash when the other doesn't.
A second machine running a second ramback, on a second UPS pair.
I thought that was obvious.
Daniel
On Saturday 15 March 2008 16:05, Alan Cox wrote:
> > ...interpreting a barrier to mean flush through to rotating media...
> > performance will drop to the millisecond per transaction zone...
>
> That isn't anything to do with what was being proposed. *ORDERING* not
> flush to media.
This is where you have made a fundamental mistake in your proposal.
Suppose you have a steady, heavy write load onto ramback. Eventually,
the entire ramdisk will be dirty and you have to drop back to disk
speed, right? My design does not suffer from that problem, but your
proposal does.
It gets worse than that. Suppose somebody writes the same region
twice, how do you order that? Do you try to store that new data
somewhere, keeping in mind that we are already at terabyte scale? Is
there a limit on how much overwrite data you may have to store? (No.)
> I want the speed and reliability. Without that ramback is a distraction
> until someone solves the real problems.
Somebody has. But please feel free to solve some other problem. I
would love to see a detailed design from you, or a patch.
> > they need to achieve microsecond level transaction throughput and data
>
> You have no guarantee of commit to stable storage so your use of the word
> "transaction" is a bit farcical.
The UPS provides a guarantee of commit to stable storage. No amount of
FUD will change that. But please go ahead and calculate the risks
involved. I am confident you will admit that there are standard
techniques available to ameliorate risk, which may be applied _on top of_
ramback, thus not destroying its microsecond-level transaction
performance as you propose.
Daniel
Daniel Phillips <[email protected]> writes:
> Anecdote time. Remember there used to be "brand name" floppy disks and
> generic floppy disks, and the brand name ones cost a lot more because
> they were supposedly safer? Well, big secret, studies were done and
> the no-name disks came out better. Why? Because selling at commodity
> prices the generic makers could not afford returns. So they made them
> well.
I don't think so. I remember we had many more problems with no-name
disks. And yes, certain brands were problematic too, but most
(such as Fuji and 3M, and others) were fine.
> It is like that with PCs. Supposedly you get a lot more reliability
> when you spend more money and buy all high end near-custom gear. In
> fact, the cheap stuff just keeps on chugging, because those guys can't
> afford to have it break.
Real life doesn't agree with this at all. The servers keep working
for years and the cheap stuff quits fast (if it works initially, which
is not always the case).
> The worst bug I've seen in a server this year? A buggy bios in a Dell
> server that would issue a keyboard error and sit and wait for somebody
> to press F1 when there was no keyboard attached.
Most BIOSes (all I've seen in this millennium) have an option to disable
that.
On a server board you can usually have a remote console; how could
that work otherwise?
> That is embedded
> software for you.
Server != embedded.
> Yes. Dual power supplies are highly recommended for this
> application.
:-)
So which user groups are you aiming at exactly?
> Right. What we are talking about is filling in a missing level in the
> cache hierarchy, something like:
>
> L1 .3 ns
> L2 3 ns
> L3 30 ns
> Ramdisk 2 us
> Flash 20 us
> Disk 3 ms
We already have RAM between L3 and Flash.
The problem is flushing L1 to disk/flash takes time.
--
Krzysztof Halasa
> > That isn't anything to do with what was being proposed. *ORDERING* not
> > flush to media.
>
> This is where you have made a fundamental mistake in your proposal.
> Suppose you have a steady, heavy write load onto ramback. Eventually,
> the entire ramdisk will be dirty and you have to drop back to disk
> speed, right? My design does not suffer from that problem, but your
> proposal does.
In your design the entire ramdisk goes bang and disappears on a crash.
> It gets worse than that. Suppose somebody writes the same region
> twice, how do you order that? Do you try to store that new data
> somewhere, keeping in mind that we are already at terabyte scale? Is
> there a limit on how much overwrite data you may have to store? (No.)
You only have to care about ordering if there is a store barrier between
the two (not usual). You only have to care about filling if you generate
enough dirty blocks at a very high rate (which is unusual for most
workloads). If you don't care about those then we already have a ramdisk, and
if you want to write a ramdisk driver for an external ramdisk, great. You'd
also fix the layering violations then by allowing device mapper to
implement things like snapshotting and writeback separated from your
driver.
Even in the extreme case that you propose there are trivial ways of
getting coherency. Simple example - if you can sweep all the data out in
say 10 minutes then you can buy twice the physical media and ensure that
one of the two sets of disk backups is genuinely store barrier consistent
to some snapshot time (say every 30 minutes but obviously user tunable).
If you at least had some kind of credible snapshotting you'd find people
less hostile to your glorified ramdisk.
> > You have no guarantee of commit to stable storage so your use of the word
> > "transaction" is a bit farcical.
>
> The UPS provides a guarantee of commit to stable storage. No amount of
Stable storage to most people means "won't go away on a bad happening".
Transaction likewise has a specific meaning in terms of an event occurring
once only and either being recorded before or after the transaction
occurred.
Alan
Willy Tarreau <[email protected]> writes:
> Tapes are used for long-term archival. You can read a tape 20 years
> after having written it.
Well, you better check them regularly, you never know :-)
> Tape vendors will still sell you the tape drive (at an amazing
> price BTW).
Not sure if things like SLR-2 or so are still available, except second
hand. But they at least provide compatibility for some time.
--
Krzysztof Halasa
David Newall <[email protected]> writes:
> Do you mean it should be replicated with a second ramback? That would
> be pretty pointless, since all failure modes would affect both. It's
> not like one ramback will survive a crash when the other doesn't.
It could, in a bit different location maybe, but it isn't a substitute
for ordered writes.
--
Krzysztof Halasa
Hi Alan,
On Sunday 16 March 2008 14:55, Alan Cox wrote:
> > > That isn't anything to do with what was being proposed. *ORDERING* not
> > > flush to media.
> >
> > This is where you have made a fundamental mistake in your proposal.
> > Suppose you have a steady, heavy write load onto ramback. Eventually,
> > the entire ramdisk will be dirty and you have to drop back to disk
> > speed, right? My design does not suffer from that problem, but your
> > proposal does.
>
> In your design the entire ramdisk goes bang and disappears on a crash.
According to you. A more accurate statement: if you have the ramdisk
on the host, then the host is assumed to be reliable. If the ramdisk
is external (http://www.violin-memory.com/products/violin1010.html)
then your statement is untrue in every sense.
But you did not address the logic of my statement above: that your
fundamental design prevents you from operating at ramdisk speed during
normal operation.
> > It gets worse than that. Suppose somebody writes the same region
> > twice, how do you order that? Do you try to store that new data
> > somewhere, keeping in mind that we are already at terabyte scale? Is
> > there a limit on how much overwrite data you may have to store? (No.)
>
> You only have to care about ordering if there is a store barrier between
> the two (not usual).
No wait, it is completely normal. There is a barrier on every journal
transaction. Constructing such a load is trivial.
> You only have to care about filling if you generate
> enough dirty blocks at a very high rate (which is unusual for most
> workloads).
It is completely normal for a transaction processing system.
> If you don't care about those then we have ramdisk already and
I care about them, as do others.
> if you want to write a ramdisk driver for external ramdisk great. You'd
Exactly the purpose for which this driver was written. And as a bonus
it happens to be useful for internal ramdisk applications as well. (It
is useful for me, however your mileage may vary.)
> also fix the layering violations then by allowing device mapper to
> implement things like snapshotting and writeback seperated from your
> driver.
Device mapper already can, so I do not get your point. Also, what is
this layering violation you refer to?
> Even in the extreme case that you propose there are trivial ways of
> getting coherency. Simple example - if you can sweep all the data out in
> say 10 minutes
If.
> then you can buy twice the physical media and ensure that
> one of the two sets of disk backups is genuinely store barrier consistent
> to some snapshot time (say every 30 minutes but obviously user tunable).
> If you at least had some kind of credible snapshotting you'd find people
> less hostile to your glorified ramdisk.
Hostility does not equate to accuracy. Galileo comes to mind.
I see people arguing that a server+linux+batteries+mirroring+replication
cannot achieve enterprise grade reliability. Balderdash.
Regards,
Daniel
On Sunday 16 March 2008 15:15, Krzysztof Halasa wrote:
> David Newall <[email protected]> writes:
> > Do you mean it should be replicated with a second ramback? That would
> > be pretty pointless, since all failure modes would affect both. It's
> > not like one ramback will survive a crash when the other doesn't.
>
> It could, in a bit different location maybe, but it isn't a substitute
> for ordered writes.
How so?
Daniel
> Hostility does not equate to accuracy. Galileo comes to mind.
I see no attempt to even discuss the use of two sets
of physical storage to maintain coherent snapshots, just comments about
hostility. That's a fairly poor way to repay people who spend a lot of
time working with enterprise customers and are interested in solutions
using things like giant ramdisks and are putting in time to discuss
alternative ways of achieving the desired result.
> I see people arguing that a server+linux+batteries+mirroring+replication
> cannot achieve enterprise grade reliability. Balderdash.
I look forward to seeing your constructive detailed analysis of failure
modes based upon actual statistical data from real data centres. Unless
you can produce that nobody is going to take you seriously, which is bad
luck for the poor folks at violin if they are relying on you.
Alan
Daniel Phillips <[email protected]> writes:
>> It could, in a bit different location maybe, but it isn't a substitute
>> for ordered writes.
>
> How so?
Not sure if I understand the question correctly but obviously a pair
(mirror) of servers running "dangerous" ramback would survive a crash
of one machine and we could practically eliminate the probability of
both (all) machines crashing simultaneously. However, there are
cheaper ways to achieve similar performance and even better
reliability - including those battery-backed (RAI)Disk controllers.
--
Krzysztof Halasa
On Sunday 16 March 2008 15:46, Alan Cox wrote:
> > Hostility does not equate to accuracy. Galileo comes to mind.
>
> I see no attempt to even discuss the use of two sets
> of physical storage to maintain coherent snapshots, just comments about
> hostility. That's a fairly poor way to repay people who spend a lot of
> time working with enterprise customers and are interested in solutions
> using things like giant ramdisks and are putting in time to discuss
> alternative ways of achieving the desired result.
You did not explain how your proposal will avoid dropping the transaction
throughput down to disk speed, whereas I have explained how my existing
design can be made to achieve enterprise-grade data safety.
As far as I can see, there is no way to do what you propose without
losing a couple of orders of magnitude of transaction response time
under normal running conditions. If you can see a method that I cannot,
then I am all ears. Until then... I just loved the earlier post to the
thread:
"It is not the best way to travel faster than light. It is just
the only way".
I do not think you have set out to solve the same problem I have, which
is to attain the highest possible transaction throughput with enterprise
scale reliability. If you would like to take my code and modify it to
work just as you wish it, then of course you are more than welcome.
Daniel
On Sunday 16 March 2008 16:08, Krzysztof Halasa wrote:
> Daniel Phillips <[email protected]> writes:
> >> It could, in a bit different location maybe, but it isn't a substitute
> >> for ordered writes.
> >
> > How so?
>
> Not sure if I understand the question correctly but obviously a pair
> (mirror) of servers running "dangerous" ramback would survive a crash
> of one machine and we could practically eliminate the probability of
> both (all) machines crashing simultaneously. However, there are
> cheaper ways to achieve similar performance and even better
> reliability - including those battery-backed (RAI)Disk controllers.
OK, so we are only searching for the cheapest way to achieve these
kinds of speeds, for some given uptime and risk level requirements.
That is a really interesting subject, but can we please leave it for a
while so I can get some work done on the code itself?
Thanks,
Daniel
Daniel Phillips wrote:
> The UPS provides a guarantee of commit to stable storage. No amount of
> FUD will change that.
What about system crashes? They guarantee that data will be lost. I
know opinions are divided on the subject of crashes: You say Linux
doesn't; everybody else says it does. I side with experience. (It does.)
On Sunday 16 March 2008 18:31, David Newall wrote:
> Daniel Phillips wrote:
> > The UPS provides a guarantee of commit to stable storage. No amount of
> > FUD will change that.
>
> What about system crashes? They guarantee that data will be lost. I
Not if it is mirrored and replicated. Also nice if crashes are very
rare, which they are unless you work at it.
> know opinions are divided on the subject of crashes: You say Linux
> doesn't; everybody else says it does. I side with experience. (It does.)
I say it does not crash often, to the point where I have not seen it
crash once for any reason I did not create myself (I tend to wait for
the occasional brown bag release to fade away before shifting development).
We do get quite a few reports of less mature systems like hald and usb
causing problems, and not too long ago the NFS client was very crash happy.
I did see some of those myself two years ago, and fixed them.
On the whole, Linux is very reliable. Very very reliable. Now mirror
that, replicate it, add in 2 x 2 redundant power supplies backed by
independent UPS units so you can do regular preemptive maintenance on
the batteries, and you have a sweet enterprise transaction processing
system. All set for a faster than light moon shot :-)
Daniel
On Sun, 16 Mar 2008, Daniel Phillips wrote:
> On Sunday 16 March 2008 18:31, David Newall wrote:
>> Daniel Phillips wrote:
>>> The UPS provides a guarantee of commit to stable storage. No amount of
>>> FUD will change that.
>>
>> What about system crashes? They guarantee that data will be lost. I
>
> Not if it is mirrored and replicated. Also nice if crashes are very
> rare, which they are unless you work at it.
if you are depending on replication over the network you have just limited
your throughput to your network speed and latency. on an enterprise level
machine the network can frequently be significantly slower than the disk
array that you are so frantic to avoid waiting for.
David Lang
On Sunday 16 March 2008 20:59, [email protected] wrote:
> On Sun, 16 Mar 2008, Daniel Phillips wrote:
> > On Sunday 16 March 2008 18:31, David Newall wrote:
> >> Daniel Phillips wrote:
> >>> The UPS provides a guarantee of commit to stable storage. No amount of
> >>> FUD will change that.
> >>
> >> What about system crashes? They guarantee that data will be lost. I
> >
> > Not if it is mirrored and replicated. Also nice if crashes are very
> > rare, which they are unless you work at it.
>
> if you are depending on replication over the network you have just limited
> your throughput to your network speed and latency.
Replication does not work that way. On each replication cycle, the
differences between the most recent two volume snapshots go over the
network. This strategy has the nice effect of consolidating rewrites.
There are also excellent delta compression opportunities.
In the worst case, with insufficient bandwidth for the churn rate of
the volume, the replication cycle time increases to the time needed to
replicate the full volume. Again, at worst, this would require extra
storage, equivalent to the original volume size, for the snapshot being
replicated, so that the primary volume is not forced to wait
synchronously for a replication cycle to complete.
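To make the cycle concrete, here is a minimal sketch of one replication
pass over a chunked volume. It is an illustration only, not ramback code;
the names (send_chunk, the snapshot layout, the chunk and volume sizes)
are assumptions, and a real implementation would consult a changed-chunk
bitmap kept by the snapshot store rather than comparing chunk contents.

	/*
	 * Illustrative sketch: one replication cycle.  Only chunks that
	 * differ between the two most recent snapshots cross the network,
	 * so repeated writes to the same chunk are consolidated into a
	 * single transfer.
	 */
	#include <stdint.h>
	#include <string.h>

	#define CHUNK_SIZE 4096
	#define NCHUNKS    (1u << 20)           /* 4 GB volume, for example */

	struct snapshot {
		const uint8_t *data;            /* frozen point-in-time copy */
	};

	/* Transport is assumed; only the chunks passed here leave the machine. */
	extern void send_chunk(unsigned chunk, const uint8_t *buf, size_t len);

	static void replicate_cycle(const struct snapshot *prev,
				    const struct snapshot *curr)
	{
		for (unsigned chunk = 0; chunk < NCHUNKS; chunk++) {
			const uint8_t *oldbuf = prev->data + (size_t)chunk * CHUNK_SIZE;
			const uint8_t *newbuf = curr->data + (size_t)chunk * CHUNK_SIZE;

			/* A changed-chunk bitmap would replace this compare,
			 * and the chunk could be delta-compressed before
			 * sending. */
			if (memcmp(oldbuf, newbuf, CHUNK_SIZE))
				send_chunk(chunk, newbuf, CHUNK_SIZE);
		}
	}

In the worst case described above every chunk differs, and the loop
degrades to a full-volume copy without ever stalling writes to the
primary.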
Mirroring, on the other hand, makes a realtime copy of a volume that is
never out of date.
I hope this helps.
> on an enterprise level
> machine the network can frequently be significantly slower than the disk
> array that you are so frantic to avoid waiting for.
Frantic... your word. Designing for dependably high transaction rates
requires a different mode of thinking that some traditionalists seem to
be having some trouble with.
Daniel
On Sun, 16 Mar 2008, Daniel Phillips wrote:
> On Sunday 16 March 2008 20:59, [email protected] wrote:
>> On Sun, 16 Mar 2008, Daniel Phillips wrote:
>>> On Sunday 16 March 2008 18:31, David Newall wrote:
>>>> Daniel Phillips wrote:
>>>>> The UPS provides a guarantee of commit to stable storage. No amount of
>>>>> FUD will change that.
>>>>
>>>> What about system crashes? They guarantee that data will be lost. I
>>>
>>> Not if it is mirrored and replicated. Also nice if crashes are very
>>> rare, which they are unless you work at it.
>>
>> if you are depending on replication over the network you have just limited
>> your throughput to your network speed and latency.
>
> Replication does not work that way. On each replication cycle, the
> differences between the most recent two volume snapshots go over the
> network. This strategy has the nice effect of consolidating rewrites.
> There are also excellent delta compression opportunities.
>
> In the worst case, with insufficient bandwidth for the churn rate of
> the volume, replication rate increases to the time for replicating the
> full volume. Again, at worst, this would require extra storage for the
> snapshot to be replicated equivalent to the original volume size, so
> that the primary volume is not forced to wait synchronously for a
> replication cycle to complete.
>
> Mirroring on the other hand, makes a realtime copy of a volume, that is
> never out of date.
so just mirror to a local disk array then.
a local disk array has more write bandwidth than a network connection to a
remote machine, so if you can mirror to a remote machine you can mirror to
a local disk array.
> I hope this helps.
not in the least.
>> on an enterprise level
>> machine the network can frequently be significantly slower than the disk
>> array that you are so frantic to avoid waiting for.
>
> Frantic... your word. Designing for dependably high transaction rates
> requires a different mode of thinking that some traditionalists seem to
> be having some trouble with.
if by traditionalists you mean everyone who makes a living keeping systems
running you are right. we want sane failure modes as much as we want
performance.
there will be times when we decide to go for speed at the expense of
safety, but we want to do it knowingly, not when someone is promising both
and only provides speed.
and by the way, if the violin box uses your software they have just moved
from a resource for me to tap when needed to something that I will advise
my company to avoid at all costs.
David Lang
Daniel Phillips wrote:
> On Sunday 16 March 2008 20:59, [email protected] wrote:
>
>> On Sun, 16 Mar 2008, Daniel Phillips wrote:
>>
>>> On Sunday 16 March 2008 18:31, David Newall wrote:
>>>
>>>> Daniel Phillips wrote:
>>>>
>>>>> The UPS provides a guarantee of commit to stable storage. No amount of
>>>>> FUD will change that.
>>>>>
>>>> What about system crashes? They guarantee that data will be lost. I
>>>>
>>> Not if it is mirrored and replicated. Also nice if crashes are very
>>> rare, which they are unless you work at it.
>>>
>> if you are depending on replication over the network you have just limited
>> your throughput to your network speed and latency.
>>
>
> Replication does not work that way. On each replication cycle, the
> differences between the most recent two volume snapshots go over the
> network. [...]
> Mirroring on the other hand, makes a realtime copy of a volume, that is
> never out of date.
>
I think you've just tried to obfuscate the truth. As you have
described, replication does not provide full protection against data
loss; it loses all changes since the last cycle. Recall that it was you who
introduced the word "replication", in the context of guaranteeing no
loss of data. Then you ignored David's point about the relatively low
speed of networks, remarking only that mirroring is real-time. Reading
between your words makes clear that "mirroring and replication" does
reduce the performance. (You claimed microsecond-level transaction times.)
> Designing for dependably high transaction rates
> requires a different mode of thinking that some traditionalists seem to
> be having some trouble with.
You've rather under-valued dependability, though. Even your idea of
mirroring systems is incomplete, because failure of the principal system
requires transparent fail-over to the redundant system, which is
actually quite challenging, especially with commodity systems cobbled
together in the way you promote. Remember that you claimed
microsecond-level transaction times, and 6-nines of availability. The
former seems unlikely with replicated systems and, in the event of a
failure, you won't achieve the latter.
You still haven't investigated the benefit of your idea over a whopping
great buffer cache. What's the point in all of this if it turns out, as
Alan hinted should be the case, that a big buffer cache gives much the
same performance? You appear to have gone to a great deal of effort
without having performed quite simple yet obvious experiments.
On Sunday 16 March 2008 23:49, [email protected] wrote:
> > Mirroring on the other hand, makes a realtime copy of a volume, that is
> > never out of date.
>
> so just mirror to a local disk array then.
Great idea. Except that the disk array has millisecond level latency,
when what we are trying to achieve is microsecond level latency.
> a local disk array has more write bandwidth than a network connection to a
> remote machine, so if you can mirror to a remote machine you can mirror to
> a local disk array.
So you could potentially connect to a _huge_ disk array and write deltas
to it. The disk array would have to support roughly 3 Gbytes/second of
write bandwidth to keep up with the Violin ramdisk. Doable, but you are
now in the serious heavy iron zone.
Personally, I like my nice simple design a lot more. Just mirror it, as
many times as you need to satisfy your paranoia. Or how about go write
your own?
Daniel
On Monday 17 March 2008 00:14, David Newall wrote:
> >> if you are depending on replication over the network you have just limited
> >> your throughput to your network speed and latency.
> >
> > Replication does not work that way. On each replication cycle, the
> > differences between the most recent two volume snapshots go over the
> > network. [...]
> > Mirroring on the other hand, makes a realtime copy of a volume, that is
> > never out of date.
>
> I think you've just tried to obfuscate the truth. As you have
> described, replication does not provide full protection against data
> loss; it loses all changes since last cycle. Recall that it was you who
> introduced the word "replication", in the context of guaranteeing no
> loss of data.
You are twisting words. I may have said that replication provides a
point-in-time copy of a volume, which is exactly what it does, no more,
no less.
> You still haven't investigated the benefit of your idea over a whopping
> great buffer cache. What's the point in all of this if it turns out, as
> Alan hinted should be the case, that a big buffer cache gives much the
> same performance? You appear to have gone to a great deal of effort
> without having performed quite simple yet obvious experiments.
A big buffer cache does not provide a guarantee that the dirty cache
data is saved to disk when line power is lost. If you would like to
add that feature to the Linux buffer cache, then please do it, or make
whichever other contribution you wish to make. If you just want to
explain to me one more time that Linux, batteries, whatever, cannot
be relied on, then please do not include me in the CC list.
Daniel
> So you could potentially connect to a _huge_ disk array and write deltas
> to it. The disk array would have to support roughly 3 Gbytes/second of
> write bandwidth to keep up with the Violin ramdisk. Doable, but you are
> now in the serious heavy iron zone.
Mirroring yes, snapshotting no.
> Personally, I like my nice simple design a lot more.
So we've all noticed
Alan
> You did not explain how your proposal will avoid dropping the transaction
I did but I've attached a much more complete explanation below.
> throughput down to disk speed, whereas I have explained how my existing
> design can be made to achieve enterprise-grade data safety.
Here is a simple approach, albeit one that uses a lot of physical storage
(but hey, disks are cheap).
You walk across the ram dirty table writing out chunks to backing
store 0.
At some point in time you want a consistent snapshot so you pick the next
write barrier point after this time and begin committing blocks dirtied
after that moment to store 1 (with blocks before that moment being
written to both). You don't permit more than one snapshot to be in
progress at once so at some point you clear all the blocks for store 0.
Your snapshotting interval is bounded by the time to write out the store,
and you do not have to throttle writes to the ramdisk.
You now have a consistent snapshot in store 0. At the next time interval
we finish off store 1 and spew new blocks to store 2; after 2 is complete
we go with 2, 0 and then 1 as the stable store.
The only other real trick needed then is metadata, but you don't have to
update that on disk too often and you only need two bits for each page
in RAM.
For any page it is either
00 Clean on stable store
01 Clean on current writing snapshot
10 Dirty on stable store (and thus both)
11 Dirty on current writing snapshot (but clean, old on stable)
Pages go 00->11 or 01->11 when they are touched, 11->01 or 10->01 when
they are written back.
At the point we freeze a snapshot we move 01->00 11->10 00->11 and there
are no pages in 10. And of course we don't update the big tables at this
instant; instead we store the page state as
(value - cycle_count)&3
with each freeze moment doing
cycle_count++;
The 00->11 is perhaps not obvious but the logic is fairly simple. The
snapshot we are building does not magically contain the stable data from
a previous snapshot.
Say 0 is our stable snapshot
snapshot 0 page 0 contains the stable copy of a page
snapshot 1 is currently being updated
if we touched the page during the lifetime of snapshot 1 the newer data
will be written to snapshot 1; if not, then snapshot 1 does not contain
useful data (it is stale). What we must not do is permit a situation to
occur where snapshot 0 is overwritten and holds the last stable copy of a
block. If we move from "clean on stable store" to "dirty on stable store"
then we know our worst case is
Written on snapshot 0 (01)
Not written to snapshot 1 (00)
Dirty on current snapshot (11)
Written on snapshot 2 (01)
The page sweeping algorithm is
00 -> do nothing (it may be cheaper to write the blocks and go to 01)
01 -> do nothing (ditto)
10 -> write to both stable and current snapshot, move to 01
11 -> write to current snapshot, move to 01
adjust dirty counts, check if ready to flip.
The recovery algorithm is
Read state snapshot number
Read blocks from stable snapshot if written to it
From previous snapshot if not
Thus we need to write a 'written this snapshot' table as we update a
snapshot - but it can lag and need only be completed when we decide the
snapshot is 'done'. Until the point we switch stable snapshots, the
metadata and data for the current writeout are not used and so are not relevant.
And there are far more elegant ways to do this, although some I suspect
may still be patented.
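A minimal sketch of the bookkeeping described above may make the rotation
easier to follow. The names, table layout and sizes are illustrative
assumptions, not working code; the one transition not spelled out above
(an application touch of a page already in state 10) is handled here by
leaving the page in 10, on the assumption that the next sweep writing it
to both stores still captures the newest data.

	/*
	 * Illustrative sketch of the two-bit per-page state machine above.
	 * Stored values are decoded as (value - cycle_count) & 3, so a
	 * snapshot freeze renames every state (01->00, 11->10, 00->11)
	 * without walking the table.
	 */
	#include <stdint.h>

	#define NPAGES (1u << 18)               /* example table size */

	enum page_state {
		CLEAN_STABLE  = 0,      /* 00: clean on stable store */
		CLEAN_CURRENT = 1,      /* 01: clean on current writing snapshot */
		DIRTY_STABLE  = 2,      /* 10: dirty on stable store (and thus both) */
		DIRTY_CURRENT = 3,      /* 11: dirty on current writing snapshot only */
	};

	enum store { STABLE_STORE, CURRENT_STORE };
	extern void write_to_store(enum store store, unsigned page);

	static uint8_t  stored[NPAGES];         /* two bits per page would do */
	static unsigned cycle_count;            /* bumped once per freeze */

	static enum page_state get_state(unsigned page)
	{
		return (stored[page] - cycle_count) & 3;
	}

	static void set_state(unsigned page, enum page_state state)
	{
		stored[page] = (state + cycle_count) & 3;
	}

	/* Application write: 00->11 or 01->11.  A page already in 10 is
	 * left alone (assumption): the sweep still writes its newest data
	 * to both stores. */
	static void page_touched(unsigned page)
	{
		if (get_state(page) != DIRTY_STABLE)
			set_state(page, DIRTY_CURRENT);
	}

	/* One step of the background sweep. */
	static void sweep_page(unsigned page)
	{
		switch (get_state(page)) {
		case DIRTY_STABLE:              /* 10: write to both, -> 01 */
			write_to_store(STABLE_STORE, page);
			write_to_store(CURRENT_STORE, page);
			set_state(page, CLEAN_CURRENT);
			break;
		case DIRTY_CURRENT:             /* 11: write to current, -> 01 */
			write_to_store(CURRENT_STORE, page);
			set_state(page, CLEAN_CURRENT);
			break;
		default:                        /* 00, 01: nothing to do */
			break;
		}
	}

	/* Freeze a snapshot: 01->00, 11->10, 00->11 for every page, in O(1). */
	static void freeze_snapshot(void)
	{
		cycle_count++;
	}

The point of the (value - cycle_count) & 3 encoding is that the freeze
itself costs nothing per page; only the lazy sweep and the per-snapshot
'written' table ever touch per-page state.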
> I do not think you have set out to solve the same problem I have, which
> is
"to attain the highest possible transaction throughput with enterprise
scale reliability."
Well that is the problem I am interested in solving, but not the one you
seem to be working on.
Alan
Daniel Phillips wrote:
> On Sunday 16 March 2008 23:49, [email protected] wrote:
>>> Mirroring on the other hand, makes a realtime copy of a volume, that is
>>> never out of date.
>> so just mirror to a local disk array then.
>
> Great idea. Except that the disk array has millisecond level latency,
> when what we trying to achieve is microsecond level latency.
Just a point of information, most of the mid-tier and above disk arrays
can do replication/mirroring behind the scenes (i.e., you write to one
array and it takes care of replicating your write to one or more other
arrays). This behind-the-scenes replication can be over various types of
connections - IP or fibre channel probably are the two most common paths.
That will still leave you with the normal latency for a small write to
an array, which is (when you hit cache) on the order of 1-2 ms...
ric
On Mon, 17 Mar 2008, Daniel Phillips wrote:
> On Sunday 16 March 2008 23:49, [email protected] wrote:
>>> Mirroring on the other hand, makes a realtime copy of a volume, that is
>>> never out of date.
>>
>> so just mirror to a local disk array then.
>
> Great idea. Except that the disk array has millisecond level latency,
> when what we trying to achieve is microsecond level latency.
>
>> a local disk array has more write bandwidth than a network connection to a
>> remote machine, so if you can mirror to a remote machine you can mirror to
>> a local disk array.
>
> So you could potentially connect to a _huge_ disk array and write deltas
> to it. The disk array would have to support roughly 3 Gbytes/second of
> write bandwidth to keep up with the Violin ramdisk. Doable, but you are
> now in the serious heavy iron zone.
your network will do less than 1 Gbit/sec, so to mirror in real-time (what
you claim is trivial) you would need at least 24 network connections in
parallel. that's a LOT harder to set up than a high performance disk array.
David Lang
On Mon, 17 Mar 2008, [email protected] wrote:
> On Mon, 17 Mar 2008, Daniel Phillips wrote:
>
>> On Sunday 16 March 2008 23:49, [email protected] wrote:
>>>> Mirroring on the other hand, makes a realtime copy of a volume, that is
>>>> never out of date.
>>>
>>> so just mirror to a local disk array then.
>>
>> Great idea. Except that the disk array has millisecond level latency,
>> when what we trying to achieve is microsecond level latency.
>>
>>> a local disk array has more write bandwidth than a network connection to a
>>> remote machine, so if you can mirror to a remote machine you can mirror to
>>> a local disk array.
>>
>> So you could potentially connect to a _huge_ disk array and write deltas
>> to it. The disk array would have to support roughly 3 Gbytes/second of
>> write bandwidth to keep up with the Violin ramdisk. Doable, but you are
>> now in the serious heavy iron zone.
>
> your network will do less then 1 Gbit/sec, so to mirror in real-time (what
> you claim is trivial) you would need at least 24 network connections in
> parallel. that's a LOT harder to setup then a high performance disk array.
by the way, the only way to get this much bandwidth between two machines
is to directly connect PCI-e/16 card slots together. this is definitely
not commodity hardware anymore (if it's even possible, PCI-e has some very
short distance limitations)
David Lang
On Mon, Mar 17, 2008 at 10:23:10AM -0700, [email protected] wrote:
> On Mon, 17 Mar 2008, [email protected] wrote:
>
> >On Mon, 17 Mar 2008, Daniel Phillips wrote:
> >
> >>On Sunday 16 March 2008 23:49, [email protected] wrote:
> >>>>Mirroring on the other hand, makes a realtime copy of a volume, that is
> >>>>never out of date.
> >>>
> >>>so just mirror to a local disk array then.
> >>
> >>Great idea. Except that the disk array has millisecond level latency,
> >>when what we trying to achieve is microsecond level latency.
> >>
> >>>a local disk array has more write bandwidth than a network connection to
> >>>a
> >>>remote machine, so if you can mirror to a remote machine you can mirror
> >>>to
> >>>a local disk array.
> >>
> >>So you could potentially connect to a _huge_ disk array and write deltas
> >>to it. The disk array would have to support roughly 3 Gbytes/second of
> >>write bandwidth to keep up with the Violin ramdisk. Doable, but you are
> >>now in the serious heavy iron zone.
> >
> >your network will do less then 1 Gbit/sec, so to mirror in real-time (what
> >you claim is trivial) you would need at least 24 network connections in
> >parallel. that's a LOT harder to setup then a high performance disk array.
>
> by the way, the only way to get this much bandwideth between two machines
> is to directly connect PCI-e/16 card slots togeather. this is definantly
> not commodity hardware anymore (if it's even possible, PCI-e has some very
> short distance limitations)
You can do that with 3 10GE NICs, though in practice that's not easy.
Willy
Daniel Phillips wrote:
> On Monday 17 March 2008 00:14, David Newall wrote:
>
>> I think you've just tried to obfuscate the truth. As you have
>> described, replication does not provide full protection against data
>> loss; it loses all changes since last cycle. Recall that it was you who
>> introduced the word "replication", in the context of guaranteeing no
>> loss of data.
>>
>
> You are twisting words.
I don't think so.
> I may have said that replication provides a
> point-in-time copy of a volume, which is exactly what it does, no more,
> no less.
>
You said that you could achieve a certain performance, and later you
said that for reliability you could use mirroring and replication but
you never said that would lead to a performance hit. In fact you don't
seem to be able to offer performance AND robustness; for performance you
can only offer that level of robustness attainable on a single system,
which I think even you agreed was really not up to snuff for
customers who would need the performance that you claim to achieve.
>> You still haven't investigated the benefit of your idea over a whopping
>> great buffer cache. What's the point in all of this if it turns out, as
>> Alan hinted should be the case, that a big buffer cache gives much the
>> same performance? You appear to have gone to a great deal of effort
>> without having performed quite simple yet obvious experiments.
>>
>
> A big buffer cache does not provide a guarantee that the dirty cache
> data saved to disk when line power is lost.
But the filesystem does offer a minimum level of consistency, which is
missing from what you propose. You propose writing nothing unless
line-power fails. The big buffer cache gives you all of the robustness
of the underlying filesystem, including dirty buffer writes at some
level greater than zero.
> If you just want to
> explain to me one more time that Linux, batteries, whatever, cannot
> be relied on, then please do not include me in the CC list.
I haven't said that at all, other than as an axiom (which even you have
agreed is fair) leading to comments on the results when something does
fail. You keep saying that it won't ever fail, then that it will but
that you can mitigate using redundant systems; and then you gloss over
or refuse to face the attendant performance hit. Finally, you still
have no idea whether your idea really does achieve a massive performance
boost. You've never compared like amounts of RAM, nor the unsynced
updates that most closely resemble your idea. In short, you've leaped
on what seems to you to be a good idea and steadfastly refused to
conduct even basic research. What's the point?
You say don't cc you; I say go away, do that basic research, and come
back when you have hard data. I really don't think you can ask for
fairer than that.
Daniel Phillips wrote:
> You will need:
>
> * Two Violin 1010 memory devices
>
> * Four UPS units each rated for one hour at 600 Watts
>
> * Two servers, each with at least two 8x PCI-e slots
>
> * One SAN with a bunch of 15K rpm scsi disks
>
Honestly, this isn't at all what I understood you to be talking about.
In your very first post you described ramback as "a new virtual device
with the ability to back a ramdisk by a real disk." I assumed "ramdisk"
to mean the Linux pseudo-device. I never understood that you meant an
external device, in fact I thought ramback was *instead of* a Violin.
Since you're talking about external ram, it survives operating system
crashes and of course that is not such a big deal as I was thinking.
> If the primary server goes down then we lose exactly one interesting
> data element stored only in its memory: the ramdisk dirty map. Though
> we could be much cleverer, the backup server will simply set its entire
> map to dirty when it takes over, and will duly sync back the entire
> ramdisk to media a few minutes later.
Assuming the 1010 has proper power isolation that makes sense and would
give you pretty much what was wanted. Especially as you can replace the
secondary server with a small system which does *nothing* but sync the
ramdisk back and so is far less likely to crash.
Does make you wonder why it's not built into the violin ;)
Alan
On Tue, 18 Mar 2008, David Newall wrote:
> Daniel Phillips wrote:
>> You will need:
>>
>> * Two Violin 1010 memory devices
>>
>> * Four UPS units each rated for one hour at 600 Watts
>>
>> * Two servers, each with at least two 8x PCI-e slots
>>
>> * One SAN with a bunch of 15K rpm scsi disks
>>
>
>
> Honestly, this isn't at all what I understood you to be talking about.
> In your very first post you described ramback as "a new virtual device
> with the ability to back a ramdisk by a real disk." I assumed "ramdisk"
> to mean the Linux psuedo-device. I never understood that you meant an
> external device, in fact I thought ramback was *instead of* a Violin.
> Since you're talking about external ram, it survives operating system
> crashes and of course that is not such a big deal as I was thinking.
I agree. I didn't think ramback was just an advertising mechanism for
violin, I thought it was being presented as something that could be used
with enterprise hardware like the Sun x4600 (which can hold 256G of ram)
and that talk of the violin was just the claim that lots of ram is
available so this was a possibility now.
David Lang
P.S. 8x PCI-e slots are only able to do 2 GB/s of data transfer; in an
earlier post Daniel claimed that you needed 3 GB/s of disk bandwidth to
mirror to the disk, so the scenario listed above loses 1/3 of its
performance compared to what was being claimed earlier.
On Mon 2008-03-17 00:25:27, Daniel Phillips wrote:
> On Monday 17 March 2008 00:14, David Newall wrote:
> > >> if you are depending on replication over the network you have just limited
> > >> your throughput to your network speed and latency.
> > >
> > > Replication does not work that way. On each replication cycle, the
> > > differences between the most recent two volume snapshots go over the
> > > network. [...]
> > > Mirroring on the other hand, makes a realtime copy of a volume, that is
> > > never out of date.
> >
> > I think you've just tried to obfuscate the truth. As you have
> > described, replication does not provide full protection against data
> > loss; it loses all changes since last cycle. Recall that it was you who
> > introduced the word "replication", in the context of guaranteeing no
> > loss of data.
>
> You are twisting words. I may have said that replication provides a
> point-in-time copy of a volume, which is exactly what it does, no more,
> no less.
>
> > You still haven't investigated the benefit of your idea over a whopping
> > great buffer cache. What's the point in all of this if it turns out, as
> > Alan hinted should be the case, that a big buffer cache gives much the
> > same performance? You appear to have gone to a great deal of effort
> > without having performed quite simple yet obvious experiments.
>
> A big buffer cache does not provide a guarantee that the dirty cache
> data saved to disk when line power is lost. If you would like to
on_battery_power:
sync
mount -o remount,sync /
...will of course work okay on any reasonable system. Not on yours,
because you have to do
echo i_really_mean_sync_when_i_say_sync > /hidden/file/somewhere
sync
(...which also shows that you are cheating).
Now, will you either do your homework and show that page cache is
somehow unsuitable for your job, or just stop wasting the bandwidth
with useless rants?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Sunday 23 March 2008 02:33, Pavel Machek wrote:
> Now, will you either do your homework and show that page cache is
> somehow unsuitable for your job, or just stop wasting the bandwidth
> with useless rants?
Speaking of useless rants...
You need to go read the whole thread again, you missed the main bit.
Daniel
On Tuesday 18 March 2008 06:57, Alan Cox wrote:
> > If the primary server goes down then we lose exactly one interesting
> > data element stored only in its memory: the ramdisk dirty map. Though
> > we could be much cleverer, the backup server will simply set its entire
> > map to dirty when it takes over, and will duly sync back the entire
> > ramdisk to media a few minutes later.
>
> Assuming the 1010 has proper power isolation that makes sense and would
> give you pretty much what was wanted. Especially as you can replace the
> secondary server with a small system which does *nothing* but sync the
> ramdisk back and so is far less likely to crash.
The 1010 just has power rails that connect to dual power supplies if
that is what you mean.
> Does make you wonder why its not built into the violin ;)
Perhaps adding what amounts to a server motherboard did not make sense,
when an external server can already do the job. I don't know, there is
a certain purity in the JBOM concept (just a box of memory), and no
doubt they were able to bring this beast to market earlier because of
it.
I would be far from surprised to see a later incarnation incorporate a
hard disk. Did I mention it already has Linux running inside it to do
the raid management etc?
Regards,
Daniel
On Tuesday 18 March 2008 09:36, [email protected] wrote:
> P.S. 8xPCI-e slots are only able to do 2GB/s of data transfer, in an
> earlier post Daniel claimed that you needed 3GB/s of disk to mirror to the
> disk, so the scenerio listed above looses 1/3 of it's performance compared
> to what was being claimed earlier.
Each Violin 1010 has two 8xPCI-e interfaces.
Daniel
On Mon, 31 Mar 2008, Daniel Phillips wrote:
> On Tuesday 18 March 2008 09:36, [email protected] wrote:
>> P.S. 8xPCI-e slots are only able to do 2GB/s of data transfer, in an
>> earlier post Daniel claimed that you needed 3GB/s of disk to mirror to the
>> disk, so the scenerio listed above looses 1/3 of it's performance compared
>> to what was being claimed earlier.
>
> Each Violin 1010 has two 8xPCI-e interfaces.
but if you use them for redundancy by hooking them to two machines you can
only use one of them on your active system.
David Lang