2007-01-11 02:57:09

by Aubrey Li

Subject: O_DIRECT question

Hi all,

Opening a file with the O_DIRECT flag gives un-buffered read/write access.
So if I need un-buffered access, I have to change all of my applications
to add this flag. What's more, some commands like "cp oldfile newfile"
still use the page cache and buffers.
Now, my question is: is there an existing way to mount a filesystem with
the O_DIRECT flag, so that I don't need to change anything in my system?
If there is no such option so far, what is the right way to achieve my
purpose?

Thanks a lot.
-Aubrey


2007-01-11 03:16:23

by Linus Torvalds

Subject: Re: O_DIRECT question



On Wed, 10 Jan 2007, Linus Torvalds wrote:
>
> So don't use O_DIRECT. Use things like madvise() and posix_fadvise()
> instead.

Side note: the only reason O_DIRECT exists is because database people are
too used to it, because other OS's haven't had enough taste to tell them
to do it right, so they've historically hacked their OS to get out of the
way.

As a result, our madvise and/or posix_fadvise interfaces may not be all
that strong, because people sadly don't use them that much. It's a sad
example of a totally broken interface (O_DIRECT) resulting in better
interfaces not getting used, and then not getting as much development
effort put into them.

So O_DIRECT not only is a total disaster from a design standpoint (just
look at all the crap it results in), it also indirectly has hurt better
interfaces. For example, POSIX_FADV_NOREUSE (which _could_ be a useful and
clean interface to make sure we don't pollute memory unnecessarily with
cached pages after they are all done) ends up being a no-op ;/

Sad. And it's one of those self-fulfilling prophecies. Still, I hope some
day we can just rip the damn disaster out.

Linus

2007-01-11 03:06:20

by Linus Torvalds

Subject: Re: O_DIRECT question



On Thu, 11 Jan 2007, Aubrey wrote:
>
> Now, my question is, is there a existing way to mount a filesystem
> with O_DIRECT flag? so that I don't need to change anything in my
> system. If there is no option so far, What is the right way to achieve
> my purpose?

The right way to do it is to just not use O_DIRECT.

The whole notion of "direct IO" is totally braindamaged. Just say no.

This is your brain: O
This is your brain on O_DIRECT: .

Any questions?

I should have fought back harder. There really is no valid reason for EVER
using O_DIRECT. You need a buffer whatever IO you do, and it might as well
be the page cache. There are better ways to control the page cache than
play games and think that a page cache isn't necessary.

So don't use O_DIRECT. Use things like madvise() and posix_fadvise()
instead.
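
As a rough illustration of that approach (a sketch only -- the file name
and buffer size are arbitrary, and POSIX_FADV_DONTNEED is used here since,
as noted elsewhere in this thread, POSIX_FADV_NOREUSE is currently a no-op):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Read through the page cache as usual, then tell the kernel the cached
 * pages are no longer needed so they can be dropped early. */
static void read_without_polluting(const char *path)
{
	char buf[65536];
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return;
	posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);	/* optional readahead hint */
	while (read(fd, buf, sizeof(buf)) > 0)
		;					/* consume the data */
	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);	/* len 0 means "to end of file" */
	close(fd);
}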

Linus

2007-01-11 04:52:39

by Andrew Morton

Subject: Re: O_DIRECT question

On Thu, 11 Jan 2007 10:57:06 +0800
Aubrey <[email protected]> wrote:

> Hi all,
>
> Opening file with O_DIRECT flag can do the un-buffered read/write access.
> So if I need un-buffered access, I have to change all of my
> applications to add this flag. What's more, Some scripts like "cp
> oldfile newfile" still use pagecache and buffer.
> Now, my question is, is there a existing way to mount a filesystem
> with O_DIRECT flag? so that I don't need to change anything in my
> system. If there is no option so far, What is the right way to achieve
> my purpose?

Not possible, basically.

O_DIRECT reads and writes must be aligned to the device's block size
(usually 512 bytes) in memory addresses, file offsets and read/write request
sizes. Very few applications will bother to do that and will hence fail if
their files are automagically opened with O_DIRECT.
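
To illustrate those alignment rules, a sketch assuming a 512-byte logical
block size (real code should query the device, e.g. via the BLKSSZGET
ioctl; the path and sizes here are arbitrary):

#define _GNU_SOURCE		/* for O_DIRECT with glibc */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Aligned O_DIRECT read: the buffer address, the file offset and the
 * request size are all multiples of the assumed 512-byte block size. */
int direct_read_example(const char *path)
{
	void *buf;
	int fd;
	ssize_t n;

	if (posix_memalign(&buf, 512, 4096))	/* 512-byte aligned buffer */
		return -1;
	fd = open(path, O_RDONLY | O_DIRECT);
	if (fd < 0) {
		free(buf);
		return -1;
	}
	n = pread(fd, buf, 4096, 0);		/* length 4096, offset 0: both aligned */
	close(fd);
	free(buf);
	return n < 0 ? -1 : 0;
}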

2007-01-11 05:50:55

by Aubrey Li

Subject: Re: O_DIRECT question

Firstly, I want to say I'm working on a no-MMU arch and uClinux.
After a lot of file operations the VFS cache eats up all of the memory.
At that point, if an application requests memory of order > 3, the
kernel reports failure.

uClinux uses a memory-mapped MTD driver to store the rootfs, which of
course is in RAM, so I don't need the VFS cache to improve performance.
And when order > 3, __alloc_pages() doesn't even try to shrink the cache
and slab; it just reports failure.

So my thought is to remove the cache, or limit it. But currently there
seems to be no way in the kernel to do that, so I wanted to try
O_DIRECT. But it seems that's not the right way.

Thanks for your suggestion about my case.

Regards,
-Aubrey

On 1/11/07, Linus Torvalds <[email protected]> wrote:
>
>
> On Thu, 11 Jan 2007, Aubrey wrote:
> >
> > Now, my question is, is there a existing way to mount a filesystem
> > with O_DIRECT flag? so that I don't need to change anything in my
> > system. If there is no option so far, What is the right way to achieve
> > my purpose?
>
> The right way to do it is to just not use O_DIRECT.
>
> The whole notion of "direct IO" is totally braindamaged. Just say no.
>
> This is your brain: O
> This is your brain on O_DIRECT: .
>
> Any questions?
>
> I should have fought back harder. There really is no valid reason for EVER
> using O_DIRECT. You need a buffer whatever IO you do, and it might as well
> be the page cache. There are better ways to control the page cache than
> play games and think that a page cache isn't necessary.
>
> So don't use O_DIRECT. Use things like madvise() and posix_fadvise()
> instead.
>
> Linus
>

2007-01-11 06:06:41

by Andrew Morton

Subject: Re: O_DIRECT question

On Thu, 11 Jan 2007 13:50:53 +0800
Aubrey <[email protected]> wrote:

> Firstly I want to say I'm working on no-mmu arch and uClinux.
> After much of file operations VFS cache eat up all of the memory.
> At this time, if an application request memory which order > 3, the
> kernel will report failure.

nommu kernels should probably run reclaim for higher-order allocations as
well.

That's rather a blunt instrument. The "lumpy reclaim" patches in -mm
provide a much better approach, but they need more work yet (although I
don't immediately recall what's needed).

In the interim you could do the old "echo 3 > /proc/sys/vm/drop_caches"
thing, but that's terribly crude - drop_caches is really only for debugging
and benchmarking.

2007-01-11 06:10:17

by Nick Piggin

Subject: Re: O_DIRECT question

Linus Torvalds wrote:
>
> On Wed, 10 Jan 2007, Linus Torvalds wrote:
>
>>So don't use O_DIRECT. Use things like madvise() and posix_fadvise()
>>instead.
>
>
> Side note: the only reason O_DIRECT exists is because database people are
> too used to it, because other OS's haven't had enough taste to tell them
> to do it right, so they've historically hacked their OS to get out of the
> way.
>
> As a result, our madvise and/or posix_fadvise interfaces may not be all
> that strong, because people sadly don't use them that much. It's a sad
> example of a totally broken interface (O_DIRECT) resulting in better
> interfaces not getting used, and then not getting as much development
> effort put into them.
>
> So O_DIRECT not only is a total disaster from a design standpoint (just
> look at all the crap it results in), it also indirectly has hurt better
> interfaces. For example, POSIX_FADV_NOREUSE (which _could_ be a useful and
> clean interface to make sure we don't pollute memory unnecessarily with
> cached pages after they are all done) ends up being a no-op ;/
>
> Sad. And it's one of those self-fulfilling prophecies. Still, I hope some
> day we can just rip the damn disaster out.

Speaking of which, why did we obsolete raw devices? And/or why not just
go with a minimal O_DIRECT on block device support? Not a rhetorical
question -- I wasn't involved in the discussions when they happened, so
I would be interested.

O_DIRECT is still crazily racy versus pagecache operations. Chris Mason's
recent patches to attempt to fix it, while actually doing quite a fine
job, are very intrusive and complex for such a sad corner case.

--
SUSE Labs, Novell Inc.

2007-01-11 06:16:16

by Alexander Shishkin

Subject: Re: O_DIRECT question

On 1/11/07, Aubrey <[email protected]> wrote:
> Firstly I want to say I'm working on no-mmu arch and uClinux.
> After much of file operations VFS cache eat up all of the memory.
> At this time, if an application request memory which order > 3, the
> kernel will report failure.
>
> uClinux use a memory mapped MTD driver to store rootfs, of course it's
> in the ram,
> So I don't need VFS cache to improve performance. And when order > 3,
> __alloc_page() even doesn't try to shrunk cache and slab, just report
> failure.
>
> So my thought is remove cache, or limit it. But currently there seems
> to be no way in the kernel to do it. So I want to try to use
> O_DIRECT. But it seems not to be a right way.
One possibility might be to poke the open method in struct
file_operations of your fs like

static int my_open_file(struct inode *inode, struct file *filp)
{
	filp->f_flags |= O_DIRECT;
	...
}

which is a nasty thing to do but might give you an idea of what happens next.

Regards,
--
I like long walks, especially when they are taken by people who annoy me.

2007-01-11 06:45:15

by Aubrey Li

Subject: Re: O_DIRECT question

On 1/11/07, Andrew Morton <[email protected]> wrote:
> On Thu, 11 Jan 2007 13:50:53 +0800
> Aubrey <[email protected]> wrote:
>
> > Firstly I want to say I'm working on no-mmu arch and uClinux.
> > After much of file operations VFS cache eat up all of the memory.
> > At this time, if an application request memory which order > 3, the
> > kernel will report failure.
>
> nommu kernels should probably run reclaim for higher-order allocations as
> well.

Here is the limitation: rebalance doesn't occur if order > 3.
	/*
	 * Don't let big-order allocations loop unless the caller explicitly
	 * requests that.  Wait for some write requests to complete then retry.
	 *
	 * In this implementation, __GFP_REPEAT means __GFP_NOFAIL for order
	 * <= 3, but that may not be true in other implementations.
	 */
	do_retry = 0;
	if (!(gfp_mask & __GFP_NORETRY)) {
		if ((order <= 3) || (gfp_mask & __GFP_REPEAT))
			do_retry = 1;
		if (gfp_mask & __GFP_NOFAIL)
			do_retry = 1;
	}
	if (do_retry) {
		blk_congestion_wait(WRITE, HZ/50);
		goto rebalance;
	}

>
> That's rather a blunt instrument. The "lumpy reclaim" patches in -mm
> provide a much better approach, but they need more work yet (although I
> don't immediately recall what's needed).

Thanks, I'll take a look.

>
> In the interim you could do the old "echo 3 > /proc/sys/vm/drop_caches"
> thing, but that's terribly crude - drop_caches is really only for debugging
> and benchmarking.
>
Yes. This method can drop caches, but it will fragment memory. This is
not what I want. I want the cache to be limited to a tunable fraction of
the whole memory. For example, if total memory is 128M, is there a way to
trigger reclaim when the cache size exceeds 16M?

-Aubrey

2007-01-11 06:57:26

by Aubrey Li

Subject: Re: O_DIRECT question

On 1/11/07, Alexander Shishkin <[email protected]> wrote:
> On 1/11/07, Aubrey <[email protected]> wrote:
> > Firstly I want to say I'm working on no-mmu arch and uClinux.
> > After much of file operations VFS cache eat up all of the memory.
> > At this time, if an application request memory which order > 3, the
> > kernel will report failure.
> >
> > uClinux use a memory mapped MTD driver to store rootfs, of course it's
> > in the ram,
> > So I don't need VFS cache to improve performance. And when order > 3,
> > __alloc_page() even doesn't try to shrunk cache and slab, just report
> > failure.
> >
> > So my thought is remove cache, or limit it. But currently there seems
> > to be no way in the kernel to do it. So I want to try to use
> > O_DIRECT. But it seems not to be a right way.
> One possibility might be to poke the open method in struct
> file_operations of your fs like
>
> static int my_open_file(struct inode *inode, struct file *filp)
> {
> filp->f_flags |= O_DIRECT;
> ...
> }
>
> which is a nasty thing to do but might give you an idea of what happens next.

Thanks, I'll try to see what happens next on my side.

-Aubrey

2007-01-11 06:57:57

by Andrew Morton

Subject: Re: O_DIRECT question

On Thu, 11 Jan 2007 14:45:12 +0800
Aubrey <[email protected]> wrote:

> >
> > In the interim you could do the old "echo 3 > /proc/sys/vm/drop_caches"
> > thing, but that's terribly crude - drop_caches is really only for debugging
> > and benchmarking.
> >
> Yes. This method can drop caches, but will fragment memory.

That's what page reclaim will do as well.

What you want is Mel's antifragmentation work, or lumpy reclaim.

> This is
> not what I want. I want cache is limited to a tunable value of the
> whole memory. For example, if total memory is 128M, is there a way to
> trigger reclaim when cache size > 16M?

If there was, it'd "fragment memory" as well.

You might get a little benefit from increasing /proc/sys/vm/min_free_kbytes,
but not much. Some page allocation tweaks would aid that.

But basically, to do this well, serious work is needed.

2007-01-11 07:05:52

by Nick Piggin

Subject: Re: O_DIRECT question

Andrew Morton wrote:
> On Thu, 11 Jan 2007 14:45:12 +0800
> Aubrey <[email protected]> wrote:
>
>
>>>In the interim you could do the old "echo 3 > /proc/sys/vm/drop_caches"
>>>thing, but that's terribly crude - drop_caches is really only for debugging
>>>and benchmarking.
>>>
>>
>>Yes. This method can drop caches, but will fragment memory.
>
>
> That's what page reclaim will do as well.
>
> What you want is Mel's antifragmentation work, or lumpy reclaim.
>
>
>>This is
>>not what I want. I want cache is limited to a tunable value of the
>>whole memory. For example, if total memory is 128M, is there a way to
>>trigger reclaim when cache size > 16M?
>
>
> If there was, it'd "fragment memory" as well.
>
> You might get a little benefit from increasing /proc/sys/vm/min_free_kbytes,
> but not much. Some page allocation tweaks would aid that.
>
> But basically, to do this well, serious work is needed.

OTOH, the antifragmentation stuff can also break down eventually,
especially if higher order allocations are actually in common use.

What you _really_ want to do is avoid large mallocs after boot, or use
a CPU with an mmu. I don't think nommu linux was ever intended to be a
simple drop in replacement for a normal unix kernel.

--
SUSE Labs, Novell Inc.

2007-01-11 07:54:09

by Aubrey Li

Subject: Re: O_DIRECT question

On 1/11/07, Nick Piggin <[email protected]> wrote:
> Andrew Morton wrote:
> > On Thu, 11 Jan 2007 14:45:12 +0800
> > Aubrey <[email protected]> wrote:
> >
> >
> >>>In the interim you could do the old "echo 3 > /proc/sys/vm/drop_caches"
> >>>thing, but that's terribly crude - drop_caches is really only for debugging
> >>>and benchmarking.
> >>>
> >>
> >>Yes. This method can drop caches, but will fragment memory.
> >
> >
> > That's what page reclaim will do as well.
> >
> > What you want is Mel's antifragmentation work, or lumpy reclaim.
> >
> >
> >>This is
> >>not what I want. I want cache is limited to a tunable value of the
> >>whole memory. For example, if total memory is 128M, is there a way to
> >>trigger reclaim when cache size > 16M?
> >
> >
> > If there was, it'd "fragment memory" as well.
> >
> > You might get a little benefit from increasing /proc/sys/vm/min_free_kbytes,
> > but not much. Some page allocation tweaks would aid that.
> >
> > But basically, to do this well, serious work is needed.
>
> OTOH, the antifragmentation stuff can also break down eventually,
> especially if higher order allocations are actually in common use.

That's right. When the VFS cache eats up almost all of the memory, I
think no memory algorithm can help this case, including Mel's
anti-fragmentation patches.

>
> What you _really_ want to do is avoid large mallocs after boot, or use
> a CPU with an mmu. I don't think nommu linux was ever intended to be a
> simple drop in replacement for a normal unix kernel.

Is there a position available working on an MMU CPU? Just joking :)
Yes, some problems are serious on no-MMU Linux, but I think we should
try to fix them, not avoid them.

-Aubrey

2007-01-11 08:05:21

by Roy Huang

Subject: Re: O_DIRECT question

On an embedded system, limiting the page cache can relieve memory
fragmentation. Here is a patch against 2.6.19 which limits the page
cache of every opened file as well as the total page cache. When a
limit is reached, it releases the page cache that exceeds the limit.


Index: include/linux/pagemap.h
===================================================================
--- include/linux/pagemap.h (revision 2628)
+++ include/linux/pagemap.h (working copy)
@@ -12,6 +12,7 @@
#include <asm/uaccess.h>
#include <linux/gfp.h>

+extern int total_pagecache_limit;
/*
* Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page
* allocation mode flags.
Index: include/linux/fs.h
===================================================================
--- include/linux/fs.h (revision 2628)
+++ include/linux/fs.h (working copy)
@@ -444,6 +444,10 @@
spinlock_t private_lock; /* for use by the address_space */
struct list_head private_list; /* ditto */
struct address_space *assoc_mapping; /* ditto */
+#ifdef CONFIG_LIMIT_PAGECACHE
+ unsigned long pages_limit;
+ struct list_head page_head;
+#endif
} __attribute__((aligned(sizeof(long))));
/*
* On most architectures that alignment is already the case; but
Index: include/linux/mm.h
===================================================================
--- include/linux/mm.h (revision 2628)
+++ include/linux/mm.h (working copy)
@@ -231,6 +231,9 @@
#else
#define VM_BUG_ON(condition) do { } while(0)
#endif
+#ifdef CONFIG_LIMIT_PAGECACHE
+ struct list_head page_list;
+#endif

/*
* Methods to modify the page usage count.
@@ -1030,7 +1033,21 @@

/* mm/page-writeback.c */
int write_one_page(struct page *page, int wait);
+/* possible outcome of pageout() */

+typedef enum {
+ /* failed to write page out, page is locked */
+ PAGE_KEEP,
+ /* move page to the active list, page is locked */
+ PAGE_ACTIVATE,
+ /* page has been sent to the disk successfully, page is unlocked */
+ PAGE_SUCCESS,
+ /* page is clean and locked */
+ PAGE_CLEAN,
+} pageout_t;
+
+pageout_t pageout(struct page *page, struct address_space *mapping);
+
/* readahead.c */
#define VM_MAX_READAHEAD 128 /* kbytes */
#define VM_MIN_READAHEAD 16 /* kbytes (includes current page) */
Index: init/Kconfig
===================================================================
--- init/Kconfig (revision 2628)
+++ init/Kconfig (working copy)
@@ -419,6 +419,19 @@
option replaces shmem and tmpfs with the much simpler ramfs code,
which may be appropriate on small systems without swap.

+config LIMIT_PAGECACHE
+ bool "Limit page caches" if EMBEDDED
+
+config PAGECACHE_LIMIT
+ int "Page cache limit for every file in page unit"
+ depends on LIMIT_PAGECACHE
+ default 32
+
+config PAGECACHE_LIMIT_TOTAL
+ int "Total page cache limit in MB unit"
+ depends on LIMIT_PAGECACHE
+ default 10
+
choice
prompt "Page frame management algorithm"
default BUDDY
Index: fs/inode.c
===================================================================
--- fs/inode.c (revision 2628)
+++ fs/inode.c (working copy)
@@ -205,6 +205,10 @@
INIT_LIST_HEAD(&inode->inotify_watches);
mutex_init(&inode->inotify_mutex);
#endif
+#ifdef CONFIG_LIMIT_PAGECACHE
+ INIT_LIST_HEAD(&inode->i_data.page_head);
+ inode->i_data.pages_limit = CONFIG_PAGECACHE_LIMIT;
+#endif
}

EXPORT_SYMBOL(inode_init_once);
Index: mm/filemap.c
===================================================================
--- mm/filemap.c (revision 2628)
+++ mm/filemap.c (working copy)
@@ -18,6 +18,7 @@
#include <linux/capability.h>
#include <linux/kernel_stat.h>
#include <linux/mm.h>
+#include <linux/mm_inline.h>
#include <linux/swap.h>
#include <linux/mman.h>
#include <linux/pagemap.h>
@@ -30,6 +31,9 @@
#include <linux/security.h>
#include <linux/syscalls.h>
#include <linux/cpuset.h>
+#include <linux/rmap.h>
+#include <linux/buffer_head.h>
+#include <linux/page-flags.h>
#include "filemap.h"
#include "internal.h"

@@ -119,6 +123,9 @@
radix_tree_delete(&mapping->page_tree, page->index);
page->mapping = NULL;
mapping->nrpages--;
+#ifdef CONFIG_LIMIT_PAGECACHE
+ list_del_init(&page->page_list);
+#endif
__dec_zone_page_state(page, NR_FILE_PAGES);
}

@@ -169,6 +176,96 @@
return 0;
}

+#ifdef CONFIG_LIMIT_PAGECACHE
+static void balance_cache(struct address_space *mapping)
+{
+	/* Release half of the pages */
+	int count;
+	int nr_released = 0;
+	struct page *page;
+	struct zone *zone = NULL;
+	struct pagevec freed_pvec;
+	struct list_head ret_list;
+
+	count = mapping->nrpages / 2;
+	pagevec_init(&freed_pvec, 0);
+	INIT_LIST_HEAD(&ret_list);
+	lru_add_drain();
+	while (count-- > 0) {
+		page = list_entry(mapping->page_head.prev, struct page, page_list);
+		zone = page_zone(page);
+		TestClearPageLRU(page);
+		if (PageActive(page))
+			del_page_from_active_list(zone, page);
+		else
+			del_page_from_inactive_list(zone, page);
+
+		/* Remove from current process's page list */
+		list_del_init(&page->page_list);
+		get_page(page);
+
+		if (TestSetPageLocked(page))
+			goto __keep;
+		if (PageWriteback(page))
+			goto __keep_locked;
+		if (page_referenced(page, 1))
+			goto __keep_locked;
+		if (PageDirty(page)) {
+			switch (pageout(page, mapping)) {
+			case PAGE_KEEP:
+			case PAGE_ACTIVATE:
+				goto __keep_locked;
+			case PAGE_SUCCESS:
+				if (PageWriteback(page) || PageDirty(page))
+					goto __keep;
+				if (TestSetPageLocked(page))
+					goto __keep;
+				if (PageDirty(page) || PageWriteback(page))
+					goto __keep_locked;
+			case PAGE_CLEAN:
+				;
+			}
+		}
+
+		if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
+			goto __keep_locked;
+		if (!remove_mapping(mapping, page))
+			goto __keep_locked;
+
+		unlock_page(page);
+		nr_released++;
+		/* This page maybe in Active LRU */
+		ClearPageActive(page);
+		ClearPageUptodate(page);
+		if (!pagevec_add(&freed_pvec, page))
+			__pagevec_release_nonlru(&freed_pvec);
+		continue;
+__keep_locked:
+		unlock_page(page);
+__keep:
+		SetPageLRU(page);
+		if (PageActive(page)) {
+			add_page_to_active_list(zone, page);
+		} else {
+			add_page_to_inactive_list(zone, page);
+		}
+
+		list_add(&page->page_list, &ret_list);
+	}
+	while (!list_empty(&ret_list)) {
+		page = list_entry(ret_list.prev, struct page, page_list);
+		list_move_tail(&page->page_list, &mapping->page_head);
+		put_page(page);
+	}
+	if (pagevec_count(&freed_pvec))
+		__pagevec_release_nonlru(&freed_pvec);
+
+	if (global_page_state(NR_FILE_PAGES) > total_pagecache_limit)
+		if (zone) {
+			wakeup_kswapd(zone, 0);
+		}
+}
+#endif
+
/**
* __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
* @mapping: address space structure to write
@@ -448,6 +545,10 @@
page->mapping = mapping;
page->index = offset;
mapping->nrpages++;
+#ifdef CONFIG_LIMIT_PAGECACHE
+ list_add(&page->page_list, &mapping->page_head);
+#endif
+
__inc_zone_page_state(page, NR_FILE_PAGES);
}
write_unlock_irq(&mapping->tree_lock);
@@ -1085,6 +1186,10 @@
page_cache_release(cached_page);
if (filp)
file_accessed(filp);
+#ifdef CONFIG_LIMIT_PAGECACHE
+ if (mapping->nrpages >= mapping->pages_limit)
+ balance_cache(mapping);
+#endif
}
EXPORT_SYMBOL(do_generic_mapping_read);

@@ -2195,6 +2300,11 @@
if (cached_page)
page_cache_release(cached_page);

+#ifdef CONFIG_LIMIT_PAGECACHE
+ if (mapping->nrpages >= mapping->pages_limit)
+ balance_cache(mapping);
+#endif
+
/*
* For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
*/
Index: mm/vmscan.c
===================================================================
--- mm/vmscan.c (revision 2628)
+++ mm/vmscan.c (working copy)
@@ -116,6 +116,7 @@

static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);
+int total_pagecache_limit = CONFIG_PAGECACHE_LIMIT_TOTAL * 1024 / 4;

/*
* Add a shrinker callback to be called from the vm
@@ -292,23 +293,11 @@
unlock_page(page);
}

-/* possible outcome of pageout() */
-typedef enum {
- /* failed to write page out, page is locked */
- PAGE_KEEP,
- /* move page to the active list, page is locked */
- PAGE_ACTIVATE,
- /* page has been sent to the disk successfully, page is unlocked */
- PAGE_SUCCESS,
- /* page is clean and locked */
- PAGE_CLEAN,
-} pageout_t;
-
/*
* pageout is called by shrink_page_list() for each dirty page.
* Calls ->writepage().
*/
-static pageout_t pageout(struct page *page, struct address_space *mapping)
+pageout_t pageout(struct page *page, struct address_space *mapping)
{
/*
* If the page is dirty, only perform writeback if that write
@@ -1328,7 +1317,11 @@
order = pgdat->kswapd_max_order;
}
finish_wait(&pgdat->kswapd_wait, &wait);
- balance_pgdat(pgdat, order);
+ if (global_page_state(NR_FILE_PAGES) >= total_pagecache_limit)
+ balance_pgdat(pgdat, (global_page_state(NR_FILE_PAGES) \
+ - total_pagecache_limit), order);
+ else
+ balance_pgdat(pgdat, order);
}
return 0;
}
@@ -1344,8 +1337,10 @@
return;

pgdat = zone->zone_pgdat;
- if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0))
- return;
+ if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0)) {
+ if (global_page_state(NR_FILE_PAGES) < total_pagecache_limit)
+ return;
+ }
if (pgdat->kswapd_max_order < order)
pgdat->kswapd_max_order = order;
if (!cpuset_zone_allowed(zone, __GFP_HARDWALL))

2007-01-11 08:12:41

by Nick Piggin

Subject: Re: O_DIRECT question

Aubrey wrote:
> On 1/11/07, Nick Piggin <[email protected]> wrote:

>> What you _really_ want to do is avoid large mallocs after boot, or use
>> a CPU with an mmu. I don't think nommu linux was ever intended to be a
>> simple drop in replacement for a normal unix kernel.
>
>
> Is there a position available working on mmu CPU? Joking, :)
> Yes, some problems are serious on nommu linux. But I think we should
> try to fix them not avoid them.

Exactly, and the *real* fix is to modify userspace not to make > PAGE_SIZE
mallocs[*] if it is to be nommu friendly. It is the kernel hacks to do things
like limit cache size that are the bandaids.

Of course, being an embedded system, if they work for you then that's
really fine and you can obviously ship with them. But they don't need to
go upstream.

--
SUSE Labs, Novell Inc.

2007-01-11 08:50:03

by Roy Huang

Subject: Re: O_DIRECT question

There is already an EMBEDDED option in the config, so I think Linux is
also supporting embedded systems. There are many developers working on
embedded systems running Linux. They also hope to contribute to Linux,
so that other embedded developers can share their work.

On 1/11/07, Nick Piggin <[email protected]> wrote:
> Aubrey wrote:
> > On 1/11/07, Nick Piggin <[email protected]> wrote:
>
> >> What you _really_ want to do is avoid large mallocs after boot, or use
> >> a CPU with an mmu. I don't think nommu linux was ever intended to be a
> >> simple drop in replacement for a normal unix kernel.
> >
> >
> > Is there a position available working on mmu CPU? Joking, :)
> > Yes, some problems are serious on nommu linux. But I think we should
> > try to fix them not avoid them.
>
> Exactly, and the *real* fix is to modify userspace not to make > PAGE_SIZE
> mallocs[*] if it is to be nommu friendly. It is the kernel hacks to do things
> like limit cache size that are the bandaids.
>
> Of course, being an embedded system, if they work for you then that's
> really fine and you can obviously ship with them. But they don't need to
> go upstream.
>
> --
> SUSE Labs, Novell Inc.

2007-01-11 08:58:53

by Gerrit Huizenga

Subject: Re: O_DIRECT question


On Wed, 10 Jan 2007 20:51:57 PST, Andrew Morton wrote:
> On Thu, 11 Jan 2007 10:57:06 +0800
> Aubrey <[email protected]> wrote:
>
> > Hi all,
> >
> > Opening file with O_DIRECT flag can do the un-buffered read/write access.
> > So if I need un-buffered access, I have to change all of my
> > applications to add this flag. What's more, Some scripts like "cp
> > oldfile newfile" still use pagecache and buffer.
> > Now, my question is, is there a existing way to mount a filesystem
> > with O_DIRECT flag? so that I don't need to change anything in my
> > system. If there is no option so far, What is the right way to achieve
> > my purpose?
>
> Not possible, basically.
>
> O_DIRECT reads and writes must be aligned to the device's block size
> (usually 512 bytes) in memory addresses, file offsets and read/write request
> sizes. Very few applications will bother to do that and will hence fail if
> their files are automagically opened with O_DIRECT.

Actually, it is technically possible. We heard from some application
people that Sun/Solaris has this option. It is good if the application
is the only one using the filesystem. Supposedly there were large apps
which used lots of filesystems more or less exclusively, and this option
made people happy.

Although before Linus says it, I guess crack makes people happy, too. ;)

gerrit

2007-01-11 09:09:55

by Nick Piggin

Subject: Re: O_DIRECT question

Roy Huang wrote:
> There is already an EMBEDDED option in config, so I think linux is
> also supporting embedded system. There are many developers working on
> embedded system runing linux. They also hope to contribute to linux,
> then other embeded developers can share it.

Yes, but we don't like to apply kernel hacks regardless of whether they
"help" embedded, desktop, server or anyone else.

Note that these tricks to limit the file cache will only paper over the
actual issue, which is that large allocations must be contiguous and are
thus subject to fragmentation on nommu kernels. The real solution to
this is simply not a kernel-based one.

--
SUSE Labs, Novell Inc.

2007-01-11 12:14:28

by Viktor

Subject: Re: O_DIRECT question

Linus Torvalds wrote:
>
> On Thu, 11 Jan 2007, Aubrey wrote:
>
>>Now, my question is, is there a existing way to mount a filesystem
>>with O_DIRECT flag? so that I don't need to change anything in my
>>system. If there is no option so far, What is the right way to achieve
>>my purpose?
>
>
> The right way to do it is to just not use O_DIRECT.
>
> The whole notion of "direct IO" is totally braindamaged. Just say no.
>
> This is your brain: O
> This is your brain on O_DIRECT: .
>
> Any questions?
>
> I should have fought back harder. There really is no valid reason for EVER
> using O_DIRECT. You need a buffer whatever IO you do, and it might as well
> be the page cache. There are better ways to control the page cache than
> play games and think that a page cache isn't necessary.
>
> So don't use O_DIRECT. Use things like madvise() and posix_fadvise()
> instead.

OK, madvise() used with an mmap'ed file allows reads from a file with
zero-copy between kernel/user buffers and doesn't pollute cache memory
unnecessarily. But what about writes? How does one do zero-copy writes
to a file without polluting cache memory and without using O_DIRECT?
Am I missing the appropriate interface?

> Linus

2007-01-11 12:34:18

by linux-os (Dick Johnson)

Subject: Re: O_DIRECT question


On Wed, 10 Jan 2007, Aubrey wrote:

> Hi all,
>
> Opening file with O_DIRECT flag can do the un-buffered read/write access.
> So if I need un-buffered access, I have to change all of my
> applications to add this flag. What's more, Some scripts like "cp
> oldfile newfile" still use pagecache and buffer.
> Now, my question is, is there a existing way to mount a filesystem
> with O_DIRECT flag? so that I don't need to change anything in my
> system. If there is no option so far, What is the right way to achieve
> my purpose?
>
> Thanks a lot.
> -Aubrey
> -

I don't think O_DIRECT ever did what a lot of folks expect, i.e.,
write this buffer of data to the physical device _now_. All I/O
ends up being buffered. The `man` page states that the I/O will
be synchronous, that at the conclusion of the call, data will have
been transferred. However, the data written probably will not be
in the physical device, perhaps only in a DMA-able buffer with
a promise to get it to the SCSI device, soon.

Maybe you need to say why you want to use O_DIRECT with its terrible
performance?

Cheers,
Dick Johnson
Penguin : Linux version 2.6.16.24 on an i686 machine (5592.72 BogoMips).
New book: http://www.AbominableFirebug.com/

2007-01-11 12:45:25

by Erik Mouw

Subject: Re: O_DIRECT question

On Wed, Jan 10, 2007 at 07:05:30PM -0800, Linus Torvalds wrote:
> I should have fought back harder. There really is no valid reason for EVER
> using O_DIRECT. You need a buffer whatever IO you do, and it might as well
> be the page cache. There are better ways to control the page cache than
> play games and think that a page cache isn't necessary.

There is a valid reason: you really don't want to go through the page
cache when a hard drive has bad blocks. The only way to get fast
recovery and correct error reporting to userspace is by using O_DIRECT.

> So don't use O_DIRECT. Use things like madvise() and posix_fadvise()
> instead.

Neither does what I want: read only the sector I ask you to read, and
certainly do not try to outsmart me by doing some kind of readahead.


Erik

--
+-- Erik Mouw -- http://www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands

2007-01-11 13:06:58

by Martin Mares

Subject: Re: O_DIRECT question

Hello!

> Maybe you need to say why you want to use O_DIRECT with its terrible
> performance?

Incidentally, I was writing an external-memory radix-sort some time ago
and it turned out that writing to 256 files at once is much faster with
O_DIRECT than through the page cache, very likely because the page cache
is flushing pages in essentially random order. Tweaking VM parameters
and block device queue size helped, but only a little.

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> http://mj.ucw.cz/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
Linux vs. Windows is a no-WIN situation.

2007-01-11 13:26:06

by Bodo Eggert

Subject: Re: O_DIRECT question

Nick Piggin <[email protected]> wrote:
> Aubrey wrote:
>> On 1/11/07, Nick Piggin <[email protected]> wrote:

>>> What you _really_ want to do is avoid large mallocs after boot, or use
>>> a CPU with an mmu. I don't think nommu linux was ever intended to be a
>>> simple drop in replacement for a normal unix kernel.
>>
>>
>> Is there a position available working on mmu CPU? Joking, :)
>> Yes, some problems are serious on nommu linux. But I think we should
>> try to fix them not avoid them.
>
> Exactly, and the *real* fix is to modify userspace not to make > PAGE_SIZE
> mallocs[*] if it is to be nommu friendly.

IMO it's better to go back to a 16-bit segmented system like 80286 than
to artificially limit yourself to 12 bit memory chunks. Even if you don't
have segments, offering a DOS or old MacOS-like memory management
(allocating a fixed block on program start) is way better than "We want
to cache, no matter what it costs".

If you throw away the cache, maybe you'll be slow, but if you throw away
the application, you'll go backwards. If the cache is a problem, allocate
one block of cache, and everybody will be happy. Maybe it's old school,
but it works.

> It is the kernel hacks to do things
> like limit cache size that are the bandaids.

Limiting the cache is a feature, since it avoids constantly swapping out
e.g. X11's keyboard/mouse routines just because you opened a large picture
in gimp*. Playing with the provided knobs did help in some cases, but I
didn't succeed for the serious cases, and I'm not the world's dumbest
computer user. Having a simple knob -- "don't grow larger than $num if you
have to evict programs, don't go below $num2" -- would be THE knob any joe
luser can understand, and it would mostly DTRT, or at least do what you'd
expect.



* I hope the vm_pps work will help, but I have not yet read its docs

2007-01-11 14:14:57

by Jens Axboe

Subject: Re: O_DIRECT question

On Thu, Jan 11 2007, linux-os (Dick Johnson) wrote:
>
> On Wed, 10 Jan 2007, Aubrey wrote:
>
> > Hi all,
> >
> > Opening file with O_DIRECT flag can do the un-buffered read/write access.
> > So if I need un-buffered access, I have to change all of my
> > applications to add this flag. What's more, Some scripts like "cp
> > oldfile newfile" still use pagecache and buffer.
> > Now, my question is, is there a existing way to mount a filesystem
> > with O_DIRECT flag? so that I don't need to change anything in my
> > system. If there is no option so far, What is the right way to achieve
> > my purpose?
> >
> > Thanks a lot.
> > -Aubrey
> > -
>
> I don't think O_DIRECT ever did what a lot of folks expect, i.e.,
> write this buffer of data to the physical device _now_. All I/O
> ends up being buffered. The `man` page states that the I/O will
> be synchronous, that at the conclusion of the call, data will have
> been transferred. However, the data written probably will not be
> in the physical device, perhaps only in a DMA-able buffer with
> a promise to get it to the SCSI device, soon.

Thanks for guessing, but O_DIRECT data is with the physical drive once
the call returns. Whether it's in the drive cache or on the drive platters
is a different story, but from the OS point of view, it's definitely with
the drive.

--
Jens Axboe

2007-01-11 15:51:06

by Linus Torvalds

Subject: Re: O_DIRECT question



On Thu, 11 Jan 2007, Nick Piggin wrote:
>
> Speaking of which, why did we obsolete raw devices? And/or why not just
> go with a minimal O_DIRECT on block device support? Not a rhetorical
> question -- I wasn't involved in the discussions when they happened, so
> I would be interested.

Lots of people want to put their databases in a file. Partitions really
weren't nearly flexible enough. So the whole raw device or O_DIRECT just
to the block device thing isn't really helping any.

> O_DIRECT is still crazily racy versus pagecache operations.

Yes. O_DIRECT is really fundamentally broken. There's just no way to fix
it sanely. Except by teaching people not to use it, and making the normal
paths fast enough (and that _includes_ doing things like dropping caches
more aggressively, but it probably would include more work on the device
queue merging stuff etc etc).

The "good" news is that CPU really is outperforming disk more and more, so
the extra cost of managing the page cache keeps on getting smaller and
smaller, and (fingers crossed) some day we can hopefully just drop
O_DIRECT and nobody will care.

Linus

2007-01-11 15:53:35

by Phillip Susi

Subject: Re: O_DIRECT question

Viktor wrote:
> OK, madvise() used with mmap'ed file allows to have reads from a file
> with zero-copy between kernel/user buffers and don't pollute cache
> memory unnecessarily. But how about writes? How is to do zero-copy
> writes to a file and don't pollute cache memory without using O_DIRECT?
> Do I miss the appropriate interface?

Not only that but mmap/madvise do not allow you to perform async io.
You still have to just try and touch the memory and take the page fault
to cause a read. This is unacceptable to an application that is trying
to manage multiple IO streams and keep the pipelines full; it needs to
block only when there is no more work it can do.

Even with only a single io stream the application needs to keep one side
reading and one side writing. To make this work without O_DIRECT would
require a method to asynchronously fault pages in, and the kernel would
have to utilize that method when processing aio writes so as not to
block the calling process if the buffer it asked to write is not in
memory.
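
(For context, the pattern being described looks roughly like the POSIX AIO
sketch below: queue a read and keep working instead of blocking in a page
fault. The fd/buffer setup is assumed, and note that glibc's POSIX AIO is
implemented with user-space threads rather than the kernel aio path; link
with -lrt.)

#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Queue an asynchronous read and poll for completion while doing other
 * work.  With O_DIRECT the buffer, offset and length would also need to
 * be block-aligned. */
ssize_t queue_and_wait(int fd, void *buf, size_t len, off_t off)
{
	struct aiocb cb;

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf = buf;
	cb.aio_nbytes = len;
	cb.aio_offset = off;

	if (aio_read(&cb) < 0)			/* queue the read, return at once */
		return -1;

	while (aio_error(&cb) == EINPROGRESS) {
		/* ... keep the other side of the pipeline busy ... */
	}
	return aio_return(&cb);			/* bytes read, or -1 on error */
}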

I was rather disappointed years ago by NT's lack of an async page fault
path, which prevented me from developing an ftp server that could do
zero copy IO, but still use the filesystem cache and avoid NT's version
of O_DIRECT.


2007-01-11 16:10:57

by Badari Pulavarty

Subject: Re: O_DIRECT question

On Wed, 2007-01-10 at 20:51 -0800, Andrew Morton wrote:
> On Thu, 11 Jan 2007 10:57:06 +0800
> Aubrey <[email protected]> wrote:
>
> > Hi all,
> >
> > Opening file with O_DIRECT flag can do the un-buffered read/write access.
> > So if I need un-buffered access, I have to change all of my
> > applications to add this flag. What's more, Some scripts like "cp
> > oldfile newfile" still use pagecache and buffer.
> > Now, my question is, is there a existing way to mount a filesystem
> > with O_DIRECT flag? so that I don't need to change anything in my
> > system. If there is no option so far, What is the right way to achieve
> > my purpose?
>
> Not possible, basically.
>
> O_DIRECT reads and writes must be aligned to the device's block size
> (usually 512 bytes) in memory addresses, file offsets and read/write request
> sizes. Very few applications will bother to do that and will hence fail if
> their files are automagically opened with O_DIRECT.

I worked on patches to take away the 512-byte restriction for memory
addresses a while ago - but had to use 4-byte alignment for some
drivers to make it work. I gave up since there was no way for me
to methodically prove that it would work on all drivers :(

An O_DIRECT mount option is the one our software group people keep
asking for (since they use it on Solaris) - we keep telling them
NO !! Their basic complaint *was* the use/population of the pagecache
by other applications (tar, ftp, scp, backup) in the system,
causing performance degradation for their application. But again,
2.6.x has gotten a lot better, and we have hundreds of tunables
to control various behaviours, so the problem can *theoretically*
be worked around with these tunables.

Thanks,
Badari

2007-01-11 16:19:48

by Aubrey Li

Subject: Re: O_DIRECT question

On 1/11/07, Linus Torvalds <[email protected]> wrote:
>
> The "good" news is that CPU really is outperforming disk more and more, so
> the extra cost of managing the page cache keeps on getting smaller and
> smaller, and (fingers crossed) some day we can hopefully just drop
> O_DIRECT and nobody will care.
>
> Linus
>
Yes for desktop and server, but maybe not for embedded systems, especially
for no-MMU Linux. In many embedded cases the whole system is running in
RAM, including the file system, so the page cache isn't necessary anymore.
The page cache can't improve performance in these cases; it only fragments
memory.
Maybe O_DIRECT is not the right way to fix this issue, but I think the
file system needs an option for un-buffered access, meaning it doesn't
use the page cache at all.

-Aubrey

P.S. The following is the test case and the crash info; I think it shows
exactly what I encountered.
------------------------------------
#include <stdio.h>
#include <stdlib.h>
#define N 8

int main(void)
{
	void *p[N];
	int i;

	printf("Alloc %d MB !\n", N);

	for (i = 0; i < N; i++) {
		p[i] = malloc(1024 * 1024);
		if (p[i] == NULL)
			printf("alloc failed\n");
	}

	printf("alloc successful \n");
	for (i = 0; i < N; i++)
		free(p[i]);
}
--------------------------------------------------------------

When there is not enough free memory to allocate:
==============================
root:/mnt> cat /proc/meminfo
MemTotal: 54196 kB
MemFree: 5520 kB <== only 5M free
Buffers: 76 kB
Cached: 44696 kB <== cache eat 40MB
SwapCached: 0 kB
Active: 21092 kB
Inactive: 23680 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 54196 kB
LowFree: 5520 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 0 kB
Mapped: 0 kB
Slab: 3720 kB
PageTables: 0 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 27096 kB
Committed_AS: 0 kB
VmallocTotal: 0 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
==========================================


I got a failure after running the test program.
---------------------------------------
root:/mnt> ./t
Alloc 8 MB !
t: page allocation failure. order:9, mode:0x40d0
Hardware Trace:
0 Target : <0x00004de0> { _dump_stack + 0x0 }
Source : <0x0003054a> { ___alloc_pages + 0x17e }
1 Target : <0x0003054a> { ___alloc_pages + 0x17e }
Source : <0x0000dbc2> { _printk + 0x16 }
2 Target : <0x0000dbbe> { _printk + 0x12 }
Source : <0x0000da4e> { _vprintk + 0x1a2 }
3 Target : <0x0000da42> { _vprintk + 0x196 }
Source : <0xffa001ea> { __common_int_entry + 0xd8 }
4 Target : <0xffa00188> { __common_int_entry + 0x76 }
Source : <0x000089bc> { _return_from_int + 0x58 }
5 Target : <0x000089bc> { _return_from_int + 0x58 }
Source : <0x00008992> { _return_from_int + 0x2e }
6 Target : <0x00008964> { _return_from_int + 0x0 }
Source : <0xffa00184> { __common_int_entry + 0x72 }
7 Target : <0xffa00182> { __common_int_entry + 0x70 }
Source : <0x00012682> { __local_bh_enable + 0x56 }
8 Target : <0x0001266c> { __local_bh_enable + 0x40 }
Source : <0x0001265c> { __local_bh_enable + 0x30 }
9 Target : <0x00012654> { __local_bh_enable + 0x28 }
Source : <0x00012644> { __local_bh_enable + 0x18 }
10 Target : <0x0001262c> { __local_bh_enable + 0x0 }
Source : <0x000128e0> { ___do_softirq + 0x94 }
11 Target : <0x000128d8> { ___do_softirq + 0x8c }
Source : <0x000128b8> { ___do_softirq + 0x6c }
12 Target : <0x000128aa> { ___do_softirq + 0x5e }
Source : <0x0001666a> { _run_timer_softirq + 0x82 }
13 Target : <0x000165fc> { _run_timer_softirq + 0x14 }
Source : <0x00023eb8> { _hrtimer_run_queues + 0xe8 }
14 Target : <0x00023ea6> { _hrtimer_run_queues + 0xd6 }
Source : <0x00023e70> { _hrtimer_run_queues + 0xa0 }
15 Target : <0x00023e68> { _hrtimer_run_queues + 0x98 }
Source : <0x00023eae> { _hrtimer_run_queues + 0xde }
Stack from 015a7dcc:
00000001 0003054e 00000000 00000001 000040d0 0013c70c 00000009 000040d0
00000000 00000080 00000000 000240d0 00000000 015a6000 015a6000 015a6000
00000010 00000000 00000001 00036e12 00000000 0023f8e0 00000073 00191e40
00000020 0023e9a0 000040d0 015afea9 015afe94 00101fff 000040d0 0023e9a0
00000010 00101fff 000370de 00000000 0363d3e0 00000073 0000ffff 04000021
00000000 00101000 00187af0 00035b44 00000000 00035e40 00000000 00000000
Call Trace:
Call Trace:
[<0000fffe>] _do_exit+0x12e/0x7cc
[<00004118>] _sys_mmap+0x54/0x98
[<00101000>] _fib_create_info+0x670/0x780
[<00008828>] _system_call+0x68/0xba
[<000040c4>] _sys_mmap+0x0/0x98
[<0000fffe>] _do_exit+0x12e/0x7cc
[<00008000>] _cplb_mgr+0x8/0x2e8
[<00101000>] _fib_create_info+0x670/0x780
[<00101000>] _fib_create_info+0x670/0x780

Mem-info:
DMA per-cpu:
cpu 0 hot: high 18, batch 3 used:5
cpu 0 cold: high 6, batch 1 used:5
DMA32 per-cpu: empty
Normal per-cpu: empty
HighMem per-cpu: empty
Free pages: 21028kB (0kB HighMem)
Active:2549 inactive:3856 dirty:0 writeback:0 unstable:0 free:5257
slab:1833 mapped:0 pagetables:0
DMA free:21028kB min:948kB low:1184kB high:1420kB active:10196kB
inactive:15424kB present:56320kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB
present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB
present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
HighMem free:0kB min:128kB low:128kB high:128kB active:0kB
inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 43*4kB 35*8kB 28*16kB 17*32kB 18*64kB 20*128kB 16*256kB 11*512kB
6*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB = 21028kB
DMA32: empty
Normal: empty
HighMem: empty
14080 pages of RAM
5285 free pages
531 reserved pages
11 pages shared
0 pages swap cached
Allocation of length 1052672 from process 57 failed
DMA per-cpu:
cpu 0 hot: high 18, batch 3 used:5
cpu 0 cold: high 6, batch 1 used:5
DMA32 per-cpu: empty
Normal per-cpu: empty
HighMem per-cpu: empty
Free pages: 21028kB (0kB HighMem)
Active:2549 inactive:3856 dirty:0 writeback:0 unstable:0 free:5257
slab:1833 mapped:0 pagetables:0
DMA free:21028kB min:948kB low:1184kB high:1420kB active:10196kB
inactive:15424kB present:56320kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB
present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB
present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
HighMem free:0kB min:128kB low:128kB high:128kB active:0kB
inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 43*4kB 35*8kB 28*16kB 17*32kB 18*64kB 20*128kB 16*256kB 11*512kB
6*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB = 21028kB
DMA32: empty
Normal: empty
HighMem: empty
-----------------------------

If there were no page cache, I would have another 40MB to run the test
program. I'm pretty sure the program would work properly the first time.

2007-01-11 16:20:59

by Linus Torvalds

Subject: Re: O_DIRECT question



On Thu, 11 Jan 2007, Viktor wrote:
>
> OK, madvise() used with mmap'ed file allows to have reads from a file
> with zero-copy between kernel/user buffers and don't pollute cache
> memory unnecessarily. But how about writes? How is to do zero-copy
> writes to a file and don't pollute cache memory without using O_DIRECT?
> Do I miss the appropriate interface?

mmap()+msync() can do that too.
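
A rough sketch of that write path (assuming the file already has its final
size, e.g. via ftruncate(), and using posix_fadvise() afterwards as the
"drop the cache" hint; names and sizes are placeholders):

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write by storing straight into a shared mapping -- the stores go into
 * the page cache pages themselves, so there is no separate user buffer
 * to copy on write-out.  Assumes the file is already at least len bytes. */
int mmap_write(const char *path, size_t len)
{
	int fd = open(path, O_RDWR);
	void *p;

	if (fd < 0)
		return -1;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}
	memset(p, 0x5a, len);			/* generate the data in place */
	msync(p, len, MS_SYNC);			/* push the dirty pages to disk */
	posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED);	/* then drop the clean cache */
	munmap(p, len);
	close(fd);
	return 0;
}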

Also, regular user-space page-aligned data could easily just be moved into
the page cache. We actually have a lot of the infrastructure for it. See
the "splice()" system call. It's just not very widely used, and the
"drop-behind" behaviour (to then release the data) isn't there. And I bet
that there's lots of work needed to make it work well in practice, but
from a conceptual standpoint the O_DIRECT method really is just about the
*worst* way to do things.
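
For reference, the splice()/vmsplice() path looks roughly like the sketch
below (Linux 2.6.17+; purely illustrative -- no drop-behind, no handling of
short transfers, and the buffer is assumed to fit in the pipe):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hand a page-aligned user buffer to a pipe with vmsplice(), then splice()
 * the pipe contents into a file, so the data reaches the page cache without
 * an ordinary write() of a user buffer. */
int splice_write(int filefd, void *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	int pfd[2];

	if (pipe(pfd) < 0)
		return -1;
	if (vmsplice(pfd[1], &iov, 1, 0) < 0)			/* user pages -> pipe */
		goto err;
	if (splice(pfd[0], NULL, filefd, NULL, len, SPLICE_F_MOVE) < 0)
		goto err;					/* pipe -> page cache */
	close(pfd[0]);
	close(pfd[1]);
	return 0;
err:
	close(pfd[0]);
	close(pfd[1]);
	return -1;
}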

O_DIRECT is "simple" in the sense that it basically is a "OS: please just
get out of the way" method. It's why database people like it, and it's why
it has gotten implemented in many operating systems: it *looks* like a
simple interface.

But deep down, O_DIRECT is anything but simple. Trying to do a direct
access with an interface that really isn't designed for it (write()
_fundamentally_ has semantics that do not fit the problem in that you're
supposed to be able to re-use the buffer immediately afterwards in user
space, just as an example) is wrong in the first place, but the really
subtle problems come when you realize that you can't really just "bypass"
the OS.

As a very concrete example: people *think* that they can just bypass the
OS IO layers and just do the write directly. It *sounds* like something
simple and obvious. It sounds like a total no-brainer. Which is exactly
what it is, if by "no-brainer" you mean "only a person without a brain
will do it". Because by-passing the OS has all these subtle effects on
both security and on fundamental correctness.

The whole _point_ of an OS is to be a "resource manager", to make sure
that people cannot walk all over each other, and to be the central point
that makes sure that different people doing allocations and deallocations
don't get confused. In the specific case of a filesystem, it's "trivial"
things like serializing IO, allocating new blocks on the disk, and making
sure that nobody will see the half-way state when the dirty blocks haven't
been written out yet.

O_DIRECT - by bypassing the "real" kernel - very fundamentally breaks the
whole _point_ of the kernel. There's tons of races where an O_DIRECT user
(or other users that expect to see the O_DIRECT data) will now see the
wrong data - including seeing uninitialized portions of the disk etc etc.

In short, the whole "let's bypass the OS" notion is just fundamentally
broken. It sounds simple, but it sounds simple only to an idiot who writes
databases and doesn't even UNDERSTAND what an OS is meant to do. For some
reason, db people think that they don't need one, and don't ever seem to
really understand the concept of "security" and "correctness". They
understand it (sometimes) _within_ their own database, but seem to have a
really hard time seeing past their own sandbox.

Some of the O_DIRECT breakage could probably be fixed:

- An O_DIRECT operation must never allocate new blocks on the disk. It's
  fundamentally broken. If you *cannot* write new blocks, and can only
  read and re-write previous allocations, things are much easier, and a
  lot of the races go away.

  This is probably _perfectly_ fine for the users (namely databases).
  People who do O_DIRECT really expect to see a "raw disk image", but
  they (exactly _because_ they expect a raw disk image) are perfectly
  happy to "set up" that image beforehand.

- An O_DIRECT operation must never race with any metadata operation,
  most notably truncate(), but also any file extension operation like a
  normal write() that extends the size of the file.

  This should be reasonably easy to do. Any O_DIRECT operation would just
  take the inode->i_mutex for reading. HOWEVER. Right now it's a mutex,
  not a read-write semaphore, so that is actually pretty painful. But it
  would be fairly simple.

With those two rules, a lot of the complexity of the nasty side effects of
O_DIRECT that the db people obviously never even thought about would go
away. We'd still have to have some way to synchronize the page cache, but
it could be as simple as having an O_DIRECT open simply _flush_ the whole
page cache, and set some flag saying "can't do normal opens, we're
exclusively open for O_DIRECT".

I dunno. A lot of filesystems don't want to (or can't) actually do a
"write in place" ANYWAY (writes happen through the log, and hit the "real
filesystem" part of the disk later), and O_DIRECT really only makes sense
if you do the write in place, so the above rules would help make that
obvious too - O_DIRECT really is a totally different thing from a normal
IO, and an O_DIRECT write() or read() really has *nothing* to do with a
regular write() or read() system call.

Overloading a totally different operation with a flag is a bad idea, which
is one reason I really hate O_DIRECT. It's just doing things badly on so
many levels.

Linus

2007-01-11 16:23:54

by bert hubert

Subject: Re: O_DIRECT question

On Thu, Jan 11, 2007 at 07:50:26AM -0800, Linus Torvalds wrote:
> Yes. O_DIRECT is really fundamentally broken. There's just no way to fix
> it sanely. Except by teaching people not to use it, and making the normal

Does this mean that it will eat data today? Or that it is broken because it
requires heaps of work on the kernel side?

If it will eat data, when? What are the issues, cache coherency?

I understand what you say about O_DIRECT, but considering that it is seeing
use today, it would be good to know the extent of the practical problems.

Thanks.

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

2007-01-11 16:45:41

by Linus Torvalds

Subject: Re: O_DIRECT question



On Thu, 11 Jan 2007, Roy Huang wrote:
>
> On a embedded systerm, limiting page cache can relieve memory
> fragmentation. There is a patch against 2.6.19, which limit every
> opened file page cache and total pagecache. When the limit reach, it
> will release the page cache overrun the limit.

I do think that something like this is probably a good idea, even on
non-embedded setups. We historically couldn't do this, because mapped
pages were too damn hard to remove, but that's obviously not much of a
problem any more.

However, the page-cache limit should NOT be some compile-time constant. It
should work the same way the "dirty page" limit works, and probably just
default to "feel free to use 90% of memory for page cache".

Linus

2007-01-11 16:53:56

by Xavier Bestel

[permalink] [raw]
Subject: Re: O_DIRECT question

On Thursday 11 January 2007 at 07:50 -0800, Linus Torvalds wrote:
> > O_DIRECT is still crazily racy versus pagecache operations.
>
> Yes. O_DIRECT is really fundamentally broken. There's just no way to fix
> it sanely.

How about aliasing O_DIRECT to POSIX_FADV_NOREUSE (sortof) ?

Xav


2007-01-11 17:05:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: O_DIRECT question



On Thu, 11 Jan 2007, Xavier Bestel wrote:

> On Thursday 11 January 2007 at 07:50 -0800, Linus Torvalds wrote:
> > > O_DIRECT is still crazily racy versus pagecache operations.
> >
> > Yes. O_DIRECT is really fundamentally broken. There's just no way to fix
> > it sanely.
>
> How about aliasing O_DIRECT to POSIX_FADV_NOREUSE (sortof) ?

That is what I think some users could do. If the main issue with O_DIRECT
is the page cache allocations, if we instead had better (read: "any")
support for POSIX_FADV_NOREUSE, one class of reasons O_DIRECT usage would
just go away.
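
[For concreteness, a minimal userspace sketch of the fadvise-style approach being discussed, in plain C. It is illustrative only: POSIX_FADV_NOREUSE is the hint that would matter here (and, as noted elsewhere in the thread, it is currently a no-op), so POSIX_FADV_DONTNEED after the pass is shown as the closest practical hint today.]

#define _XOPEN_SOURCE 600      /* for posix_fadvise() */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Hint up front that the data will be used only once. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);

    char buf[64 * 1024];
    off_t done = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        done += n;

    /* After the pass, ask the kernel to drop the cached pages.
       This is only a hint; it cannot force the pages out. */
    posix_fadvise(fd, 0, done, POSIX_FADV_DONTNEED);
    close(fd);
    return 0;
}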

See also the patch that Roy Huang posted about another approach to the
same problem: just limiting page cache usage explicitly.

That's not the _only_ issue with O_DIRECT, though. It's one big one, but
people like to think that the memory copy makes a difference when you do
IO too (I think it's likely pretty debatable in real life, but I'm totally
certain you can benchmark it, probably even pretty easily especially if
you have fairly studly IO capabilities and a CPU that isn't quite as
studly).

So POSIX_FADV_NOREUSE kind of support is one _part_ of the O_DIRECT
picture, and depending on your problems (in this case, the embedded world)
it may even be the *biggest* part. But it's not the whole picture.

Linus

2007-01-11 17:13:54

by Michael Tokarev

[permalink] [raw]
Subject: Re: O_DIRECT question

Linus Torvalds wrote:
>
> On Thu, 11 Jan 2007, Viktor wrote:
>> OK, madvise() used with mmap'ed file allows to have reads from a file
>> with zero-copy between kernel/user buffers and don't pollute cache
>> memory unnecessarily. But how about writes? How is to do zero-copy
>> writes to a file and don't pollute cache memory without using O_DIRECT?
>> Do I miss the appropriate interface?
>
> mmap()+msync() can do that too.

It can, somehow... until there's an I/O error. And *that* is just terrible.

Granted, I didn't check 2.6.x kernels, especially the latest ones. But
in 2.4, if an I/O space behind mmap becomes unREADable, the process gets
stuck in some unkillable state forever. I don't know what happens with
write errors, but that behaviour with read errors is just unacceptable.

Sure, it's not something like posix_madvise() (which is for reads anyway,
not writes). But I'd very strongly disagree about using mmap for
anything more-or-less serious, because of, umm... difficulties with
error recovery (if error recovery is possible at all).

Note also that anything but O_DIRECT isn't... portable. O_DIRECT, with
all its shortcomings and ugliness, works, and works on quite... some
systems. Having something else, especially with very different usage,
would be somewhat problematic -- I mean, the whole I/O subsystem in an
application would have to be redesigned and rewritten in order to use
that advanced (or just "right") mechanism. O_DIRECT is no different
from basic read()/write() -- just one extra bit at open() time -- and
all your code, which evolved over years and got years of testing too,
just works, at least in theory, as long as the O_DIRECT interface is
working (ok, ok, I know about the alignment issues, but those are also
handled easily). *Unless* there's a very noticeable gain, switching
isn't worth it.

From my experience with databases (mostly Oracle, and some Postgres
and MySQL), O_DIRECT has a *dramatic* impact on performance. If you don't
use O_DIRECT, you lose a lot. O_DIRECT is *already* the fastest way
possible, I think - for example, it gives maximum speed when writing to
or reading from a raw device (/dev/sdb etc). I don't think there's a
way to improve on that performance... Yes, there ARE, it seems, some
possible improvements in other areas - like utilizing write barriers,
for example, which isn't quite possible from userspace right now. But as
long as O_DIRECT actually writes the data before returning from the
write() call (as seems to be the case at least with a normal filesystem
on a real block device - I won't touch corner cases like NFS here), it's
pretty much THE ideal solution, at least from the application
(developer) standpoint.

By the way, ext[23]fs is terribly slow with O_DIRECT writes - it gives
about 1/4 of the speed of the raw device when multiple concurrent direct
readers and writers are running. XFS gives full raw-device speed here.
I think that MAY be related to locking issues in ext[23], but I don't
know for sure.

And another "btw" - when creating files, O_DIRECT is quite a killer - each
write takes alot more time than "necessary". But once a file has been
written, re-writes are pretty much fast.

Also - and it's quite... funny (to me at least). Being curious, I compared
the write speed (random small-block I/O scattered all around the disk) of
modern disk drives with and without write cache (the WCE=[01] bit in the
SCSI "Cache control" page of every disk drive). The fun part: with the
write cache turned on, the actual speed is LOWER than without the cache.
I don't remember the exact numbers, something like 120mb/sec vs 90mb/sec.
And I think it's to be expected, as well - at first all the writes go to
the cache, but since the data stream goes on and on, the cache fills up
quickly, and in order to accept the next data, the drive has to free
some space in its cache. So instead of just doing its work, it spends
its time bouncing data to/from the cache...

Sure it's not about linux pagecache or something like that, but it's
still somehow related. :)

[]
> O_DIRECT - by bypassing the "real" kernel - very fundamentally breaks the
> whole _point_ of the kernel. There's tons of races where an O_DIRECT user
> (or other users that expect to see the O_DIRECT data) will now see the
> wrong data - including seeing uninitialized portions of the disk etc etc.

Huh? Well, I plug a shiny new hard disk into my computer and do an O_DIRECT
read of it - will I see uninitialized data? Sure I will (well, in most cases
the whole disk is filled with zeros anyway, so it isn't really uninitialized).
The same applies to a regular read, too.

If what you're saying applies to an O_DIRECT read of a file on a filesystem --
well, that's definitely a kernel bug. It should either not allow reading
that last part which isn't a whole sector, if the file size isn't
sector-aligned, -- or it should ensure the "extra" data is initialized.
Yes, that's difficult to implement in the kernel. But that's no excuse
not to do it. AND I think just failing the read is exactly the way to go
here.

What about "seeing wrong data" ? Where's that race? Do you mean the case
when one application writes to disk while the other is reading it, so that
it's not obvious which data will be read, the old one or the new one? If
it's the case, just don't worry about that - the same happens with any
variable access in multi-threaded application for example (that's why
locks - mutexes etc - are here). For most serious users of O_DIRECT,
this is no problem at all - for example, Oracle implements its own cache
manager, and all reads and writes goes so that the cache knows what's
going on, which data is being read or written at a given moment and so
on - if it's important anyway.

> In short, the whole "let's bypass the OS" notion is just fundamentally
> broken. It sounds simple, but it sounds simple only to an idiot who writes
> databases and doesn't even UNDERSTAND what an OS is meant to do. For some
> reasons, db people think that they don't need one, and don't ever seem to
> really understand the concept of "security" and "correctness". They
> understand it (sometimes) _within_ their own database, but seem to have a
> really hard time seeing past their own sandbox.
>
> Some of the O_DIRECT breakage could probably be fixed:
>
> - An O_DIRECT operation must never allocate new blocks on the disk. It's
> fundamentally broken. If you *cannot* write new blocks, and can only
> read and re-write previous allocations, things are much easier, and a
> lot of the races go away.

Not only races, but *terrible* speed too ;) At least on some filesystems.

> This is probably _perfectly_ fine for the users (namely databases).
> People who do O_DIRECT really expect to see a "raw disk image", but
> they (exactly _because_ they expect a raw disk image) are perfectly
> happy to "set up" that image beforehand.

Well, *right now* O_DIRECT is useful (despite the terrible performance
for new files mentioned above) for things like copying large files and
directories around. If I'm copying a directory tree which doesn't fit in
RAM, the whole pagecache gets trashed by the pressure from the copy
process. During that time, the system is just unresponsive, read:
unusable. When I modify `cp' to use O_DIRECT for everything, the copy
runs in the background and everything else just works as if there were
no copy running.
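
[A rough sketch of what such an O_DIRECT copy loop looks like, assuming Linux and a file whose size is a multiple of the block size; a real tool would have to handle the final partial block separately (e.g. fall back to buffered I/O for the tail), which is omitted here.]

#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLK (512 * 1024)       /* I/O size, a multiple of the sector size */

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s src dst\n", argv[0]); return 1; }

    int in  = open(argv[1], O_RDONLY | O_DIRECT);
    int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (in < 0 || out < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, BLK)) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    ssize_t n;
    while ((n = read(in, buf, BLK)) > 0) {
        /* O_DIRECT transfers must stay sector-aligned in length; a short
           tail would have to be written without O_DIRECT instead. */
        if (write(out, buf, n) != n) { perror("write"); return 1; }
    }
    if (n < 0) perror("read");

    free(buf);
    close(in);
    close(out);
    return 0;
}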

> - An O_DIRECT operation must never race with any metadata operation,
> most notably truncate(), but also any file extension operation like a
> normal write() that extends the size of the file.
>
> This should be reasonably easy to do. Any O_DIRECT operation would just
> take the inode->i_mutex for reading. HOWEVER. Right now it's a mutex,
> not a read-write semaphore, so that is actually pretty painful. But it
> would be fairly simple.

Won't that be a reason for a slowdown? I'm not a kernel hacker - I don't
know, for example, what the difference between a mutex and a semaphore is... ;)

I once tried to measure concurrent read/write operations against a single
file on FreeBSD - it just DoesNotScale, exactly - I think - due to some
locking, as it tries to make reads and writes "atomic" (see above - I think
it's due to that "reading data which is being written by another process"
thing). Linux works very well from this standpoint. So I wonder whether,
by introducing such locking, we'll introduce the same "DoesNotScale"
behaviour...

BUT - the same rules can be applied to writing, too - I mean, taking some
mutex/whatever which protects against - say - a concurrent ftruncate() or
whatever...

But I can come up with an even simpler solution, which MIGHT be acceptable.
Just disallow any - at least write - access to a file which is open in
O_DIRECT mode, IF that other access isn't ALSO done with the O_DIRECT flag.
I.e., don't allow open(O_TRUNC), ftruncate(), maybe even write(), if another
process has the file open with O_DIRECT for writing.

> With those two rules, a lot of the complexity of the nasty side effects of
> O_DIRECT that the db people obviously never even thought about would go
> away. We'd still have to have some way to synchronize the page cache, but
> it could be as simple as having an O_DIRECT open simply _flush_ the whole
> page cache, and set some flag saying "can't do normal opens, we're
> exclusively open for O_DIRECT".

Yup, like this. But one comment still: normal (non-DIRECT) reads should
be allowed. Needs more thinking... The reason is: with that damn Oracle,
I can do online backups of tablespaces or the whole database by saying
"alter tablespace foo begin backup;",
backing up the files, and saying "...end backup;". I'm not sure whether,
during those 'alter tablespace' commands, Oracle re-opens the files
read-only and then back read-write. It will not do any writes - that's
for sure - but I don't know if it will reopen them r/o.

In any case... mixing direct and non-direct I/O is just not supported; that
is, no promises about "consistency" or "atomicity" of reads vs writes etc.
(and sure thing - when you're backing up a file which is being modified,
you've screwed yourself anyway - it's an operator error, there's nothing
an OS can do - unless the operator uses some snapshot mechanism...)

> I dunno. A lot of filesystems don't want to (or can't) actually do a
> "write in place" ANYWAY (writes happen through the log, and hit the "real
> filesystem" part of the disk later), and O_DIRECT really only makes sense
> if you do the write in place, so the above rules would help make that
> obvious too - O_DIRECT really is a totally different thing from a normal
> IO, and an O_DIRECT write() or read() really has *nothing* to do with a
> regular write() or read() system call.
>
> Overloading a totally different operation with a flag is a bad idea, which
> is one reason I really hate O_DIRECT. It's just doing things badly on so
> many levels.
>
> Linus

/mjt

2007-01-11 17:33:09

by Alan

[permalink] [raw]
Subject: Re: O_DIRECT question

> space, just as an example) is wrong in the first place, but the really
> subtle problems come when you realize that you can't really just "bypass"
> the OS.

Well, you can - it's called SG_IO, and that really does get the OS out of
the way. O_DIRECT gets crazy when you stop using it on devices directly
and use it on files.

You do need some way to avoid the copy cost of caches and get data directly
to user space. It also needs to be a way that works without MMU tricks,
because many of the platforms that need it are embedded ones.
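
[For reference, a minimal sketch of the SG_IO path mentioned above: a SCSI INQUIRY sent straight to a device through the sg driver. The device node and buffer sizes are only examples.]

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/sg0", O_RDWR);       /* example device node */
    if (fd < 0) { perror("open"); return 1; }

    unsigned char cdb[6]    = { 0x12, 0, 0, 0, 96, 0 };  /* SCSI INQUIRY */
    unsigned char resp[96]  = { 0 };
    unsigned char sense[32] = { 0 };

    struct sg_io_hdr io;
    memset(&io, 0, sizeof(io));
    io.interface_id    = 'S';
    io.dxfer_direction = SG_DXFER_FROM_DEV;  /* data comes from the device */
    io.cmd_len         = sizeof(cdb);
    io.cmdp            = cdb;
    io.dxfer_len       = sizeof(resp);
    io.dxferp          = resp;
    io.mx_sb_len       = sizeof(sense);
    io.sbp             = sense;
    io.timeout         = 5000;               /* milliseconds */

    if (ioctl(fd, SG_IO, &io) < 0) { perror("SG_IO"); return 1; }

    /* Bytes 8-31 of the INQUIRY response hold the vendor and product id. */
    printf("%.28s\n", resp + 8);
    close(fd);
    return 0;
}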

Alan

2007-01-11 18:01:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: O_DIRECT question



On Thu, 11 Jan 2007, Alan wrote:
>
> Well you can - its called SG_IO and that really does get the OS out of
> the way. O_DIRECT gets crazy when you stop using it on devices directly
> and use it on files

Well, on a raw disk, O_DIRECT is fine too, but yeah, you might as well
use SG_IO at that point. All of my issues are all about filesystems.

And filesystems are where people use O_DIRECT most. Almost nobody puts
their database on a partition of its own these days, afaik. Perhaps for
benchmarking or some really high-end stuff. Not "normal users".

Linus

2007-01-11 18:41:36

by Trond Myklebust

[permalink] [raw]
Subject: Re: O_DIRECT question

On Thu, 2007-01-11 at 09:04 -0800, Linus Torvalds wrote:
> That is what I think some users could do. If the main issue with O_DIRECT
> is the page cache allocations, if we instead had better (read: "any")
> support for POSIX_FADV_NOREUSE, one class of reasons O_DIRECT usage would
> just go away.

For NFS, the main feature of interest when it comes to O_DIRECT is
strictly uncached I/O. Replacing it with POSIX_FADV_NOREUSE won't help
because it can't guarantee that the page will be thrown out of the page
cache before some second process tries to read it. That is particularly
true if some dopey third party process has mmapped the file.

Trond

2007-01-11 19:01:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: O_DIRECT question



On Thu, 11 Jan 2007, Trond Myklebust wrote:
>
> For NFS, the main feature of interest when it comes to O_DIRECT is
> strictly uncached I/O. Replacing it with POSIX_FADV_NOREUSE won't help
> because it can't guarantee that the page will be thrown out of the page
> cache before some second process tries to read it. That is particularly
> true if some dopey third party process has mmapped the file.

You'd still be MUCH better off using the page cache, and just forcing the
IO (but _with_ all the page cache synchronization still active). Which is
trivial to do on the filesystem level, especially for something like NFS.

If you bypass the page cache, you just make that "dopey third party
process" problem worse. You now _guarantee_ that there are aliases with
different data.

Of course, with NFS, the _server_ will resolve any aliases anyway, so at
least you don't get file corruption, but you can get some really strange
things (like the write of one process actually happening before, but being
flushed _after_ and overriding the later write of the O_DIRECT process).

And sure, the filesystem can have its own alias avoidance too (by just
probing the page cache all the time), but the fundamental fact remains:
the problem is that O_DIRECT as a page-cache-bypassing mechanism is
BROKEN.

If you have issues with caching (but still have to allow it for other
things), the way to fix them is not to make uncached accesses, it's to
force the cache to be serialized. That's very fundamentally true.

Linus

2007-01-11 19:49:52

by Trond Myklebust

[permalink] [raw]
Subject: Re: O_DIRECT question

On Thu, 2007-01-11 at 11:00 -0800, Linus Torvalds wrote:
>
> On Thu, 11 Jan 2007, Trond Myklebust wrote:
> >
> > For NFS, the main feature of interest when it comes to O_DIRECT is
> > strictly uncached I/O. Replacing it with POSIX_FADV_NOREUSE won't help
> > because it can't guarantee that the page will be thrown out of the page
> > cache before some second process tries to read it. That is particularly
> > true if some dopey third party process has mmapped the file.
>
> You'd still be MUCH better off using the page cache, and just forcing the
> IO (but _with_ all the page cache synchronization still active). Which is
> trivial to do on the filesystem level, especially for something like NFS.
>
> If you bypass the page cache, you just make that "dopey third party
> process" problem worse. You now _guarantee_ that there are aliases with
> different data.

Quite, but that is sometimes an admissible state of affairs.

One of the things that was infuriating when we were trying to do shared
databases over the page cache was that someone would start some
unsynchronised process that had nothing to do with the database itself
(it would typically be a process that was backing up the rest of the
disk or something like that). Said process would end up pinning pages in
memory, and preventing the database itself from getting updated data from
the server.

IOW: the problem was not that of unsynchronised I/O per se. It was
rather that of allowing the application to set up its own
synchronisation barriers and to ensure that no pages are cached across
these barriers. POSIX_FADV_NOREUSE can't offer that guarantee.

> Of course, with NFS, the _server_ will resolve any aliases anyway, so at
> least you don't get file corruption, but you can get some really strange
> things (like the write of one process actually happening before, but being
> flushed _after_ and overriding the later write of the O_DIRECT process).

Writes are not the real problem here since shared databases typically do
implement sufficient synchronisation, and NFS can guarantee that only
the dirty data will be written out. However reading back the data is
problematic when you have insufficient control over the page cache.

The other issue is, of course, that databases don't _want_ to cache the
data in this situation, so the extra copy to the page cache is just a
bother. As you pointed out, that becomes less of an issue as processor
caches and memory speeds increase, but it is still apparently a
measurable effect.

Cheers
Trond

2007-01-11 23:01:39

by Phillip Susi

[permalink] [raw]
Subject: Re: O_DIRECT question

Michael Tokarev wrote:
> Linus Torvalds wrote:
>> On Thu, 11 Jan 2007, Viktor wrote:
>>> OK, madvise() used with mmap'ed file allows to have reads from a file
>>> with zero-copy between kernel/user buffers and don't pollute cache
>>> memory unnecessarily. But how about writes? How is to do zero-copy
>>> writes to a file and don't pollute cache memory without using O_DIRECT?
>>> Do I miss the appropriate interface?
>> mmap()+msync() can do that too.
>
> It can, somehow... until there's an I/O error. And *that* is just terrbile.

The other problem besides the inability to handle IO errors is that
mmap()+msync() is synchronous. You need to go async to keep the
pipelines full.

Now if someone wants to implement an aio version of msync and mlock,
that might do the trick. At least for MMU systems. Non MMU systems
just can't play mmap type games.

2007-01-11 23:06:44

by Hua Zhong

[permalink] [raw]
Subject: RE: O_DIRECT question

> The other problem besides the inability to handle IO errors is that
> mmap()+msync() is synchronous. You need to go async to keep
> the pipelines full.

msync(addr, len, MS_ASYNC); doesn't do what you want?

> Now if someone wants to implement an aio version of msync and
> mlock, that might do the trick. At least for MMU systems.
> Non MMU systems just can't play mmap type games.
>
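
[A minimal sketch of the mmap()+msync(MS_ASYNC) pattern in question, assuming an example file name; MS_ASYNC schedules writeback and returns immediately, but as the replies note there is no notification of completion.]

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 1 << 20;
    int fd = open("data.bin", O_RDWR | O_CREAT, 0644);   /* example file */
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, len) < 0) { perror("ftruncate"); return 1; }

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    memset(p, 0xab, len);                /* dirty the pages */

    /* Queue writeback of the dirty range and return immediately.
       There is no notification when the data actually reaches disk;
       msync(..., MS_SYNC) or fsync() would be needed to wait for it. */
    if (msync(p, len, MS_ASYNC) < 0)
        perror("msync");

    munmap(p, len);
    close(fd);
    return 0;
}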

2007-01-12 02:12:44

by Aubrey Li

[permalink] [raw]
Subject: Re: O_DIRECT question

On 1/11/07, Roy Huang <[email protected]> wrote:
> On an embedded system, limiting the page cache can relieve memory
> fragmentation. There is a patch against 2.6.19 which limits the page
> cache per opened file and the total page cache. When the limit is
> reached, it releases the page cache that overruns the limit.

The patch seems to work for me. But some suggestions come to mind:

1) Can we limit the total page cache, rather than the page cache per
file? Think about it: if total memory is 128M, 10% of it is 12.8M, so
if only one application is running it can use 12.8M of VFS cache and
performance will probably not be impacted. However, the current patch
limits the page cache per file, which means that even if only one
application runs it can only use CONFIG_PAGE_LIMIT pages of cache. That
may be too small for the application.
------------------snip---------------
if (mapping->nrpages >= mapping->pages_limit)
        balance_cache(mapping);
------------------snip---------------

2) A percentage would be a better way to control the value. Can we add
a proc interface to make it tunable?

Thanks,
-Aubrey

2007-01-12 02:13:34

by Bill Davidsen

[permalink] [raw]
Subject: Re: O_DIRECT question

linux-os (Dick Johnson) wrote:
> On Wed, 10 Jan 2007, Aubrey wrote:
>
>> Hi all,
>>
>> Opening file with O_DIRECT flag can do the un-buffered read/write access.
>> So if I need un-buffered access, I have to change all of my
>> applications to add this flag. What's more, Some scripts like "cp
>> oldfile newfile" still use pagecache and buffer.
>> Now, my question is, is there a existing way to mount a filesystem
>> with O_DIRECT flag? so that I don't need to change anything in my
>> system. If there is no option so far, What is the right way to achieve
>> my purpose?
>>
>> Thanks a lot.
>> -Aubrey
>> -
>
> I don't think O_DIRECT ever did what a lot of folks expect, i.e.,
> write this buffer of data to the physical device _now_. All I/O
> ends up being buffered. The `man` page states that the I/O will
> be synchronous, that at the conclusion of the call, data will have
> been transferred. However, the data written probably will not be
> in the physical device, perhaps only in a DMA-able buffer with
> a promise to get it to the SCSI device, soon.
>

No one (who read the specs) ever thought the write was "right
now," just that it was direct from user buffers. So it is not buffered,
but it is queued through the elevator.

> Maybe you need to say why you want to use O_DIRECT with its terrible
> performance?

Because it doesn't have terrible performance, and because the user knows
better than the OS what is "right," etc. I used it to eliminate cache
impact from large but non-essential operations; others use it on slow
machines to avoid the CPU and bus bandwidth impact of extra copies.

Please don't assume that users are unable to understand how it works
because you believe some other feature which does something else would
be just as good. There is no other option which causes the writes to be
queued right now and not use any cache, and that is sometimes just what
you want.

I do like the patch to limit per-file and per-system cache, though; in
some cases I really would like the system to slow down gradually rather
than fill 12GB of RAM with backlogged writes, then queue them and have
all other i/o crawl or stop.

--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979

2007-01-12 02:48:00

by Bill Davidsen

[permalink] [raw]
Subject: Re: O_DIRECT question

Nick Piggin wrote:
> Aubrey wrote:
>> On 1/11/07, Nick Piggin <[email protected]> wrote:
>
>>> What you _really_ want to do is avoid large mallocs after boot, or use
>>> a CPU with an mmu. I don't think nommu linux was ever intended to be a
>>> simple drop in replacement for a normal unix kernel.
>>
>>
>> Is there a position available working on mmu CPU? Joking, :)
>> Yes, some problems are serious on nommu linux. But I think we should
>> try to fix them not avoid them.
>
> Exactly, and the *real* fix is to modify userspace not to make > PAGE_SIZE
> mallocs[*] if it is to be nommu friendly. It is the kernel hacks to do
> things
> like limit cache size that are the bandaids.

Tuning the system to work appropriately for a given load is not a
band-aid. I have been saying since 2.5.x times that filling memory with
cached writes was a bad thing, and filling it with writes to a single
file was a doubly bad thing. Back in the 2.4.NN-aa kernels there were
some tunables to address that, but short of adding your own, 2.6 just
behaves VERY badly for some loads.
>
> Of course, being an embedded system, if they work for you then that's
> really fine and you can obviously ship with them. But they don't need to
> go upstream.
>
Anyone who has a few processes which write a lot of data and many
processes with more modest i/o needs will see the overfilling of cache
with data from one process or even for one file, and the resulting
impact on the performance of all other processes, particularly if the
kernel decides to write all the data for one file at once, because it
avoids seeks, even if it uses the drive for seconds. The code has gone
too far in the direction of throughput, at the expense of response to
other processes, given the (common) behavior noted.

--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979

2007-01-12 02:48:04

by Nick Piggin

[permalink] [raw]
Subject: Re: O_DIRECT question

Aubrey wrote:
> On 1/11/07, Roy Huang <[email protected]> wrote:
>
>> On an embedded system, limiting the page cache can relieve memory
>> fragmentation. There is a patch against 2.6.19 which limits the page
>> cache per opened file and the total page cache. When the limit is
>> reached, it releases the page cache that overruns the limit.
>
>
> The patch seems to work for me. But some suggestions come to mind:
>
> 1) Can we limit the total page cache, rather than the page cache per
> file? Think about it: if total memory is 128M, 10% of it is 12.8M, so
> if only one application is running it can use 12.8M of VFS cache and
> performance will probably not be impacted. However, the current patch
> limits the page cache per file, which means that even if only one
> application runs it can only use CONFIG_PAGE_LIMIT pages of cache. That
> may be too small for the application.
> ------------------snip---------------
> if (mapping->nrpages >= mapping->pages_limit)
>         balance_cache(mapping);
> ------------------snip---------------
>
> 2) A percentage would be a better way to control the value. Can we add
> a proc interface to make it tunable?

Even a global value isn't completely straightforward, and a per-file value
would be yet more work.

You see, it is hard to do any sort of directed reclaim at these pages.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2007-01-12 03:59:46

by Roy Huang

[permalink] [raw]
Subject: Re: O_DIRECT question

Limiting the total page cache can be considered first. Only if the total
page cache overruns its limit, check whether the file overruns its
per-file limit. If it does, release part of the page cache and wake up
kswapd at the same time.

On 1/12/07, Aubrey <[email protected]> wrote:
> On 1/11/07, Roy Huang <[email protected]> wrote:
> > On an embedded system, limiting the page cache can relieve memory
> > fragmentation. There is a patch against 2.6.19 which limits the page
> > cache per opened file and the total page cache. When the limit is
> > reached, it releases the page cache that overruns the limit.
>
> The patch seems to work for me. But some suggestions come to mind:
>
> 1) Can we limit the total page cache, rather than the page cache per
> file? Think about it: if total memory is 128M, 10% of it is 12.8M, so
> if only one application is running it can use 12.8M of VFS cache and
> performance will probably not be impacted. However, the current patch
> limits the page cache per file, which means that even if only one
> application runs it can only use CONFIG_PAGE_LIMIT pages of cache. That
> may be too small for the application.
> ------------------snip---------------
> if (mapping->nrpages >= mapping->pages_limit)
>         balance_cache(mapping);
> ------------------snip---------------
>
> 2) A percentage would be a better way to control the value. Can we add
> a proc interface to make it tunable?
>
> Thanks,
> -Aubrey
>

2007-01-12 04:31:21

by Nick Piggin

[permalink] [raw]
Subject: Re: O_DIRECT question

Bill Davidsen wrote:
> Nick Piggin wrote:
>
>> Aubrey wrote:
>>
>> Exactly, and the *real* fix is to modify userspace not to make >
>> PAGE_SIZE
>> mallocs[*] if it is to be nommu friendly. It is the kernel hacks to do
>> things
>> like limit cache size that are the bandaids.
>
>
> Tuning the system to work appropriately for a given load is not a
> band-aid.

We are talking about fragmentation. And limiting pagecache to try to
avoid fragmentation is a bandaid, especially when the problem can be solved
(not just papered over, but solved) in userspace.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2007-01-12 04:46:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: O_DIRECT question



On Fri, 12 Jan 2007, Nick Piggin wrote:
>
> We are talking about fragmentation. And limiting pagecache to try to
> avoid fragmentation is a bandaid, especially when the problem can be solved
> (not just papered over, but solved) in userspace.

It's not clear that the problem _can_ be solved in user space.

It's easy enough to say "never allocate more than a page". But it's often
not REALISTIC.

Very basic issue: the perfect is the enemy of the good. Claiming that
there is a "proper solution" is usually a total red herring. Quite often
there isn't, and the "paper over" is actually not papering over, it's
quite possibly the best solution there is.

Linus

2007-01-12 04:56:55

by Nick Piggin

[permalink] [raw]
Subject: Re: O_DIRECT question

Linus Torvalds wrote:
>
> On Fri, 12 Jan 2007, Nick Piggin wrote:
>
>>We are talking about fragmentation. And limiting pagecache to try to
>>avoid fragmentation is a bandaid, especially when the problem can be solved
>>(not just papered over, but solved) in userspace.
>
>
> It's not clear that the problem _can_ be solved in user space.
>
> It's easy enough to say "never allocate more than a page". But it's often
> not REALISTIC.
>
> Very basic issue: the perfect is the enemy of the good. Claiming that
> there is a "proper solution" is usually a total red herring. Quite often
> there isn't, and the "paper over" is actually not papering over, it's
> quite possibly the best solution there is.

Yeah *smallish* higher order allocations are fine, and we use them all the
time for things like stacks or networking.

But Aubrey (who somehow got removed from the cc list) wants to do order 9
allocations from userspace in his nommu environment. I'm just trying to be
realistic when I say that this isn't going to be robust and a userspace
solution is needed.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2007-01-12 04:58:42

by Nick Piggin

[permalink] [raw]
Subject: Re: O_DIRECT question

Nick Piggin wrote:
> Linus Torvalds wrote:

>> Very basic issue: the perfect is the enemy of the good. Claiming that
>> there is a "proper solution" is usually a total red herring. Quite
>> often there isn't, and the "paper over" is actually not papering over,
>> it's quite possibly the best solution there is.
>
>
> Yeah *smallish* higher order allocations are fine, and we use them all the
> time for things like stacks or networking.
>
> But Aubrey (who somehow got removed from the cc list) wants to do order 9
> allocations from userspace in his nommu environment. I'm just trying to be
> realistic when I say that this isn't going to be robust and a userspace
> solution is needed.

Oh, and also: I don't disagree with that limiting pagecache to some %
might be useful for other reasons.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2007-01-12 05:19:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: O_DIRECT question



On Fri, 12 Jan 2007, Nick Piggin wrote:
>
> Yeah *smallish* higher order allocations are fine, and we use them all the
> time for things like stacks or networking.
>
> But Aubrey (who somehow got removed from the cc list) wants to do order 9
> allocations from userspace in his nommu environment. I'm just trying to be
> realistic when I say that this isn't going to be robust and a userspace
> solution is needed.

I do agree that order-9 allocations simply is unlikely to work without
some pre-allocation notion or some serious work at active de-fragmentation
(and the page cache is likely to be the _least_ of the problems people
will hit - slab and other kernel allocations are likely to be much much
harder to handle, since you can't free them in quite as directed a
manner).

But for smallish-order (eg perhaps 3-4 possibly even more if you are
careful in other places), the page cache limiter may well be a "good
enough" solution in practice, especially if other allocations can be
controlled by strict usage patterns (which is not realistic in a general-
purpose kind of situation, but might be realistic in embedded).

Linus

2007-01-12 05:22:17

by Aubrey Li

[permalink] [raw]
Subject: Re: O_DIRECT question

On 1/12/07, Nick Piggin <[email protected]> wrote:
> Linus Torvalds wrote:
> >
> > On Fri, 12 Jan 2007, Nick Piggin wrote:
> >
> >>We are talking about fragmentation. And limiting pagecache to try to
> >>avoid fragmentation is a bandaid, especially when the problem can be solved
> >>(not just papered over, but solved) in userspace.
> >
> >
> > It's not clear that the problem _can_ be solved in user space.
> >
> > It's easy enough to say "never allocate more than a page". But it's often
> > not REALISTIC.
> >
> > Very basic issue: the perfect is the enemy of the good. Claiming that
> > there is a "proper solution" is usually a total red herring. Quite often
> > there isn't, and the "paper over" is actually not papering over, it's
> > quite possibly the best solution there is.
>
> Yeah *smallish* higher order allocations are fine, and we use them all the
> time for things like stacks or networking.
>
> But Aubrey (who somehow got removed from the cc list) wants to do order 9
> allocations from userspace in his nommu environment. I'm just trying to be
> realistic when I say that this isn't going to be robust and a userspace
> solution is needed.
>
Hmm... aside from big order allocations from user space, if there is
a large application we need to run, it has to be loaded into
memory, so we have to allocate a big block to accommodate it. Kernel
functions like load_elf_fdpic_binary() etc. will request contiguous
memory, so if the VFS eats up the free memory, loading fails.

-Aubrey

2007-01-12 07:57:43

by dean gaudet

[permalink] [raw]
Subject: Re: O_DIRECT question

On Thu, 11 Jan 2007, Linus Torvalds wrote:

> On Thu, 11 Jan 2007, Viktor wrote:
> >
> > OK, madvise() used with mmap'ed file allows to have reads from a file
> > with zero-copy between kernel/user buffers and don't pollute cache
> > memory unnecessarily. But how about writes? How is to do zero-copy
> > writes to a file and don't pollute cache memory without using O_DIRECT?
> > Do I miss the appropriate interface?
>
> mmap()+msync() can do that too.
>
> Also, regular user-space page-aligned data could easily just be moved into
> the page cache. We actually have a lot of the infrastructure for it. See
> the "splice()" system call.

it seems to me that if splice and fadvise and related things are
sufficient for userland to take care of things "properly" then O_DIRECT
could be changed into splice/fadvise calls either by a library or in the
kernel directly...

looking at the splice(2) api it seems like it'll be difficult to implement
O_DIRECT pread/pwrite from userland using splice... so there'd need to be
some help there.

i'm probably missing something.

-dean

2007-01-12 14:58:27

by Bill Davidsen

[permalink] [raw]
Subject: Re: O_DIRECT question

Aubrey wrote:
> On 1/12/07, Nick Piggin <[email protected]> wrote:
>> Linus Torvalds wrote:
>> >
>> > On Fri, 12 Jan 2007, Nick Piggin wrote:
>> >
>> >>We are talking about fragmentation. And limiting pagecache to
>> try to
>> >>avoid fragmentation is a bandaid, especially when the problem can
>> be solved
>> >>(not just papered over, but solved) in userspace.
>> >
>> >
>> > It's not clear that the problem _can_ be solved in user space.
>> >
>> > It's easy enough to say "never allocate more than a page". But it's
>> often
>> > not REALISTIC.
>> >
>> > Very basic issue: the perfect is the enemy of the good. Claiming that
>> > there is a "proper solution" is usually a total red herring. Quite
>> often
>> > there isn't, and the "paper over" is actually not papering over, it's
>> > quite possibly the best solution there is.
>>
>> Yeah *smallish* higher order allocations are fine, and we use them
>> all the
>> time for things like stacks or networking.
>>
>> But Aubrey (who somehow got removed from the cc list) wants to do
>> order 9
>> allocations from userspace in his nommu environment. I'm just trying
>> to be
>> realistic when I say that this isn't going to be robust and a userspace
>> solution is needed.
>>
> Hmm... aside from big order allocations from user space, if there is
> a large application we need to run, it has to be loaded into
> memory, so we have to allocate a big block to accommodate it. Kernel
> functions like load_elf_fdpic_binary() etc. will request contiguous
> memory, so if the VFS eats up the free memory, loading fails.
Before we had virtual memory we had only a base address register, start
at this location and go thus far, and user program memory had to be
contiguous. To change a program size, all other programs might be moved,
either by memory copy or actual swap to disk if total memory became a
problem. To minimize the pain, programs were loaded at one end of
memory, and system buffers and such were allocated at the other. That
allowed the most recently loaded program the best chance of being able
to grow without thrashing.

The point is that if you want to be able to allocate at all, sometimes
you will have to write dirty pages, garbage collect, and move or swap
programs. The hardware is just too limited to do something less painful,
and the user can't see memory to do things better. Linus is right,
'Claiming that there is a "proper solution" is usually a total red
herring. Quite often there isn't, and the "paper over" is actually not
papering over, it's quite possibly the best solution there is.' I think
any solution is going to be ugly, unfortunately.

--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979

2007-01-12 15:21:19

by Phillip Susi

[permalink] [raw]
Subject: Re: O_DIRECT question

Hua Zhong wrote:
>> The other problem besides the inability to handle IO errors is that
>> mmap()+msync() is synchronous. You need to go async to keep
>> the pipelines full.
>
> msync(addr, len, MS_ASYNC); doesn't do what you want?
>

No, because there is no notification of completion. In fact, does this
call actually even avoid blocking in the current code, while asking the
kernel to flush the pages in the background?

Even if it performs the sync in the background, what about faulting in
the pages to be synced? For instance, if you splice pages from a source
mmaped file into the destination mmap, then msync on the destination,
doesn't the process still block to fault in the source pages?


2007-01-12 15:27:45

by Phillip Susi

[permalink] [raw]
Subject: Re: O_DIRECT question

dean gaudet wrote:
> it seems to me that if splice and fadvise and related things are
> sufficient for userland to take care of things "properly" then O_DIRECT
> could be changed into splice/fadvise calls either by a library or in the
> kernel directly...

No, because the semantics are entirely different. An application using
read/write with O_DIRECT expects read() to block until data is
physically fetched from the device. fadvise() does not FORCE the kernel
to discard cache, it only hints that it should, so a read() or mmap()
very well may reuse a cached page instead of fetching from the disk
again. The application also expects write() to block until the data is
on the disk. In the case of a blocking write, you could splice/msync,
but what about aio?


2007-01-12 16:59:27

by Viktor

[permalink] [raw]
Subject: Re: O_DIRECT question

Linus Torvalds wrote:
>>OK, madvise() used with mmap'ed file allows to have reads from a file
>>with zero-copy between kernel/user buffers and don't pollute cache
>>memory unnecessarily. But how about writes? How is to do zero-copy
>>writes to a file and don't pollute cache memory without using O_DIRECT?
>>Do I miss the appropriate interface?
>
>
> mmap()+msync() can do that too.

Sorry, I wasn't sufficiently clear. Mmap()+msync() can't be used for
that if the data to be written come from some external source, like
video capture hardware, which DMAs data directly into user-space
buffers. Using an mmap'ed area for those DMA buffers doesn't look like
a good idea, because, e.g., it will involve unneeded disk reads on the
first page faults.

So, some O_DIRECT-like interface should exist in the system. Also, as
Michael Tokarev noted, operations over mmap'ed areas don't provide good
ways for error handling, which effectively makes them unusable for
something serious.

> Also, regular user-space page-aligned data could easily just be moved into
> the page cache. We actually have a lot of the infrastructure for it. See
> the "splice()" system call. It's just not very widely used, and the
> "drop-behind" behaviour (to then release the data) isn't there. And I bet
> that there's lots of work needed to make it work well in practice, but
> from a conceptual standpoint the O_DIRECT method really is just about the
> *worst* way to do things.

splice() needs 2 file descriptors, but looking at it I found the
vmsplice() syscall, which, it seems, can do the needed actions, although
I'm not sure it can work with files and zero-copy. Thanks for pointing
out those interfaces.

2007-01-12 17:03:29

by Viktor

[permalink] [raw]
Subject: Re: O_DIRECT question

Linus Torvalds wrote:
>>>>O_DIRECT is still crazily racy versus pagecache operations.
>>>
>>>Yes. O_DIRECT is really fundamentally broken. There's just no way to fix
>>>it sanely.
>>
>>How about aliasing O_DIRECT to POSIX_FADV_NOREUSE (sortof) ?
>
>
> That is what I think some users could do. If the main issue with O_DIRECT
> is the page cache allocations, if we instead had better (read: "any")
> support for POSIX_FADV_NOREUSE, one class of reasons O_DIRECT usage would
> just go away.
>
> See also the patch that Roy Huang posted about another approach to the
> same problem: just limiting page cache usage explicitly.
>
> That's not the _only_ issue with O_DIRECT, though. It's one big one, but
> people like to think that the memory copy makes a difference when you do
> IO too (I think it's likely pretty debatable in real life, but I'm totally
> certain you can benchmark it, probably even pretty easily especially if
> you have fairly studly IO capabilities and a CPU that isn't quite as
> studly).
>
> So POSIX_FADV_NOREUSE kind of support is one _part_ of the O_DIRECT
> picture, and depending on your problems (in this case, the embedded world)
> it may even be the *biggest* part. But it's not the whole picture.

From the 2.6.19 sources it looks like POSIX_FADV_NOREUSE is a no-op there.

> Linus

2007-01-12 18:07:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: O_DIRECT question



On Thu, 11 Jan 2007, dean gaudet wrote:
>
> it seems to me that if splice and fadvise and related things are
> sufficient for userland to take care of things "properly" then O_DIRECT
> could be changed into splice/fadvise calls either by a library or in the
> kernel directly...

The problem is two-fold:

- the fact that databases use O_DIRECT and all the commercial people are
perfectly happy to use a totally idiotic interface (and they don't care
about the problems) means that things like fadvise() don't actually
get the TLC. For example, the USEONCE thing isn't actually
_implemented_, even though from a design standpoint, it would in many
ways be preferable over O_DIRECT.

It's not just fadvise. It's a general problem for any new interfaces
where the old interfaces "just work" - never mind if they are nasty.
And O_DIRECT isn't actually all that nasty for users (although the
alignment restrictions are obviously irritating, but they are mostly
fundamental _hardware_ alignment restrictions, so..). It's only nasty
from a kernel internal security/serialization standpoint.

So in many ways, apps don't want to change, because they don't really
see the problems.

(And, as seen in this thread: uses like NFS don't see the problems
either, because there the serialization is done entirely somewhere
*else*, so the NFS people don't even understand why the whole interface
sucks in the first place)

- a lot of the reasons for problems for O_DIRECT is the semantics. If we
could easily implement the O_DIRECT semantics using something else, we
would. But it's semantically not allowed to steal the user page, and it
has to wait for it to be all done with, because those are the semantics
of "write()".

So one of the advantages of vmsplice() and friends is literally that it
could allow page stealing, and allow the semantics where any changes to
the page (in user space) might make it to disk _after_ vmsplice() has
actually already returned, because we literally re-use the page (ie
it's fundamentally an async interface).

But again, fadvise and vmsplice etc aren't even getting the attention,
because right now they are only used by small programs (and generally not
done by people who also work on the kernel, and can see that it really
would be better to use more natural interfaces).

> looking at the splice(2) api it seems like it'll be difficult to implement
> O_DIRECT pread/pwrite from userland using splice... so there'd need to be
> some help there.

You'd use vmsplice() to put the write buffers into kernel space (user
space sees it's a pipe file descriptor, but you should just ignore that:
it's really just a kernel buffer). And then splice the resulting kernel
buffers to the destination.
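
[A minimal sketch of the vmsplice()+splice() sequence described above, assuming Linux with those syscalls available; the buffer and output file are made up, and a real writer would loop, since a pipe only buffers a limited amount (64KB by default) at a time.]

#define _GNU_SOURCE            /* for vmsplice() and splice() */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    int out = open("out.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) { perror("open"); return 1; }

    int pfd[2];
    if (pipe(pfd) < 0) { perror("pipe"); return 1; }

    static char buf[65536];
    memset(buf, 'x', sizeof(buf));

    /* Hand the user pages to the kernel; the pipe is just a handle on a
       kernel buffer. Note: the kernel may still reference these pages
       after vmsplice() returns - exactly the async semantics described
       above. */
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
    ssize_t n = vmsplice(pfd[1], &iov, 1, 0);
    if (n < 0) { perror("vmsplice"); return 1; }

    /* Move the kernel buffer to its destination (the file). */
    loff_t off = 0;
    if (splice(pfd[0], NULL, out, &off, n, SPLICE_F_MOVE) < 0)
        perror("splice");

    close(pfd[0]);
    close(pfd[1]);
    close(out);
    return 0;
}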

Linus

2007-01-12 20:27:59

by Chris Mason

[permalink] [raw]
Subject: Re: O_DIRECT question

On Fri, Jan 12, 2007 at 10:06:22AM -0800, Linus Torvalds wrote:

> > looking at the splice(2) api it seems like it'll be difficult to implement
> > O_DIRECT pread/pwrite from userland using splice... so there'd need to be
> > some help there.
>
> You'd use vmsplice() to put the write buffers into kernel space (user
> space sees it's a pipe file descriptor, but you should just ignore that:
> it's really just a kernel buffer). And then splice the resulting kernel
> buffers to the destination.

I recently spent some time trying to integrate O_DIRECT locking with
page cache locking. The basic theory is that instead of using
semaphores for solving O_DIRECT vs buffered races, you put something
into the radix tree (I call it a placeholder) to keep the page cache
users out, and lock any existing pages that are present.

O_DIRECT does save CPU by avoiding copies, but it also saves CPU through
fewer radix tree operations during massive IOs. The cost of radix tree
insertion/deletion on 1MB O_DIRECT ios added ~10% system time on
my tiny little dual core box. I'm sure it would be much worse if there
was lock contention on a big numa machine, and it grows as the io grows
(SGI does massive O_DIRECT ios).

To help reduce radix churn, I made it possible for a single placeholder
entry to lock down a range in the radix:

http://thread.gmane.org/gmane.linux.file-systems/12263

It looks to me as though vmsplice is going to have the same issues as my
early patches. The current splice code can avoid the copy but is still
working in page sized chunks. Also, splice doesn't support zero copy on
things smaller than page sized chunks.

The compromise my patch makes is to hide placeholders from almost
everything except the DIO code. It may be worthwhile to turn the
placeholders into an IO marker that can be useful to filemap_fdatawrite
and friends.

It should be able to:

record the userland/kernel pages involved in a given io
map blocks from the FS for making a bio
start the io
wake people up when the io is done

This would allow splice to operate without stealing the userland page
(stealing would still be an option of course), and could get rid of big
chunks of fs/direct-io.c.

-chris

2007-01-12 20:46:22

by Michael Tokarev

[permalink] [raw]
Subject: Re: O_DIRECT question

Chris Mason wrote:
[]
> I recently spent some time trying to integrate O_DIRECT locking with
> page cache locking. The basic theory is that instead of using
> semaphores for solving O_DIRECT vs buffered races, you put something
> into the radix tree (I call it a placeholder) to keep the page cache
> users out, and lock any existing pages that are present.

But seriously - what about just disallowing non-O_DIRECT opens together
with O_DIRECT ones?

If the thing would still allow non-DIRECT READ-ONLY opens, I personally
see no problems whatsoever, at all. If non-DIRECT READ-ONLY opens were
disallowed too -- well, a bit less nice, but still workable (allowing
non-direct read-only opens permits online backup of database files
opened in O_DIRECT mode using other tools such as `cp'; if non-direct
opens aren't allowed, I'll switch to using dd or somesuch).

Yes, there may still be a race between ftruncate() and reads (either
direct or not), or when filling gaps by writing into places which were
skipped by using ftruncate. I don't know how serious those races are.

That is to say - if the whole thing were a bit more strict about the
allowed set of operations, the races (or some of them, anyway) would
just go away (and maybe it would work even better, thanks to the removal
of quite some code and lock contention), and maybe after that Linus
would like the whole thing a bit better... ;)

After all the explanations, I still don't see anything wrong with the
interface itself. O_DIRECT isn't "different semantics" - we're still
writing and reading some data. Yes, O_DIRECT and non-O_DIRECT usages
somewhat contradict each other, but there are other ways to make
the two happy, instead of introducing a lot of stupid, complex, and racy
code all over.

/mjt

2007-01-12 20:52:10

by Michael Tokarev

[permalink] [raw]
Subject: Re: O_DIRECT question

Michael Tokarev wrote:
[]
> After all the explanations, I still don't see anything wrong with the
> interface itself. O_DIRECT isn't "different semantics" - we're still
> writing and reading some data. Yes, O_DIRECT and non-O_DIRECT usages
> somewhat contradict each other, but there are other ways to make
> the two happy, instead of introducing a lot of stupid, complex, and racy
> code all over.

By the way. I just ran - for fun - a read test of a raid array.

Reading blocks of 512 kbytes, starting at random places on a 400Gb
array, using 64 threads.

O_DIRECT: 336.73 MB/sec.
!O_DIRECT: 146.00 MB/sec.

Quite a... difference here.

Using posix_fadvise() does not improve it.
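
[A sketch of one thread of such a test, assuming Linux; the device name, block size and alignment are illustrative, and the pthread wrapper for the 64 threads is omitted to keep it short.]

#define _GNU_SOURCE                  /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLK   (512 * 1024)           /* 512 KB per read   */
#define COUNT 1000                   /* reads per thread  */

int main(void)
{
    int fd = open("/dev/md0", O_RDONLY | O_DIRECT);   /* example array */
    if (fd < 0) { perror("open"); return 1; }

    off_t dev_size = lseek(fd, 0, SEEK_END);

    void *buf;
    if (posix_memalign(&buf, 4096, BLK)) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    srandom(getpid());
    for (int i = 0; i < COUNT; i++) {
        /* random block-aligned offset inside the device */
        off_t off = ((off_t)random() % (dev_size / BLK)) * BLK;
        if (pread(fd, buf, BLK, off) != BLK) { perror("pread"); return 1; }
    }
    free(buf);
    close(fd);
    return 0;
}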

/mjt

2007-01-12 21:03:41

by Michael Tokarev

[permalink] [raw]
Subject: Re: O_DIRECT question

Michael Tokarev wrote:
> Michael Tokarev wrote:
> By the way. I just ran - for fun - a read test of a raid array.
>
> Reading blocks of 512 kbytes, starting at random places on a 400Gb
> array, using 64 threads.
>
> O_DIRECT: 336.73 MB/sec.
> !O_DIRECT: 146.00 MB/sec.

And when turning off read-ahead, the speed dropped to 30 MB/sec. Read-ahead
should not help here, I think... But after analyzing the "randomness" a bit,
it turned out a lot of requests were going to places "near" ones which had
been read recently. After switching to another random number generator,
the speed in the case WITH readahead enabled dropped to almost 5Mb/sec ;)

And sure thing, withOUT O_DIRECT, the whole system is almost dead under this
load - because everything is thrown away from the cache, even caches of /bin
/usr/bin etc... ;) (For that, fadvise() seems to help a bit, but not a lot).

(No, really - this load isn't entirely synthetic. It's a typical database
workload - random I/O all over, on a large file. If it can, it combines
several I/Os into one, by requesting more than a single block at a time,
but overall it is random.)

/mjt

2007-01-12 21:18:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: O_DIRECT question



On Sat, 13 Jan 2007, Michael Tokarev wrote:
>
> (No, really - this load isn't entirely synthetic. It's a typical database
> workload - random I/O all over, on a large file. If it can, it combines
> several I/Os into one, by requesting more than a single block at a time,
> but overall it is random.)

My point is that you can get basically ALL THE SAME GOOD BEHAVIOUR without
having all the BAD behaviour that O_DIRECT adds.

For example, just the requirement that O_DIRECT can never create a file
mapping, and can never interact with ftruncate would actually make
O_DIRECT a lot more palatable to me. Together with just the requirement
that an O_DIRECT open would literally disallow any non-O_DIRECT accesses,
and flush the page cache entirely, would make all the aliases go away.

At that point, O_DIRECT would be a way of saying "we're going to do
uncached accesses to this pre-allocated file". Which is a half-way
sensible thing to do.

But what O_DIRECT does right now is _not_ really sensible, and the
O_DIRECT propeller-heads seem to have some problem even admitting that
there _is_ a problem, because they don't care.

A lot of DB people seem to simply not care about security or anything
else. I'm trying to tell you that quoting numbers is
pointless, when simply the CORRECTNESS of O_DIRECT is very much in doubt.

I can calculate PI to a billion decimal places in my head in .1 seconds.
If you don't care about the CORRECTNESS of the result, that is.

See? It's not about performance. It's about O_DIRECT being fundamentally
broken as it behaves right now.

Linus

2007-01-12 21:54:31

by Michael Tokarev

[permalink] [raw]
Subject: Re: O_DIRECT question

Linus Torvalds wrote:
[]
> My point is that you can get basically ALL THE SAME GOOD BEHAVIOUR without
> having all the BAD behaviour that O_DIRECT adds.

*This* point I got from the beginning, once I tried to think about how it
is all done internally (I had never thought about that, because I'm not a
kernel hacker to start with) -- currently, Linux has ugly/racy places
which are either difficult or impossible to fix, all due to this O_DIRECT
thing which interacts badly with other access "methods".

> For example, just the requirement that O_DIRECT can never create a file
> mapping, and can never interact with ftruncate would actually make
> O_DIRECT a lot more palatable to me. Together with just the requirement
> that an O_DIRECT open would literally disallow any non-O_DIRECT accesses,
> and flush the page cache entirely, would make all the aliases go away.
>
> At that point, O_DIRECT would be a way of saying "we're going to do
> uncached accesses to this pre-allocated file". Which is a half-way
> sensible thing to do.

Half-way?

> But what O_DIRECT does right now is _not_ really sensible, and the
> O_DIRECT propeller-heads seem to have some problem even admitting that
> there _is_ a problem, because they don't care.

Well. In fact, there's NO problems to admit.

Yes, yes, yes yes - when you think about it from a general point of
view, and think how non-O_DIRECT and O_DIRECT access fits together,
it's a complete mess, and you're 100% right it's a mess.

But. Those damn "database people" don't mix and match the two accesses
together (I'm not one of them, either - I'm just trying to use a DB
product on linux). So there's just no issue. The solution to in-kernel
races and problems in this case is the usage scenario, and in following
simple usage rules. Basically, the above requiriment - "don't mix&match
the two together" - is implemented in userspace (yes, there's no guarantee
that someone/thing will not do some evil thing, but that's controlled by
file permissions). That is, database software itself will not try to use
the thing in a wrong way. Simple as that.

> A lot of DB people seem to simply not care about security or anything
> else. I'm trying to tell you that quoting numbers is
> pointless, when simply the CORRECTNESS of O_DIRECT is very much in doubt.

When done properly - be it in user- or kernel-space, it IS correct. No
database people are ftruncating() a file *and* reading from the past-end
of it at the same time for example, and don't mix-n-match cached and direct
io, at least not for the same part of a file (if any do, they're really
braindead, or it's just a plain bug).

> I can calculate PI to a billion decimal places in my head in .1 seconds.
> If you don't care about the CORRECTNESS of the result, that is.
>
> See? It's not about performance. It's about O_DIRECT being fundamentally
> broken as it behaves right now.

I recall again the above: the actual USAGE of O_DIRECT, as implemented
in database software, tries to ensure there's no brokenness, especially
fundamental brokenness, just by not performing parallel direct/non-direct
read/writes/truncates. This way, the thing Just Works, works *correctly*
(provided there's no bugs all the way down to a device), *and* works *fast*.

By the way, I can think of some useful cases where *parts* of a file are
mmap()ed (even for RW access), and parts are being read/written with O_DIRECT.
But that's probably some corner cases.

/mjt

2007-01-12 22:01:50

by Zan Lynx

[permalink] [raw]
Subject: Disk Cache, Was: O_DIRECT question

On Sat, 2007-01-13 at 00:03 +0300, Michael Tokarev wrote:
[snip]
> And sure thing, withOUT O_DIRECT, the whole system is almost dead under this
> load - because everything is thrown away from the cache, even caches of /bin
> /usr/bin etc... ;) (For that, fadvise() seems to help a bit, but not a lot).

One thing that I've been using, and seems to work well, is a customized
version of the readahead program several distros use during boot up.

Mine starts off doing:
mlockall(MCL_CURRENT|MCL_FUTURE);
...yadda, yadda...

and for each file listed:
...open, stat stuff...
/* mmap() reports failure with MAP_FAILED, not NULL */
if( MAP_FAILED == mmap(
NULL, stat_buf.st_size,
PROT_READ, MAP_SHARED|MAP_LOCKED|MAP_POPULATE,
fd, 0)
) {
fprintf(stderr, "'%s' ", file);
perror("mmap");
}
...more stuff...
and then ends with:
pause();
and it sits there forever.

As far as I can tell, this makes the program and library code stay in
RAM. At least, after a drop_caches nautilus doesn't load 12 MB off
disk, it just starts. It has to be reloaded after software updates and
after prelinking. I find the 250 MB used to be worthwhile, even if it's
kinda Windowsey.

Something like that could keep your system responsive no matter what the
disk cache is doing otherwise.
--
Zan Lynx <[email protected]>


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2007-01-12 22:11:07

by Michael Tokarev

[permalink] [raw]
Subject: Re: Disk Cache, Was: O_DIRECT question

Zan Lynx wrote:
> On Sat, 2007-01-13 at 00:03 +0300, Michael Tokarev wrote:
> [snip]
>> And sure thing, withOUT O_DIRECT, the whole system is almost dead under this
>> load - because everything is thrown away from the cache, even caches of /bin
>> /usr/bin etc... ;) (For that, fadvise() seems to help a bit, but not a lot).
>
> One thing that I've been using, and seems to work well, is a customized
> version of the readahead program several distros use during boot up.

[idea to lock some (commonly-used) cache pages in memory]

> Something like that could keep your system responsive no matter what the
> disk cache is doing otherwise.

Unfortunately it won't. Sure, things like libc.so etc will be force-cached
and will start fast. But not my data files and other stuff (what an
unfortunate thing: memory usually is smaller in size than disks ;)

I can do my usual work without noticing that something's working the disks
intensively, doing O_DIRECT I/O. For example, I can run a large report on
a database, which requires a lot of disk I/O, and run a kernel compile at
the same time. Sure, disk access is a lot slower, but the disk cache helps
a lot, too. My kernel compile will not be much slower than usual. But if I
turn O_DIRECT off, the compile will take ages to finish. *And* the report
running, too! Because the system tries hard to cache the WRONG pages!
(yes, I remember fadvise &Co - which aren't used by the database(s) currently,
and quite a lot of words have been said about that, too; I also noticed it's
slower as well, at least currently.)

/mjt

2007-01-12 22:11:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: O_DIRECT question



On Sat, 13 Jan 2007, Michael Tokarev wrote:
> >
> > At that point, O_DIRECT would be a way of saying "we're going to do
> > uncached accesses to this pre-allocated file". Which is a half-way
> > sensible thing to do.
>
> Half-way?

I suspect a lot of people actually have other reasons to avoid caches.

For example, the reason to do O_DIRECT may well not be that you want to
avoid caching per se, but simply because you want to limit page cache
activity. In which case O_DIRECT "works", but it's really the wrong thing
to do. We could export other ways to do what people ACTUALLY want, that
doesn't have the downsides.

For example, the page cache is absolutely required if you want to mmap.
There's no way you can do O_DIRECT and mmap at the same time and expect
any kind of sane behaviour. It may not be what a DB wants to use, but it's
an example of where O_DIRECT really falls down.

> > But what O_DIRECT does right now is _not_ really sensible, and the
> > O_DIRECT propeller-heads seem to have some problem even admitting that
> > there _is_ a problem, because they don't care.
>
> Well. In fact, there's NO problems to admit.
>
> Yes, yes, yes yes - when you think about it from a general point of
> view, and think how non-O_DIRECT and O_DIRECT access fits together,
> it's a complete mess, and you're 100% right it's a mess.

You can't admit that even O_DIRECT _without_ any non-O_DIRECT actually
fails in many ways right now.

I've already mentioned ftruncate and block allocation. You don't seem to
understand that those are ALSO a problem.

Linus

2007-01-12 22:26:48

by Michael Tokarev

[permalink] [raw]
Subject: Re: O_DIRECT question

Linus Torvalds wrote:
>
> On Sat, 13 Jan 2007, Michael Tokarev wrote:
>>> At that point, O_DIRECT would be a way of saying "we're going to do
>>> uncached accesses to this pre-allocated file". Which is a half-way
>>> sensible thing to do.
>> Half-way?
>
> I suspect a lot of people actually have other reasons to avoid caches.
>
> For example, the reason to do O_DIRECT may well not be that you want to
> avoid caching per se, but simply because you want to limit page cache
> activity. In which case O_DIRECT "works", but it's really the wrong thing
> to do. We could export other ways to do what people ACTUALLY want, that
> doesn't have the downsides.
>
> For example, the page cache is absolutely required if you want to mmap.
> There's no way you can do O_DIRECT and mmap at the same time and expect
> any kind of sane behaviour. It may not be what a DB wants to use, but it's
> an example of where O_DIRECT really falls down.

Provided the two are about the same part of a file. If not, and if
the file is "divided" on a proper boundary (sector/page/whatever-aligned),
there are no issues, at least not if all the blocks of the file have been
allocated (no gaps, that is).

What I was referring to in my last email - and said it's a corner case - is:
mmap() start of a file, say, first megabyte of it, where some index/bitmap is
located, and use direct-io on the rest. So the two don't overlap.

Still problematic?

>>> But what O_DIRECT does right now is _not_ really sensible, and the
>>> O_DIRECT propeller-heads seem to have some problem even admitting that
>>> there _is_ a problem, because they don't care.
>> Well. In fact, there's NO problems to admit.
>>
>> Yes, yes, yes yes - when you think about it from a general point of
>> view, and think how non-O_DIRECT and O_DIRECT access fits together,
>> it's a complete mess, and you're 100% right it's a mess.
>
> You can't admit that even O_DIRECT _without_ any non-O_DIRECT actually
> fails in many ways right now.
>
> I've already mentioned ftruncate and block allocation. You don't seem to
> understand that those are ALSO a problem.

I do understand this. And this, too, is solved right now in userspace.
For example, when oracle allocates a file for its data, or when it extends
the file, it writes something to every block of new space (using O_DIRECT
while at it, but that's a different story). The thing is: while it is doing
that, no process tries to do anything with that (part of a) file (not counting
some external processes run by evil hackers ;) So there are still no races
or fundamental brokenness *in usage*.

It uses ftruncate() to create or extend a file, *and* does O_DIRECT writes
to force block allocations. That's probably not right, and that alone is
probably difficult to implement in the kernel (I just don't know; what I know
for sure is that this way is very slow on ext3). Maybe because there's no
way to tell the kernel something like "set the file size to this and actually
*allocate* space for it" (if it doesn't write some structure to the file).
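A minimal sketch of what such an interface looks like from userspace, assuming
posix_fallocate() is available (glibc originally emulated it by writing to
every block, i.e. the slow path described above; later kernels back it with a
real fallocate() syscall); the path, mode and size below are purely
illustrative:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

/* Preallocate `size' bytes for `path' so that later O_DIRECT writes hit
 * already-allocated blocks; returns 0 on success, -1 on error. */
int preallocate(const char *path, off_t size)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    int err = posix_fallocate(fd, 0, size);    /* returns 0 or an errno value */
    if (err) {
        fprintf(stderr, "posix_fallocate: error %d\n", err);
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}

On filesystems that can allocate natively this sets the size AND allocates
the space in one call, without the write-every-block dance.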

What I dislike very much is half-solutions. And current O_DIRECT indeed
looks like half a solution, because sometimes it works, and sometimes, in
a *wrong* usage scenario, it doesn't, or is racy, etc, and the kernel *allows*
such a wrong scenario. Software should either work correctly, or disallow
usage where it can't guarantee correctness. Currently, the kernel allows
incorrect usage, and that, plus all the ugly things done in the code in an
attempt to fix that, suxx.

But the whole thing is not (fundamentally) broken.

/mjt

2007-01-12 22:35:13

by Erik Andersen

[permalink] [raw]
Subject: Re: O_DIRECT question

On Fri Jan 12, 2007 at 05:09:09PM -0500, Linus Torvalds wrote:
> I suspect a lot of people actually have other reasons to avoid caches.
>
> For example, the reason to do O_DIRECT may well not be that you want to
> avoid caching per se, but simply because you want to limit page cache
> activity. In which case O_DIRECT "works", but it's really the wrong thing
> to do. We could export other ways to do what people ACTUALLY want, that
> doesn't have the downsides.

I was rather fond of the old O_STREAMING patch by Robert Love,
which added an open() flag telling the kernel to not keep data
from the current file in cache by dropping pages from the
pagecache before the current index. O_STREAMING was very nice
for when you know you want to read a large file sequentially
without polluting the rest of the cache with GB of data that you
plan on reading only once and discarding. It worked nicely at doing
what many people want to use O_DIRECT for.

Using O_STREAMING you would get normal read/write semantics since
you still had the pagecache caching your data, but only the
not-yet-written write-behind data and the not-yet-read read-ahead
data. With the additional hint the kernel should drop free-able
pages from the pagecache behind the current position, because we
know we will never want them again. I thought that was a very
nice way of handling things.

-Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

2007-01-12 22:49:10

by Andrew Morton

[permalink] [raw]
Subject: Re: O_DIRECT question

On Fri, 12 Jan 2007 15:35:09 -0700
Erik Andersen <[email protected]> wrote:

> On Fri Jan 12, 2007 at 05:09:09PM -0500, Linus Torvalds wrote:
> > I suspect a lot of people actually have other reasons to avoid caches.
> >
> > For example, the reason to do O_DIRECT may well not be that you want to
> > avoid caching per se, but simply because you want to limit page cache
> > activity. In which case O_DIRECT "works", but it's really the wrong thing
> > to do. We could export other ways to do what people ACTUALLY want, that
> > doesn't have the downsides.
>
> I was rather fond of the old O_STREAMING patch by Robert Love,

That was an akpm patch which I did for the Digeo kernel. Robert picked it
up to dehackify it and get it into mainline, but we ended up deciding that
posix_fadvise() was the way to go because it's standards-based.

It's a bit more work in the app to use posix_fadvise() well. But the
results will be better. The app should also use sync_file_range()
intelligently to control its pagecache use.
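A minimal sketch of the combination described here - posix_fadvise() plus
sync_file_range() to keep a big sequential write from swamping the pagecache;
the chunk size, file layout and helper name are invented for illustration:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (8 * 1024 * 1024)    /* illustrative write chunk */

/* Write `total' bytes from `buf' to `path', asking the kernel to push
 * each chunk to disk and then dropping its pages from the pagecache. */
int stream_out(const char *path, const char *buf, size_t total)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    for (size_t off = 0; off < total; off += CHUNK) {
        size_t n = total - off < CHUNK ? total - off : CHUNK;
        if (write(fd, buf + off, n) != (ssize_t)n)
            break;

        /* start writeback of the chunk we just wrote */
        sync_file_range(fd, off, n, SYNC_FILE_RANGE_WRITE);

        /* once the previous chunk is on disk, its pages are clean
         * and POSIX_FADV_DONTNEED can actually discard them */
        if (off >= CHUNK) {
            sync_file_range(fd, off - CHUNK, CHUNK,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER);
            posix_fadvise(fd, off - CHUNK, CHUNK, POSIX_FADV_DONTNEED);
        }
    }
    close(fd);
    return 0;
}

The WAIT_BEFORE|WRITE|WAIT_AFTER combination matters: POSIX_FADV_DONTNEED
cannot discard pages that are still dirty, so the range has to be written
back first.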

The problem with all of these things is that the application needs to be
changed, and people often cannot do that. If we want a general way of
stopping particular apps from swamping pagecache then it'd really need to
be an externally-imposed thing - probably via additional accounting and a
new rlimit.

2007-01-13 04:52:10

by Nick Piggin

[permalink] [raw]
Subject: Re: O_DIRECT question

Bill Davidsen wrote:

> The point is that if you want to be able to allocate at all, sometimes
> you will have to write dirty pages, garbage collect, and move or swap
> programs. The hardware is just too limited to do something less painful,
> and the user can't see memory to do things better. Linus is right,
> 'Claiming that there is a "proper solution" is usually a total red
> herring. Quite often there isn't, and the "paper over" is actually not
> papering over, it's quite possibly the best solution there is.' I think
> any solution is going to be ugly, unfortunately.

It seems quite robust and clean to me, actually. Any userspace memory
that absolutely must be in large contiguous regions has to be allocated at
boot or from a pool reserved at boot. All other allocations can be broken
into smaller ones.

Writing dirty pages, garbage collecting, and moving or swapping programs
isn't going to be robust because there is lots of vital kernel memory that
cannot be moved and will cause fragmentation.

The reclaimable zone work that went on a while ago for hugepages is
exactly how you would also fix this problem and still have a reasonable
degree of flexibility at runtime. It isn't really ugly or hard, compared
with some of the non-working "solutions" that have been proposed.

The other good thing is that the core mm already has practically
everything required, so the functionality is unintrusive.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2007-01-13 16:55:36

by Bodo Eggert

[permalink] [raw]
Subject: Re: O_DIRECT question

Linus Torvalds <[email protected]> wrote:
> On Sat, 13 Jan 2007, Michael Tokarev wrote:

>> (No, really - this load isn't entirely synthetic. It's a typical database
>> workload - random I/O all over, on a large file. If it can, it combines
>> several I/Os into one, by requesting more than a single block at a time,
>> but overall it is random.)
>
> My point is that you can get basically ALL THE SAME GOOD BEHAVIOUR without
> having all the BAD behaviour that O_DIRECT adds.
>
> For example, just the requirement that O_DIRECT can never create a file
> mapping,

That sounds sane, but the video streaming folks will be unhappy.

Maybe you could do:
reserve_space(); (*)
do_write_odirect();
update_filesize();
and only allow reads up to the current filesize?

Of course if you do ftruncate first and then write O_DIRECT, the holes will
need to be filled before the corresponding blocks are assigned to the file.
Either you'll zero them or you can insert them into the file after the write.

Races:
against other reads: May happen in any order, to-be-written pages are
beyond filesize (inaccessible), zeroed or not yet assigned to the file.
against other writes: No bad effect, since you don't unreserve
mappings, and update_filesize won't shrink the file. You must, however,
not reserve two chunks for the same location in the file unless you can
handle replacing blocks of files.
open(O_WRITE) without O_DIRECT is not allowed, therefore that can't race.
against truncate: Yes, see below

(*) This would allow fadvise_size(), too, which could reduce fragmentation
(and give an early warning on full disks) without forcing e.g. fat to
zero all blocks. OTOH, fadvise_size() would allow users to reserve the
complete disk space without their filesizes reflecting this.

> and can never interact with ftruncate

ACK, r/w semaphore, read={r,w}_odirect, write=ftruncate?

> would actually make
> O_DIRECT a lot more palatable to me. Together with just the requirement
> that an O_DIRECT open would literally disallow any non-O_DIRECT accesses,
> and flush the page cache entirely, would make all the aliases go away.

That's probably the best semantics.

Maybe you should allow O_READ for the backup people, maybe forcing
O_DIRECT|O_ALLOWDOUBLEBUFFER (doing the extra copy in the kernel).

> At that point, O_DIRECT would be a way of saying "we're going to do
> uncached accesses to this pre-allocated file". Which is a half-way
> sensible thing to do.

And I'd bet nobody would notice these changes unless they try inherently
stupid things.

> But what O_DIRECT does right now is _not_ really sensible, and the
> O_DIRECT propeller-heads seem to have some problem even admitting that
> there _is_ a problem, because they don't care.

It's a hammer - having it will make anything look like a nail,
and there is nothing wrong with hammering a nail!!! .-)

> A lot of DB people seem to simply not care about security or anything
> else. I'm trying to tell you that quoting numbers is
> pointless, when simply the CORRECTNESS of O_DIRECT is very much in doubt.

The only thing you'll need for a correct database behaviour is:
If one process has completed its write and the next process opens that
file, it must read the current contents.

Races with normal reads and writes, races with truncate - don't do that then.
You wouldn't expect "cat somefile > database.dat" on a running db to be a
good thing either, no matter whether O_DIRECT is used or not.
--
Funny quotes:
3. On the other hand, you have different fingers.

Friß, Spammer: [email protected]

2007-01-13 19:29:08

by Bill Davidsen

[permalink] [raw]
Subject: Re: O_DIRECT question

Bodo Eggert wrote:

> (*) This would allow fadvise_size(), too, which could reduce fragmentation
> (and give an early warning on full disks) without forcing e.g. fat to
> zero all blocks. OTOH, fadvise_size() would allow users to reserve the
> complete disk space without their filesizes reflecting this.

Please clarify how this would interact with quota, and why it wouldn't
allow someone to run me out of disk.

--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979

2007-01-13 20:06:43

by Bill Davidsen

[permalink] [raw]
Subject: Re: O_DIRECT question

Linus Torvalds wrote:
>
> On Sat, 13 Jan 2007, Michael Tokarev wrote:
>> (No, really - this load isn't entirely synthetic. It's a typical database
>> workload - random I/O all over, on a large file. If it can, it combines
>> several I/Os into one, by requesting more than a single block at a time,
>> but overall it is random.)
>
> My point is that you can get basically ALL THE SAME GOOD BEHAVIOUR without
> having all the BAD behaviour that O_DIRECT adds.
>
> For example, just the requirement that O_DIRECT can never create a file
> mapping, and can never interact with ftruncate would actually make
> O_DIRECT a lot more palatable to me. Together with just the requirement
> that an O_DIRECT open would literally disallow any non-O_DIRECT accesses,
> and flush the page cache entirely, would make all the aliases go away.
>
> At that point, O_DIRECT would be a way of saying "we're going to do
> uncached accesses to this pre-allocated file". Which is a half-way
> sensible thing to do.

But it's not necessary, it would break existing programs, and it would be
incompatible with other OSes like AIX, BSD, and Solaris. And it doesn't
preserve the legitimate use of O_DIRECT for avoiding cache pollution when
writing a LARGE file.
>
> But what O_DIRECT does right now is _not_ really sensible, and the
> O_DIRECT propeller-heads seem to have some problem even admitting that
> there _is_ a problem, because they don't care.

You say that as if it were a failing. Currently if you mix access via
O_DIRECT and non-DIRECT you can get unexpected results. You can screw
yourself, mangle your data, or have no problems at all if you avoid
trying to access the same bytes in multiple ways. There are lots of ways
to get or write stale data, not all of which involve O_DIRECT in any way, and the
people actually using O_DIRECT now are managing very well.

I don't regard it as a system failing that I am allowed to shoot myself
in the foot, it's one of the benefits of Linux over Windows. Using
O_DIRECT now is like being your own lawyer, room for both creativity and
serious error. But what's there appears portable, which is important as
well.

I do have one thought, WRT reading uninitialized disk data. I would hope
that sparse files are handled right, and that when doing a write with
O_DIRECT the metadata is not updated until the write is done.
>
> A lot of DB people seem to simply not care about security or anything
> else. I'm trying to tell you that quoting numbers is
> pointless, when simply the CORRECTNESS of O_DIRECT is very much in doubt.

The guiding POSIX standard appears dead, and major DB programs which
work on Linux run on AIX, Solaris, and BSD. That sounds like a good
level of compatibility. I'm not sure what more correctness you would
want beyond a proposed standard and common practice. It's tricky to use,
like many other neat features.

I confess I have abused O_DIRECT by opening a file with O_DIRECT,
fdopen()ing it for C, supplying my own large aligned buffer, and using
that with an otherwise unmodified large program which uses fprintf().
That worked on all of the major UNIX variants as well.

--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979

2007-01-13 20:27:28

by Michael Tokarev

[permalink] [raw]
Subject: Re: O_DIRECT question

Bill Davidsen wrote:
> Linus Torvalds wrote:
>>
[]
>> But what O_DIRECT does right now is _not_ really sensible, and the
>> O_DIRECT propeller-heads seem to have some problem even admitting that
>> there _is_ a problem, because they don't care.
>
> You say that as if it were a failing. Currently if you mix access via
> O_DIRECT and non-DIRECT you can get unexpected results. You can screw
> yourself, mangle your data, or have no problems at all if you avoid
> trying to access the same bytes in multiple ways. There are lots of ways
> to get or write stale data, not all involve O_DIRECT in any way, and the
> people actually using O_DIRECT now are managing very well.
>
> I don't regard it as a system failing that I am allowed to shoot myself
> in the foot, it's one of the benefits of Linux over Windows. Using
> O_DIRECT now is like being your own lawyer, room for both creativity and
> serious error. But what's there appears portable, which is important as
> well.

If I got it right (and please someone tell me if I *really* got it right!),
the problem is elsewhere.

Suppose you have a filesystem, not at all related to databases and stuff.
Your usual root filesystem, with your /etc/ /var and so on directories.

Some time ago you edited /etc/shadow, updating it by writing a new file and
renaming it to the proper place. So you have the old content of your shadow
file (now deleted) somewhere on the disk, but not accessible from the
filesystem.

Now, a bad guy deliberately opens some file on this filesystem using the
O_DIRECT flag, ftruncate()s it to some huge size (or does seek+write), and
at the same time tries to do an O_DIRECT read of the data.

Due to all the races etc, it is possible for him to read the old content of
the /etc/shadow file you deleted before.

> I do have one thought, WRT reading uninitialized disk data. I would hope
> that sparse files are handled right, and that when doing a write with
> O_DIRECT the metadata is not updated until the write is done.

"hope that sparse files are handled right" is a high hope. Exactly because
this very place IS racy.

Again, *IF* I got it correctly.

/mjt

2007-01-14 09:11:13

by Nate Diller

[permalink] [raw]
Subject: Re: O_DIRECT question

On 1/12/07, Andrew Morton <[email protected]> wrote:
> On Fri, 12 Jan 2007 15:35:09 -0700
> Erik Andersen <[email protected]> wrote:
>
> > On Fri Jan 12, 2007 at 05:09:09PM -0500, Linus Torvalds wrote:
> > > I suspect a lot of people actually have other reasons to avoid caches.
> > >
> > > For example, the reason to do O_DIRECT may well not be that you want to
> > > avoid caching per se, but simply because you want to limit page cache
> > > activity. In which case O_DIRECT "works", but it's really the wrong thing
> > > to do. We could export other ways to do what people ACTUALLY want, that
> > > doesn't have the downsides.
> >
> > I was rather fond of the old O_STREAMING patch by Robert Love,
>
> That was an akpmpatch whcih I did for the Digeo kernel. Robert picked it
> up to dehackify it and get it into mainline, but we ended up deciding that
> posix_fadvise() was the way to go because it's standards-based.
>
> It's a bit more work in the app to use posix_fadvise() well. But the
> results will be better. The app should also use sync_file_range()
> intelligently to control its pagecache use.

and there's an interesting note that I should add here, because there's
a downside to using fadvise() instead of O_STREAM when the programmer
is not careful. I spent at least a month doing some complex blktrace
analysis to try to figure out why Digeo's new platform (which used the
fadvise() call) didn't have the kind of streaming performance that it
should have. One symptom I found was that even on the media partition
where I/O should have always been happening in nice 512K chunks
(ra_pages == 128), it seemed to be happening in random values between
32K and 512K. It turns out that the code pulls in some size chunk,
maybe 32K, then does an fadvise DONTNEED on the fd, *with zero offset
and zero length*, meaning that it wipes out *all* the pagecache for
the file. That means that the rest of the 512K from the readahead
would get discarded before it got used, and later the remaining pages
in the ra window would get faulted in again.
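For contrast, a minimal sketch (not Digeo's code) of the careful pattern:
drop only the range that has actually been consumed, never the (0, 0)
"whole file" form; the chunk size and function name are illustrative:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (512 * 1024)    /* matches a 128-page readahead window */

/* Read fd to EOF, discarding pagecache only behind the current position
 * so the readahead in front of it is left alone. */
ssize_t consume_file(int fd)
{
    static char buf[CHUNK];
    off_t done = 0;
    ssize_t n;

    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* ... process buf[0..n) here ... */
        done += n;
        posix_fadvise(fd, 0, done, POSIX_FADV_DONTNEED);
    }
    return n;
}

The key difference from the buggy call is the explicit (offset, len) pair
instead of (0, 0).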

Most applications don't get the kind of performance analysis that
Digeo was doing, and even then, it's rather lucky that we caught that.
So I personally think it'd be best for libc or something to simulate
the O_STREAM behavior if you ask for it. That would simplify things
for the most common case, and have the side benefit of reducing the
amount of extra code an application would need in order to take
advantage of that feature.

NATE

2007-01-14 15:39:41

by Bill Davidsen

[permalink] [raw]
Subject: Re: O_DIRECT question

Michael Tokarev wrote:
> Bill Davidsen wrote:

> If I got it right (and please someone tell me if I *really* got it right!),
> the problem is elsewhere.
>
> Suppose you have a filesystem, not at all related to databases and stuff.
> Your usual root filesystem, with your /etc/ /var and so on directories.
>
> Some time ago you edited /etc/shadow, updating it by writing new file and
> renaming it to proper place. So you have that old content of your shadow
> file (now deleted) somewhere on the disk, but not accessible from the
> filesystem.
>
> Now, a bad guy deliberately tries to open some file on this filesystem, using
> O_DIRECT flag, ftruncates() it to some huge size (or does seek+write), and
> at the same time tries to use O_DIRECT read of the data.

Which should be identified and zeros returned. Consider: I open a file
for database use, and legitimately seek to a location out at, say,
250MB, and then write at the location my hash says I should. That's all
legitimate. Now when some backup program accesses the file sequentially,
it gets a boatload of zeros, because Linux "knows" that it is sparse data.
Yes, the backup program should detect this as well, so what?

My point is that there is code to handle sparse data now, without
O_DIRECT involved, and if O_DIRECT bypasses that, it's not a problem
with the idea of O_DIRECT, the kernel has a security problem.
>
> Due to all the races etc, it is possible for him to read that old content of
> /etc/shadow file you've deleted before.
>
>> I do have one thought, WRT reading uninitialized disk data. I would hope
>> that sparse files are handled right, and that when doing a write with
>> O_DIRECT the metadata is not updated until the write is done.
>
> "hope that sparse files are handled right" is a high hope. Exactly because
> this very place IS racy.

Other than assuring that a program can't read where no program has
written, I don't see a problem. Anyone accessing the same file with
multiple processes had better be doing user space coordination, and gets
no sympathy from me if they don't. In this case, "works right" does not
mean "works as expected," because the program has no right to assume the
kernel will sort out poor implementations.

Without O_DIRECT the problem of doing ordered i/o in user space becomes
very difficult, if not impossible, so "get rid of O_DIRECT" is the wrong
direction. When the program can be sure the i/o is done, then cleverness
in user space can see that it's done RIGHT.

--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979

2007-01-14 18:56:41

by Bodo Eggert

[permalink] [raw]
Subject: Re: O_DIRECT question

On Sat, 13 Jan 2007, Bill Davidsen wrote:

> Bodo Eggert wrote:
>
> > (*) This would allow fadvise_size(), too, which could reduce fragmentation
> > (and give an early warning on full disks) without forcing e.g. fat to
> > zero all blocks. OTOH, fadvise_size() would allow users to reserve the
> > complete disk space without their filesizes reflecting this.
>
> Please clarify how this would interact with quota, and why it wouldn't
> allow someone to run me out of disk.

I fell into the "write-will-never-fail"-pit. Therefore I have to talk
about the original purpose, write with O_DIRECT, too.

- Reserved blocks should be taken out of the quota, since they are about
to be written right now. If you emptied your quota doing this, it's
your fault. If it was the group's quota, just run fast enough.-)

- If one write failed that extended the reserved range, the reserved area
should be shrunk again. Obviously you'll need something clever here.
* You can't shrink carelessly while there are O_DIRECT writes.
* You can't just try to grab the semaphore[0] for writing, this will
deadlock with other write()s.
* If you drop the read lock, it will work out, because you aren't
writing anymore, and if you get the write lock, there won't be anybody
else writing. Therefore you can clear the reservation for the not-
written blocks. You may unreserve blocks that should stay reserved,
but that won't harm much. At worst, you'll get fragmentation, loss
of speed and an aborted (because of no free space) write command.
Document this, it's a feature.-)

- If you fadvise_size on a non-quota-disk, you can possibly reserve it
completely, without being the easy-to-spot offender. You can do the
same by actually writing these files, keeping them open and unlinking
them. The new quality is: You can't just look at the file sizes in
/proc in order to spot the offender. However, if you reflect the
reserved blocks in the used-blocks-field of struct stat, du will
work as expected and the BOFH will know whom to LART.

BTW: If the fs supports holes, using du would be the right thing
to do anyway.


BTW2: I don't know if reserving without actually assigning blocks is
supported or easy to support at all. These reservations are the result of
"These blocks are not yet written, therefore they contain possibly secret
data that would leak on failed writes, therefore they may not be actually
assigned to the file before write finishes. They may not be on the free
list either. And hey, if we support pre-reserving blocks to the file, we
may additionally use it for fadvise_size. I'll mention that briefly."




[0] r/w semaphore, read={r,w}_odirect, write=ftruncate

--
Fun things to slip into your budget
Paradigm pro-activator (a whole pack)
(you mean beer?)

2007-01-14 19:40:30

by Bodo Eggert

[permalink] [raw]
Subject: Re: O_DIRECT question

Bill Davidsen <[email protected]> wrote:

> My point is, that there is code to handle sparse data now, without
> O_DIRECT involved, and if O_DIRECT bypasses that, it's not a problem
> with the idea of O_DIRECT, the kernel has a security problem.

The idea of O_DIRECT is to bypass the pagecache, and the pagecache is what
provides the security against reading someone else's data using sparse
files or partial-block-IO.

2007-01-15 12:13:40

by Helge Hafting

[permalink] [raw]
Subject: Re: O_DIRECT question

Michael Tokarev wrote:
> Chris Mason wrote:
> []
>
>> I recently spent some time trying to integrate O_DIRECT locking with
>> page cache locking. The basic theory is that instead of using
>> semaphores for solving O_DIRECT vs buffered races, you put something
>> into the radix tree (I call it a placeholder) to keep the page cache
>> users out, and lock any existing pages that are present.
>>
>
> But seriously - what about just disallowing non-O_DIRECT opens together
> with O_DIRECT ones ?
>
Please do not create a new local DOS attack.
I open some important file, say /etc/resolv.conf
with O_DIRECT and just sit on the open handle.
Now nobody else can open that file because
it is "busy" with O_DIRECT ?

Helge Hafting

2007-01-16 03:45:24

by Jörn Engel

[permalink] [raw]
Subject: Re: O_DIRECT question

On Fri, 12 January 2007 00:19:45 +0800, Aubrey wrote:
>
> Yes for desktop and server, but maybe not for embedded systems, especially
> for no-mmu Linux. In many embedded system cases, the whole system is
> running in RAM, including the file system. So it's not necessary to use the
> page cache anymore. The page cache can't improve performance in these
> cases; it only fragments memory.

You were not very specific, so I have to guess that you're referring to
the problem of having two copies of the same file in RAM - one in the
page cache and one in the "backing store", which is just RAM.

There are two solutions to this problem. One is tmpfs, which doesn't
use a backing store and keeps all data in the page cache. The other is
xip, which doesn't use the page cache and goes directly to backing
store. Unlike O_DIRECT, xip only works with a RAM or de-facto RAM
backing store (NOR flash works read-only).

So if you really care about memory waste in embedded systems, you should
have a look at mm/filemap_xip.c and continue Carsten Otte's work.

Jörn

--
Fantasy is more important than knowledge. Knowledge is limited,
while fantasy embraces the whole world.
-- Albert Einstein

2007-01-16 20:32:46

by Bodo Eggert

[permalink] [raw]
Subject: Re: O_DIRECT question

Helge Hafting <[email protected]> wrote:
> Michael Tokarev wrote:

>> But seriously - what about just disallowing non-O_DIRECT opens together
>> with O_DIRECT ones ?
>>
> Please do not create a new local DOS attack.
> I open some important file, say /etc/resolv.conf
> with O_DIRECT and just sit on the open handle.
> Now nobody else can open that file because
> it is "busy" with O_DIRECT ?

Suspend O_DIRECT access while non-O_DIRECT-fds are open, fdatasync on close?
--
"Unix policy is to not stop root from doing stupid things because
that would also stop him from doing clever things." - Andi Kleen

"It's such a fine line between stupid and clever" - Derek Smalls

2007-01-17 04:29:31

by Aubrey Li

[permalink] [raw]
Subject: Re: O_DIRECT question

On 1/12/07, Linus Torvalds <[email protected]> wrote:
>
>
> On Thu, 11 Jan 2007, Roy Huang wrote:
> >
> > On a embedded systerm, limiting page cache can relieve memory
> > fragmentation. There is a patch against 2.6.19, which limit every
> > opened file page cache and total pagecache. When the limit reach, it
> > will release the page cache overrun the limit.
>
> I do think that something like this is probably a good idea, even on
> non-embedded setups. We historically couldn't do this, because mapped
> pages were too damn hard to remove, but that's obviously not much of a
> problem any more.
>
> However, the page-cache limit should NOT be some compile-time constant. It
> should work the same way the "dirty page" limit works, and probably just
> default to "feel free to use 90% of memory for page cache".
>
> Linus
>

The attached patch limit the page cache by a simple way:

1) If memory is requested for the page cache, set a flag to mark this kind
of allocation:

static inline struct page *page_cache_alloc(struct address_space *x)
{
- return __page_cache_alloc(mapping_gfp_mask(x));
+ return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_PAGECACHE);
}

2) Have zone_watermark_ok enforce this limit:

+ if (alloc_flags & ALLOC_PAGECACHE){
+ min = min + VFS_CACHE_LIMIT;
+ }
+
if (free_pages <= min + z->lowmem_reserve[classzone_idx])
return 0;

3) So, when __alloc_pages is called for the page cache, pass
ALLOC_PAGECACHE into get_page_from_freelist to trigger the pagecache
limit branch in zone_watermark_ok.

This approach works on my side, I'll make a new patch to make the
limit tunable in the proc fs soon.

The following is the patch:
=====================================================
Index: mm/page_alloc.c
===================================================================
--- mm/page_alloc.c (revision 2645)
+++ mm/page_alloc.c (working copy)
@@ -892,6 +892,9 @@ failed:
#define ALLOC_HARDER 0x10 /* try to alloc harder */
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
+#define ALLOC_PAGECACHE 0x80 /* __GFP_PAGECACHE set */
+
+#define VFS_CACHE_LIMIT 0x400 /* limit VFS cache page */

/*
* Return 1 if free pages are above 'mark'. This takes into account the order
@@ -910,6 +913,10 @@ int zone_watermark_ok(struct zone *z, in
if (alloc_flags & ALLOC_HARDER)
min -= min / 4;

+ if (alloc_flags & ALLOC_PAGECACHE){
+ min = min + VFS_CACHE_LIMIT;
+ }
+
if (free_pages <= min + z->lowmem_reserve[classzone_idx])
return 0;
for (o = 0; o < order; o++) {
@@ -1000,8 +1007,12 @@ restart:
return NULL;
}

- page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
- zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+ if (gfp_mask & __GFP_PAGECACHE)
+ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+ zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_PAGECACHE);
+ else
+ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+ zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET);
if (page)
goto got_pg;

@@ -1027,6 +1038,9 @@ restart:
if (wait)
alloc_flags |= ALLOC_CPUSET;

+ if (gfp_mask & __GFP_PAGECACHE)
+ alloc_flags |= ALLOC_PAGECACHE;
+
/*
* Go through the zonelist again. Let __GFP_HIGH and allocations
* coming from realtime tasks go deeper into reserves.
Index: include/linux/gfp.h
===================================================================
--- include/linux/gfp.h (revision 2645)
+++ include/linux/gfp.h (working copy)
@@ -46,6 +46,7 @@ struct vm_area_struct;
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
#define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
+#define __GFP_PAGECACHE ((__force gfp_t)0x80000u) /* Is page cache allocation ? */

#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
Index: include/linux/pagemap.h
===================================================================
--- include/linux/pagemap.h (revision 2645)
+++ include/linux/pagemap.h (working copy)
@@ -62,7 +62,7 @@ static inline struct page *__page_cache_

static inline struct page *page_cache_alloc(struct address_space *x)
{
- return __page_cache_alloc(mapping_gfp_mask(x));
+ return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_PAGECACHE);
}

static inline struct page *page_cache_alloc_cold(struct address_space *x)
=====================================================

Welcome any comments and suggestions,

Thanks,
-Aubrey


Attachments:
(No filename) (4.58 kB)
vfscache.diff (2.75 kB)

2007-01-17 05:58:42

by Arjan van de Ven

[permalink] [raw]
Subject: Re: O_DIRECT question

On Tue, 2007-01-16 at 21:26 +0100, Bodo Eggert wrote:
> Helge Hafting <[email protected]> wrote:
> > Michael Tokarev wrote:
>
> >> But seriously - what about just disallowing non-O_DIRECT opens together
> >> with O_DIRECT ones ?
> >>
> > Please do not create a new local DOS attack.
> > I open some important file, say /etc/resolv.conf
> > with O_DIRECT and just sit on the open handle.
> > Now nobody else can open that file because
> > it is "busy" with O_DIRECT ?
>
> Suspend O_DIRECT access while non-O_DIRECT-fds are open, fdatasync on close?

.. then any user can impact the operation, performance and reliability
of the database application of another user... sounds like plugging one
hole by making a bigger hole ;)


2007-01-17 14:42:31

by Alex Tomas

[permalink] [raw]
Subject: Re: O_DIRECT question


I think one problem with mmap/msync is that they can't maintain
i_size atomically like a regular write does. So one needs to
implement one's own i_size management in userspace.

thanks, Alex

> Side note: the only reason O_DIRECT exists is because database people are
> too used to it, because other OS's haven't had enough taste to tell them
> to do it right, so they've historically hacked their OS to get out of the
> way.

> As a result, our madvise and/or posix_fadvise interfaces may not be all
> that strong, because people sadly don't use them that much. It's a sad
> example of a totally broken interface (O_DIRECT) resulting in better
> interfaces not getting used, and then not getting as much development
> effort put into them.

> So O_DIRECT not only is a total disaster from a design standpoint (just
> look at all the crap it results in), it also indirectly has hurt better
> interfaces. For example, POSIX_FADV_NOREUSE (which _could_ be a useful and
> clean interface to make sure we don't pollute memory unnecessarily with
> cached pages after they are all done) ends up being a no-op ;/

> Sad. And it's one of those self-fulfilling prophecies. Still, I hope some
> day we can just rip the damn disaster out.

2007-01-17 22:44:14

by Bodo Eggert

[permalink] [raw]
Subject: Re: O_DIRECT question

On Tue, 16 Jan 2007, Arjan van de Ven wrote:
> On Tue, 2007-01-16 at 21:26 +0100, Bodo Eggert wrote:
> > Helge Hafting <[email protected]> wrote:
> > > Michael Tokarev wrote:

> > >> But seriously - what about just disallowing non-O_DIRECT opens together
> > >> with O_DIRECT ones ?
> > >>
> > > Please do not create a new local DOS attack.
> > > I open some important file, say /etc/resolv.conf
> > > with O_DIRECT and just sit on the open handle.
> > > Now nobody else can open that file because
> > > it is "busy" with O_DIRECT ?
> >
> > Suspend O_DIRECT access while non-O_DIRECT-fds are open, fdatasync on close?
>
> .. then any user can impact the operation, performance and reliability
> of the database application of another user... sounds like plugging one
> hole by making a bigger hole ;)

Don't allow other users to access your raw database files then, and if
backup kicks in, pausing the database would DTRT for the integrity of the
backup. For other applications, paused O_DIRECT may very well be a
problem, but I can't think of one right now.

--
Logic: The art of being wrong with confidence...

2007-01-20 16:21:09

by Denys Vlasenko

[permalink] [raw]
Subject: Re: O_DIRECT question

On Thursday 11 January 2007 16:50, Linus Torvalds wrote:
>
> On Thu, 11 Jan 2007, Nick Piggin wrote:
> >
> > Speaking of which, why did we obsolete raw devices? And/or why not just
> > go with a minimal O_DIRECT on block device support? Not a rhetorical
> > question -- I wasn't involved in the discussions when they happened, so
> > I would be interested.
>
> Lots of people want to put their databases in a file. Partitions really
> weren't nearly flexible enough. So the whole raw device or O_DIRECT just
> to the block device thing isn't really helping any.
>
> > O_DIRECT is still crazily racy versus pagecache operations.
>
> Yes. O_DIRECT is really fundamentally broken. There's just no way to fix
> it sanely. Except by teaching people not to use it, and making the normal
> paths fast enough (and that _includes_ doing things like dropping caches
> more aggressively, but it probably would include more work on the device
> queue merging stuff etc etc).

What will happen if we just make open ignore O_DIRECT? ;)

And then anyone who feels sad about it is advised to do it
as described here:

http://lkml.org/lkml/2002/5/11/58
--
vda

2007-01-20 16:38:21

by Denys Vlasenko

[permalink] [raw]
Subject: Re: O_DIRECT question

On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
> example, which isn't quite possible now from userspace. But as long as
> O_DIRECT actually writes data before returning from write() call (as it
> seems to be the case at least with a normal filesystem on a real block
> device - I don't touch corner cases like nfs here), it's pretty much
> THE ideal solution, at least from the application (developer) standpoint.

Why do you want to wait while 100 megs of data are being written?
You _have to_ have threaded db code in order to not waste
gobs of CPU time on UP + even with that you eat context switch
penalty anyway.

I hope you agree that threaded code is not ideal performance-wise
- async IO is better. O_DIRECT is strictly sync IO.
--
vda

2007-01-20 16:47:35

by Denys Vlasenko

[permalink] [raw]
Subject: Re: O_DIRECT question

On Sunday 14 January 2007 10:11, Nate Diller wrote:
> On 1/12/07, Andrew Morton <[email protected]> wrote:
> Most applications don't get the kind of performance analysis that
> Digeo was doing, and even then, it's rather lucky that we caught that.
> So I personally think it'd be best for libc or something to simulate
> the O_STREAM behavior if you ask for it. That would simplify things
> for the most common case, and have the side benefit of reducing the
> amount of extra code an application would need in order to take
> advantage of that feature.

Sounds like you are saying that making O_DIRECT really mean
O_STREAM will work for everybody (including db people,
except that they will moan a lot about "it isn't _real_ O_DIRECT!!!
Linux suxxx"). I don't care about that.
--
vda

2007-01-20 20:55:32

by Michael Tokarev

[permalink] [raw]
Subject: Re: O_DIRECT question

Denis Vlasenko wrote:
> On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
>> example, which isn't quite possible now from userspace. But as long as
>> O_DIRECT actually writes data before returning from write() call (as it
>> seems to be the case at least with a normal filesystem on a real block
>> device - I don't touch corner cases like nfs here), it's pretty much
>> THE ideal solution, at least from the application (developer) standpoint.
>
> Why do you want to wait while 100 megs of data are being written?
> You _have to_ have threaded db code in order to not waste
> gobs of CPU time on UP + even with that you eat context switch
> penalty anyway.

Usually it's done using aio ;)

It's not that simple really.

For reads, you have to wait for the data anyway before doing something
with it. Omitting reads for now.

For writes, it's not that problematic - even 10-15 threads is nothing
compared with the I/O (O in this case) itself -- that context switch
penalty.

> I hope you agree that threaded code is not ideal performance-wise
> - async IO is better. O_DIRECT is strictly sync IO.

Hmm.. Now I'm confused.

For example, oracle uses aio + O_DIRECT. It seems to be working... ;)
As an alternative, there are multiple single-threaded db_writer processes.
Why do you say O_DIRECT is strictly sync?
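As an aside, a minimal sketch of that aio + O_DIRECT combination using libaio
(io_setup/io_prep_pwrite/io_submit/io_getevents, link with -laio); the
alignment, sizes and file name are illustrative, error handling is trimmed,
and this is not meant as a statement about how any particular database does
it:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <libaio.h>

/* Submit one O_DIRECT write asynchronously, do other work, then reap it.
 * `len' and `offset' are assumed to meet the O_DIRECT alignment rules. */
int direct_async_write(const char *path, off_t offset, size_t len)
{
    void *buf;
    if (posix_memalign(&buf, 512, len))    /* O_DIRECT needs aligned memory */
        return -1;

    int fd = open(path, O_WRONLY | O_DIRECT);
    if (fd < 0)
        return -1;

    io_context_t ctx = 0;
    if (io_setup(8, &ctx) < 0)
        return -1;

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pwrite(&cb, fd, buf, len, offset);

    if (io_submit(ctx, 1, cbs) != 1)    /* queues the I/O, does not wait */
        return -1;

    /* ... do useful work here while the write is in flight ... */

    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, NULL);    /* reap the completion */

    io_destroy(ctx);
    close(fd);
    free(buf);
    return (long)ev.res == (long)len ? 0 : -1;
}

io_submit() returns as soon as the request is queued, which is the sense in
which O_DIRECT need not be "strictly sync" when combined with aio.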

In either case - I provided some real numbers in this thread before.
Yes, O_DIRECT has its problems, even security problems. But the thing
is - it is working, and working WAY better - from the performance point
of view - than "indirect" I/O, and currently there's no alternative that
works as good as O_DIRECT.

Thanks.

/mjt

2007-01-20 23:07:31

by Denys Vlasenko

[permalink] [raw]
Subject: Re: O_DIRECT question

On Saturday 20 January 2007 21:55, Michael Tokarev wrote:
> Denis Vlasenko wrote:
> > On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
> >> example, which isn't quite possible now from userspace. But as long as
> >> O_DIRECT actually writes data before returning from write() call (as it
> >> seems to be the case at least with a normal filesystem on a real block
> >> device - I don't touch corner cases like nfs here), it's pretty much
> >> THE ideal solution, at least from the application (developer) standpoint.
> >
> > Why do you want to wait while 100 megs of data are being written?
> > You _have to_ have threaded db code in order to not waste
> > gobs of CPU time on UP + even with that you eat context switch
> > penalty anyway.
>
> Usually it's done using aio ;)
>
> It's not that simple really.
>
> For reads, you have to wait for the data anyway before doing something
> with it. Omiting reads for now.

Really? All 100 megs _at once_? Linus described fairly simple (conceptually)
idea here: http://lkml.org/lkml/2002/5/11/58
In short, page-aligned read buffer can be just unmapped,
with page fault handler catching accesses to yet-unread data.
As data comes from disk, it gets mapped back in process'
address space.

This way read() returns almost immediately and CPU is free to do
something useful.

> For writes, it's not that problematic - even 10-15 threads is nothing
> compared with the I/O (O in this case) itself -- that context switch
> penalty.

Well, if you have some CPU intensive thing to do (e.g. sort),
why not benefit from lack of extra context switch?
Assume that we have "clever writes" like Linus described.

/* something like "caching i/o over this fd is mostly useless" */
/* (looks like this API is easier to transition to
* than fadvise etc. - it's "looks like" O_DIRECT) */
fd = open(..., flags|O_STREAM);
...
/* Starts writeout immediately due to O_STREAM,
* marks buf100meg's pages R/O to catch modifications,
* but doesn't block! */
write(fd, buf100meg, 100*1024*1024);
/* We are free to do something useful in parallel */
sort();

> > I hope you agree that threaded code is not ideal performance-wise
> > - async IO is better. O_DIRECT is strictly sync IO.
>
> Hmm.. Now I'm confused.
>
> For example, oracle uses aio + O_DIRECT. It seems to be working... ;)
> As an alternative, there are multiple single-threaded db_writer processes.
> Why do you say O_DIRECT is strictly sync?

I mean that O_DIRECT write() blocks until I/O really is done.
Normal write can block for much less, or not at all.

> In either case - I provided some real numbers in this thread before.
> Yes, O_DIRECT has its problems, even security problems. But the thing
> is - it is working, and working WAY better - from the performance point
> of view - than "indirect" I/O, and currently there's no alternative that
> works as good as O_DIRECT.

Why we bothered to write Linux at all?
There were other Unixes which worked ok.
--
vda

2007-01-21 12:09:58

by Michael Tokarev

[permalink] [raw]
Subject: Re: O_DIRECT question

Denis Vlasenko wrote:
> On Saturday 20 January 2007 21:55, Michael Tokarev wrote:
>> Denis Vlasenko wrote:
>>> On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
>>>> example, which isn't quite possible now from userspace. But as long as
>>>> O_DIRECT actually writes data before returning from write() call (as it
>>>> seems to be the case at least with a normal filesystem on a real block
>>>> device - I don't touch corner cases like nfs here), it's pretty much
>>>> THE ideal solution, at least from the application (developer) standpoint.
>>> Why do you want to wait while 100 megs of data are being written?
>>> You _have to_ have threaded db code in order to not waste
>>> gobs of CPU time on UP + even with that you eat context switch
>>> penalty anyway.
>> Usually it's done using aio ;)
>>
>> It's not that simple really.
>>
>> For reads, you have to wait for the data anyway before doing something
>> with it. Omiting reads for now.
>
> Really? All 100 megs _at once_? Linus described fairly simple (conceptually)
> idea here: http://lkml.org/lkml/2002/5/11/58
> In short, page-aligned read buffer can be just unmapped,
> with page fault handler catching accesses to yet-unread data.
> As data comes from disk, it gets mapped back in process'
> address space.

> This way read() returns almost immediately and CPU is free to do
> something useful.

And what does the application do during that page fault? Wait for the read
to actually complete? How is it different from a regular (direct or not)
read?

Well, it IS different: now we can't predict *when* exactly we'll sleep waiting
for the read to complete. And also, now we're in an unknown corner case when
an I/O error occurs, too (I/O errors interact badly with things like mmap, and
this looks more like mmap than like an actual read).

Yes, this way we'll fix the problems in the current O_DIRECT way of doing things -
all those races and design stupidity etc. Yes it may work, provided those
"corner cases" like I/O errors problems will be fixed. And yes, sometimes
it's not really that interesting to know when exactly we'll sleep actually
waiting for the I/O - during read or during some memory access...

Now I wonder what it should look like from an application's standpoint. It
has its "smart" cache. A worker thread (process in the case of oracle - there's
a very good reason why they don't use threads, and this architecture saved
our data several times already - but that's an entirely different topic and
not really relevant here) -- so, a worker process which executes requests
coming from a user application wants to have (read) access to a db block
(usually 8Kb in size, but can be 4..32Kb - definitely not 100megs), where
the requested data is located. It checks whether this block is in the cache,
and if it's not, it is read from the disk and added to the cache.
The cache resides in a shared memory (so that other processes will be able
to access it too).
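As a hedged sketch only, that lookup/read/insert path might look roughly like
this, with cache_lookup() and cache_insert() as made-up stand-ins for the
shared-memory cache, which is assumed to hand out 512-byte-aligned 8Kb slots
so that an O_DIRECT pread() into them is legal:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define DB_BLOCK 8192

/* Hypothetical shared-memory cache helpers (not a real API). */
extern void *cache_lookup(unsigned long blkno);
extern void *cache_insert(unsigned long blkno);    /* reserves an aligned slot */

/* Return a pointer to the cached block, reading it in on a miss. */
void *get_block(int fd, unsigned long blkno)
{
    void *blk = cache_lookup(blkno);
    if (blk)
        return blk;                    /* cache hit: no I/O at all */

    blk = cache_insert(blkno);
    if (pread(fd, blk, DB_BLOCK, (off_t)blkno * DB_BLOCK) != DB_BLOCK)
        return NULL;                   /* error handling elided */
    return blk;
}

Everything interesting - locking, eviction, the shared-memory layout - is
hidden behind the two hypothetical helpers.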

With the proposed solution, it looks even better - that `read()' operation
returns immediately, so all other processes which want the same page
at the same time will start "using" it immediately. Provided they all can
access the memory.

This is how a (large) index access or table-access-by-rowid (after index lookup
for example) is done - requesting usually just a single block in some random
place of a file.

There's another access pattern - like full table scans, where a lot of data
is being read sequentially. It's done in chunks, say, 64 blocks (8Kb each)
at a time. We read a chunk of data, do something with it, and discard it
(caching it isn't a very good idea). For this access pattern, the proposal
should work fairly well. Except for the I/O error handling, maybe.

By the way - the *whole* cache thing may be implemented in the application
*using the in-kernel page cache*, with clever usage of mmap() and friends.
Provided the whole database fits into an address space, or something like
that ;)

>> For writes, it's not that problematic - even 10-15 threads is nothing
>> compared with the I/O (O in this case) itself -- that context switch
>> penalty.
>
> Well, if you have some CPU intensive thing to do (e.g. sort),
> why not benefit from lack of extra context switch?

There may be other reasons to "want" those extra context switches.
I mentioned above that oracle doesn't use threads, but processes.
I don't know why exactly it's done this way, but I know how it saved
our data. The short answer is this: bugs ;) A process doing something
with the data and generating write requests to the db goes crazy - some
memory corruption, doing some bad things... But that process does not
do any writes directly - instead, it generates those write requests
in shared memory, and ANOTHER process actually does the writing. AND
verifies that the requests actually look sane. And detects the "bad"
writes, and immediately prevents data corruption. That other (dbwr)
process does much simpler things, and has its own address space which
isn't accessible by that crazy one.

> Assume that we have "clever writes" like Linus described.
>
> /* something like "caching i/o over this fd is mostly useless" */
> /* (looks like this API is easier to transition to
> * than fadvise etc. - it's "looks like" O_DIRECT) */
> fd = open(..., flags|O_STREAM);
> ...
> /* Starts writeout immediately due to O_STREAM,
> * marks buf100meg's pages R/O to catch modifications,
> * but doesn't block! */
> write(fd, buf100meg, 100*1024*1024);

And how do we know when the write completes?

> /* We are free to do something useful in parallel */
> sort();

.. which is done in another process, already started.

>>> I hope you agree that threaded code is not ideal performance-wise
>>> - async IO is better. O_DIRECT is strictly sync IO.
>> Hmm.. Now I'm confused.
>>
>> For example, oracle uses aio + O_DIRECT. It seems to be working... ;)
>> As an alternative, there are multiple single-threaded db_writer processes.
>> Why do you say O_DIRECT is strictly sync?
>
> I mean that O_DIRECT write() blocks until I/O really is done.
> Normal write can block for much less, or not at all.

So we either move that blocking into something like fdatasync_area(),
requiring two syscalls ([m]write and fdatasync_area) instead of just
one, or use async notifications (kevent anyone? ;) when the queued
writes complete. The latter is probably more interesting.

Here again, we'll have two issues: that same error handling (when
the write fails due to bad disk etc), and a new one - ordering.
Which - probably - isn't an issue, I'm not sure.
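For what it's worth, something close to a per-range fdatasync already exists on Linux as sync_file_range(); a hedged sketch of the two-syscall variant (file name and buffer are made up, and note this flushes neither the disk's own write cache nor the file metadata):

/* Sketch: ordinary buffered write, then a range-limited flush. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char buf[1 << 20];               /* 1Mb of example data */
    int fd = open("outfile.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    memset(buf, 'x', sizeof(buf));

    /* Syscall 1: buffered write, may return before any I/O is done. */
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
        perror("write");
        return 1;
    }
    /* Syscall 2: start writeback of exactly this byte range and wait
     * for it to complete. */
    if (sync_file_range(fd, 0, sizeof(buf),
                        SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER) != 0) {
        perror("sync_file_range");
        return 1;
    }
    close(fd);
    return 0;
}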

>> In either case - I provided some real numbers in this thread before.
>> Yes, O_DIRECT has its problems, even security problems. But the thing
>> is - it is working, and working WAY better - from the performance point
>> of view - than "indirect" I/O, and currently there's no alternative that
>> works as good as O_DIRECT.
>
> Why we bothered to write Linux at all?
> There were other Unixes which worked ok.

Denis, please realize - I'm not an "oracle guy" (or "database guy" or whatever).
I'm not really a developer even - mostly a sysadmin (but I wrote some software,
and designed some (fairly small) APIs too - not kernel work, but still). I have
some good experience with running oracle stuff (and by the way, I hate it due to
many different reasons, but due to other, including historical (there's a lot of code
which has been written and which works with this damn thing) reasons I'm sorta
stuck with it for some (more) time).

I tried to understand how it all works at an application (oracle) level. I know
that *now* *in linux* O_DIRECT IS here, and orrible uses it, and without it the
whole thing is just damn slow. After this thread I also know that O_DIRECT has
some... issues (I'm still unsure if I got it right), and that there are other
possible solutions to it. I also knew that using O_DIRECT (again, as implemented
already) is a way to work around some other issues (like, f.e., when copying
data from one disk to another using dd - without O_DIRECT, the system almost
stops working during the whole copy operation, and the copy is very slow; but
with O_DIRECT, it's both faster and allows the system to work the same as before
during the copy).

I spent quite a lot of time watching how the damn thing (orrible db, that is)
works, and tuning it, and trying various storage settings, various filesystems
etc. I didn't like all this. But I became curious after all, because this
O_DIRECT thing behaves very differently on different filesystems and storage
configurations. That's probably a reason why I have been Cc'ed in this thread
in the first place (I posted some questions about O_DIRECT to LKML before),
and why I'm replying, too ;)

I'm not arguing against better interfaces. Not at all. More than that, I spent quite
some time designing interfaces for my software, to be as "pretty" as they
ever can be (both easy to use AND flexible, so that common tasks can be done
with ease, and all the "advanced" tasks will be possible too - the two contradict
each other in many cases) - I dislike "non-pretty" interfaces (which I've
seen a lot of, too, in other code ;)

The thing is - I don't know how it's all done in the kernel -- only some guesses.

Don't try to shoot me - I'm the wrong person to shoot.

With this orrible thing - it works now using O_DIRECT. Works on many different
platforms. It's a thing which predates linux, by the way. It's important on
the market, and is being used in many "big" places. For a long time. With all
its lack of good taste, etc. (I'm not sure it was oracle who "invented"
O_DIRECT, I think they used an existing interface, it turned out to be good for
something, and other OSes followed, incl. linux, and it started being used for
other things too).

How about talking with those people (there are people from oracle here on LKML)
about all this for example? With postgresql people?

By the way, it's not only a question of the prettiness of an interface. It also
has something to do with existing code and architecture.

/mjt

2007-01-21 20:04:31

by Denys Vlasenko

[permalink] [raw]
Subject: Re: O_DIRECT question

On Sunday 21 January 2007 13:09, Michael Tokarev wrote:
> Denis Vlasenko wrote:
> > On Saturday 20 January 2007 21:55, Michael Tokarev wrote:
> >> Denis Vlasenko wrote:
> >>> On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
> >>>> example, which isn't quite possible now from userspace. But as long as
> >>>> O_DIRECT actually writes data before returning from write() call (as it
> >>>> seems to be the case at least with a normal filesystem on a real block
> >>>> device - I don't touch corner cases like nfs here), it's pretty much
> >>>> THE ideal solution, at least from the application (developer) standpoint.
> >>> Why do you want to wait while 100 megs of data are being written?
> >>> You _have to_ have threaded db code in order to not waste
> >>> gobs of CPU time on UP + even with that you eat context switch
> >>> penalty anyway.
> >> Usually it's done using aio ;)
> >>
> >> It's not that simple really.
> >>
> >> For reads, you have to wait for the data anyway before doing something
> >> with it. Omitting reads for now.
> >
> > Really? All 100 megs _at once_? Linus described fairly simple (conceptually)
> > idea here: http://lkml.org/lkml/2002/5/11/58
> > In short, page-aligned read buffer can be just unmapped,
> > with page fault handler catching accesses to yet-unread data.
> > As data comes from disk, it gets mapped back in process'
> > address space.
>
> > This way read() returns almost immediately and CPU is free to do
> > something useful.
>
> And what does the application do during that page fault? Waits for the read
> to actually complete? How is it different from a regular (direct or not)
> read?

The difference is that you block exactly when you try to access
data which is not there yet, not sooner (potentially much sooner).

If application (e.g. database) needs to know whether data is _really_ there,
it should use aio_read (or something better, something which doesn't use signals.
Do we have this 'something'? I honestly don't know).

In some cases, even this is not needed because you don't have any other
things to do, so you just do read() (which returns early), and chew on
data. If your CPU is fast enough and processing of data is light enough
so that it outruns the disk - big deal, you block in the page fault handler
whenever a page is not read for you in time.
If the CPU isn't fast enough, your CPU and disk subsystem are nicely working
in parallel.

With O_DIRECT, you alternate:
"CPU is idle, disk is working" / "CPU is working, disk is idle".

> Well, it IS different: now we can't predict *when* exactly we'll sleep waiting
> for the read to complete. And also, now we're in an unknown-corner-case when
> an I/O error occurs, too (I/O errors interact badly with things like mmap, and
> this looks more like mmap than like actual read).
>
> Yes, this way we'll fix the problems in the current O_DIRECT way of doing things -
> all those races and design stupidity etc. Yes, it may work, provided those
> "corner cases" like I/O error problems will be fixed.

What do you want to do on I/O error? I guess you cannot do much -
any sensible db will shutdown itself. When your data storage
starts to fail, it's pointless to continue running.

You do not need to know which read() exactly failed due to bad disk.
Filename and offset from the start is enough. Right?

So, SIGIO/SIGBUS can provide that, and if your handler is of
void (*sa_sigaction)(int, siginfo_t *, void *);
style, you can get fd, memory address of the fault, etc.
Probably kernel can even pass file offset somewhere in siginfo_t...
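Roughly this handler shape, as an illustration (with SIGBUS you get the faulting address in si_addr; turning that back into a file and offset is left entirely to the application):

/* Sketch: install an SA_SIGINFO handler for SIGBUS. */
#include <signal.h>
#include <string.h>
#include <unistd.h>

static void io_error_handler(int sig, siginfo_t *si, void *ctx)
{
    /* si->si_addr holds the memory address whose backing I/O failed;
     * only async-signal-safe calls are allowed in here. */
    static const char msg[] = "SIGBUS: I/O error on a mapped file\n";
    (void)sig; (void)si; (void)ctx;
    write(2, msg, sizeof(msg) - 1);
    _exit(1);
}

int main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = io_error_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, NULL);

    /* ... mmap() a file and touch its pages here ... */
    return 0;
}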

> And yes, sometimes
> it's not really that interesting to know when exactly we'll sleep actually
> waiting for the I/O - during read or during some memory access...

It differs from a performance perspective, as discussed above.

> There may be other reasons to "want" those extra context switches.
> I mentioned above that oracle doesn't use threads, but processes.

You can still be multithreaded. The point is, with O_DIRECT
you _are forced to be_ multithreaded, or else performance will suck.

> > Assume that we have "clever writes" like Linus described.
> >
> > /* something like "caching i/o over this fd is mostly useless" */
> > /* (looks like this API is easier to transition to
> > * than fadvise etc. - it's "looks like" O_DIRECT) */
> > fd = open(..., flags|O_STREAM);
> > ...
> > /* Starts writeout immediately due to O_STREAM,
> > * marks buf100meg's pages R/O to catch modifications,
> > * but doesn't block! */
> > write(fd, buf100meg, 100*1024*1024);
>
> And how do we know when the write completes?
>
> > /* We are free to do something useful in parallel */
> > sort();
>
> .. which is done in another process, already started.

You think "Oracle". But this application may very well be
not Oracle, but diff, or dd, or KMail. I don't want to care.
I want all big writes to be efficient, not just those done by Oracle.
*Including* single threaded ones.

> > Why we bothered to write Linux at all?
> > There were other Unixes which worked ok.
>
> Denis, please realize - I'm not an "oracle guy" (or "database guy" or whatever).
> I'm not really a developer even - mostly a sysadmin (but I wrote some software,
> and designed some (fairly small) APIs too - not kernel work, but still). I have
> some good experience with running oracle stuff (and by the way, I hate it due to
> many different reasons, but due to other, including historical (there's a lot of code
> which has been written and which works with this damn thing) reasons I'm sorta
> stuck with it for some (more) time).

Well, I too currently work with Oracle.
Apparently the people who wrote the damn thing have a very, eh, Oracle-centric
world-view. "We want direct writes to the disk. Period." Why? Does it
make sense? Are there better ways? - nothing. They think they know better.

(And let's not even start on why oracle ignores SIGTERM. Apparently Unix
rules aren't for them. They're too big to play by rules.)
--
vda

2007-01-22 01:47:38

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: O_DIRECT question

Hello everyone,

This is a long thread about O_DIRECT, surprisingly without a single
bug report in it - that's a good sign that O_DIRECT is starting to work
well in 2.6 too ;)

On Fri, Jan 12, 2007 at 02:47:48PM -0800, Andrew Morton wrote:
> On Fri, 12 Jan 2007 15:35:09 -0700
> Erik Andersen <[email protected]> wrote:
>
> > On Fri Jan 12, 2007 at 05:09:09PM -0500, Linus Torvalds wrote:
> > > I suspect a lot of people actually have other reasons to avoid caches.
> > >
> > > For example, the reason to do O_DIRECT may well not be that you want to
> > > avoid caching per se, but simply because you want to limit page cache
> > > activity. In which case O_DIRECT "works", but it's really the wrong thing
> > > to do. We could export other ways to do what people ACTUALLY want, that
> > > doesn't have the downsides.
> >
> > I was rather fond of the old O_STREAMING patch by Robert Love,
>
> That was an akpm patch which I did for the Digeo kernel. Robert picked it
> up to dehackify it and get it into mainline, but we ended up deciding that
> posix_fadvise() was the way to go because it's standards-based.
>
> It's a bit more work in the app to use posix_fadvise() well. But the
> results will be better. The app should also use sync_file_range()
> intelligently to control its pagecache use.
>
> The problem with all of these things is that the application needs to be
> changed, and people often cannot do that. If we want a general way of

And if the application needs to be changed then IMHO it sounds better
to go the last mile and to use O_DIRECT instead of O_STREAMING to run
in zerocopy. Benchmarks have been posted here as well to show what
kind of difference O_DIRECT can make. O_STREAMING really shouldn't
exist and all O_STREAMING users should be converted to
O_DIRECT.

The only reason O_DIRECT exists is to bypass the pagecache and to run
in zerocopy, to avoid all pagecache lookups and locking, to preserve
cpu caches, to avoid losing smp scalability on the memory bus in
non-numa systems, and to avoid the general cpu overhead of copying the
data with the cpu for no good reason. The cache-pollution avoidance
that O_STREAMING and fadvise can also provide is an almost
uninteresting feature.

I'm afraid databases aren't totally stupid here in using O_DIRECT; the
caches they keep in ram aren't necessarily always a 1:1 mapping of the
on-disk data, so replacing O_DIRECT with a MAP_SHARED of the source
file wouldn't be the best even if they could be convinced to trust
the OS instead of insisting on bypassing it (and if they could combine
MAP_SHARED with asyncio somehow). They don't have problems trusting
the OS when they map tmpfs as MAP_SHARED after all... Why waste
time copying the data through the pagecache if the pagecache itself won't
be useful when the db is properly tuned?

Linus may be right that perhaps one day the CPU will be so much faster
than disk that such a copy will not be measurable and then O_DIRECT
could be downgraded to O_STREAMING or an fadvise. If such a day will
come by, probably that same day Dr. Tanenbaum will be finally right
about his OS design too.

Storage speed is growing along with cpu speeds, especially with contiguous
I/O and fast raid storage, so I don't find it very likely that
we can ignore those memory copies any time soon. Perhaps an average
amd64 desktop system with a single sata disk may never get a real
benefit from O_DIRECT compared to O_STREAMING, but that's not the
point as linux doesn't only run on desktops with a single SATA disk
running at only 50M/sec (and abysmal performance while seeking).

With regard to the locking mess, O_DIRECT already falls back to buffered
mode while creating new blocks and uses proper locking to serialize
against i_size changes (by sct). Filling holes and i_size changes are
the forbidden sins of O_DIRECT. The rest is just a matter of cache
invalidates or cache flushes run at the right time.

With more recent 2.6 changes, even further complexity has been
introduced to allow the mapped cache to see O_DIRECT writes; I've never
been convinced that this was really useful. There was nothing wrong with
having a not-uptodate page mapped in userland (except to work around an
artificial BUG_ON that tried to enforce that artificial invariant for
no apparent required reason), but it should work ok and it can be seen
as a new feature.

2007-01-22 15:52:46

by Phillip Susi

[permalink] [raw]
Subject: Re: O_DIRECT question

Denis Vlasenko wrote:
> What will happen if we just make open ignore O_DIRECT? ;)
>
> And then anyone who feels sad about it is advised to do it
> as described here:
>
> http://lkml.org/lkml/2002/5/11/58

Then databases and other high performance IO users will be broken. Most
of Linus's rant there is being rehashed now in this thread, and it has
been pointed out that using mmap instead is unacceptable because it is
inherently _synchronous_ and the app can not tolerate the page faults on
read, and handling IO errors during the page fault is impossible/highly
problematic.


2007-01-22 15:58:20

by Al Boldi

[permalink] [raw]
Subject: Re: O_DIRECT question

Andrea Arcangeli wrote:
> Linus may be right that perhaps one day the CPU will be so much faster
> than disk that such a copy will not be measurable and then O_DIRECT
> could be downgraded to O_STREAMING or an fadvise. If such a day will
> come by, probably that same day Dr. Tanenbaum will be finally right
> about his OS design too.

Dr. T. is probably right about his OS design; it's just that people aren't ready
for it yet.


Thanks!

--
Al

2007-01-22 16:17:45

by Phillip Susi

[permalink] [raw]
Subject: Re: O_DIRECT question

Denis Vlasenko wrote:
> The difference is that you block exactly when you try to access
> data which is not there yet, not sooner (potentially much sooner).
>
> If application (e.g. database) needs to know whether data is _really_ there,
> it should use aio_read (or something better, something which doesn't use signals.
> Do we have this 'something'? I honestly don't know).

The application _IS_ using aio, which is why it can go and perform other
work while it waits to be told that the read has completed. This is not
possible with mmap because the task is blocked while faulting in pages,
and unless it tries to access the pages, they won't be faulted in.

> In some cases, even this is not needed because you don't have any other
> things to do, so you just do read() (which returns early), and chew on
> data. If your CPU is fast enough and processing of data is light enough
> so that it outruns the disk - big deal, you block in the page fault handler
> whenever a page is not read for you in time.
> If the CPU isn't fast enough, your CPU and disk subsystem are nicely working
> in parallel.

Being blocked in the page fault handler means the cpu is now idle
because you can't go chew on data that _IS_ in core. The aio + O_DIRECT
allows you to control when IO is started rather than rely on the kernel
to decide when is a good time for readahead, and to KNOW when that IO is
done so you can chew on the data.
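To make that concrete, a minimal POSIX aio sketch of the pattern (illustrative only - a real database typically uses the kernel's native io_submit interface instead; file name and sizes are made up, link with -lrt):

/* Sketch: queue one O_DIRECT read, do other work, then collect the
 * result (or the error) for exactly that request. */
#define _GNU_SOURCE
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    struct aiocb cb;
    const struct aiocb *list[1] = { &cb };
    void *buf;
    int fd = open("datafile.dbf", O_RDONLY | O_DIRECT);

    if (fd < 0 || posix_memalign(&buf, 4096, 8192) != 0) {
        perror("setup");
        return 1;
    }
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = 8192;
    cb.aio_offset = 0;

    /* Queue the read; this returns immediately. */
    if (aio_read(&cb) != 0) {
        perror("aio_read");
        return 1;
    }

    /* ... chew on data that IS already in core while the disk works ... */

    /* Wait for completion and pick up the result or the error. */
    aio_suspend(list, 1, NULL);
    if (aio_error(&cb) == 0)
        printf("read %zd bytes\n", aio_return(&cb));
    else
        fprintf(stderr, "aio read failed: %s\n", strerror(aio_error(&cb)));

    free(buf);
    close(fd);
    return 0;
}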

> With O_DIRECT, you alternate:
> "CPU is idle, disk is working" / "CPU is working, disk is idle".

You have this completely backwards. With mmap this is what you get
because you chew data, page fault... chew data... page fault...

> What do you want to do on I/O error? I guess you cannot do much -
> any sensible db will shutdown itself. When your data storage
> starts to fail, it's pointless to continue running.

Ever hear of error recovery? A good db will be able to cope with one or
two bad blocks, or at the very least continue operating the other tables
or databases it is hosting, or flush transactions and switch to read
only mode, or any number of things other than abort().

> You do not need to know which read() exactly failed due to bad disk.
> Filename and offset from the start is enough. Right?
>
> So, SIGIO/SIGBUS can provide that, and if your handler is of
> void (*sa_sigaction)(int, siginfo_t *, void *);
> style, you can get fd, memory address of the fault, etc.
> Probably kernel can even pass file offset somewhere in siginfo_t...

Sure... now what does your signal handler have to do in order to handle
this error in such a way as to allow the one request to be failed and
the task to continue handling other requests? I don't think this is
even possible, let alone clean.

> You can still be multithreaded. The point is, with O_DIRECT
> you _are forced to be_ multithreaded, or else performance will suck.

Or use aio. Simple read/write with the kernel trying to outsmart the
application is nice for very simple applications, but it does not
provide very good performance. This is why we have aio and O_DIRECT;
because the application can manage the IO better than the kernel because
it actually knows what it needs and when.

Yes, the application ends up being more complex, but that is the price
you pay. You simply can't get it perfect in a general purpose kernel
that has to guess what the application is really trying to do.

> You think "Oracle". But this application may very well be
> not Oracle, but diff, or dd, or KMail. I don't want to care.
> I want all big writes to be efficient, not just those done by Oracle.
> *Including* single threaded ones.

Then redesign those applications to use aio and O_DIRECT. Incidentally
I have hacked up dd to do just that and have some very nice performance
numbers as a result.

> Well, I too currently work with Oracle.
> Apparently the people who wrote the damn thing have a very, eh, Oracle-centric
> world-view. "We want direct writes to the disk. Period." Why? Does it
> make sense? Are there better ways? - nothing. They think they know better.

Nobody has shown otherwise to date.


2007-01-24 21:18:06

by Denys Vlasenko

[permalink] [raw]
Subject: Re: O_DIRECT question

On Monday 22 January 2007 17:17, Phillip Susi wrote:
> > You do not need to know which read() exactly failed due to bad disk.
> > Filename and offset from the start is enough. Right?
> >
> > So, SIGIO/SIGBUS can provide that, and if your handler is of
> > void (*sa_sigaction)(int, siginfo_t *, void *);
> > style, you can get fd, memory address of the fault, etc.
> > Probably kernel can even pass file offset somewhere in siginfo_t...
>
> Sure... now what does your signal handler have to do in order to handle
> this error in such a way as to allow the one request to be failed and
> the task to continue handling other requests? I don't think this is
> even possible, let alone clean.

Actually, you have convinced me on this. While it is possible
to report the error to userspace, it will be highly nontrivial (read:
bug-prone) for userspace to catch and act on the errors.

> > You think "Oracle". But this application may very well be
> > not Oracle, but diff, or dd, or KMail. I don't want to care.
> > I want all big writes to be efficient, not just those done by Oracle.
> > *Including* single threaded ones.
>
> Then redesign those applications to use aio and O_DIRECT. Incidentally
> I have hacked up dd to do just that and have some very nice performance
> numbers as a result.

I will still disagree on this point (on the point "use O_DIRECT, it's faster").
There is no reason why O_DIRECT should be faster than "normal" read/write
to a large, aligned buffer. If O_DIRECT is faster on today's kernel,
then Linux' read()/write() can be optimized more.

(I hoped that they can be made even *faster* than O_DIRECT, but as I said,
you convinced me with your "error reporting" argument that reads must still
block until entire buffer is read. Writes can avoid that - apps can do
fdatasync/whatever to make sync writes & error checks if they want).
--
vda

2007-01-25 15:44:10

by Phillip Susi

[permalink] [raw]
Subject: Re: O_DIRECT question

Denis Vlasenko wrote:
> I will still disagree on this point (on the point "use O_DIRECT, it's faster").
> There is no reason why O_DIRECT should be faster than "normal" read/write
> to a large, aligned buffer. If O_DIRECT is faster on today's kernel,
> then Linux' read()/write() can be optimized more.

Ahh but there IS a reason for it to be faster: the application knows
what data it will require, so it should tell the kernel rather than ask
it to guess. Even if you had the kernel playing vmsplice games to get
avoid the copy to user space ( which still has a fair amount of overhead
), then you still have the problem of the kernel having to guess what
data the application will require next, and try to fetch it early. Then
when the application requests the data, if it is not already in memory,
the application blocks until it is, and blocking stalls the pipeline.
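For comparison, the advisory route looks roughly like this (a sketch with made-up names and sizes); note that POSIX_FADV_WILLNEED is still only a hint, and the read below still blocks if the data didn't arrive in time - which is the pipeline stall I'm talking about:

/* Sketch: ask for readahead of the next chunk while processing the
 * current one, instead of relying on the kernel's heuristics. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK (512 * 1024)

int main(void)
{
    static char buf[CHUNK];
    off_t off = 0;
    ssize_t n;
    int fd = open("datafile.dbf", O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (;;) {
        /* Tell the kernel what we will want next, before we need it. */
        posix_fadvise(fd, off + CHUNK, CHUNK, POSIX_FADV_WILLNEED);

        n = pread(fd, buf, CHUNK, off);
        if (n <= 0)
            break;
        /* ... process buf; with luck the next chunk is arriving meanwhile ... */
        off += n;
    }
    close(fd);
    return 0;
}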

> (I hoped that they can be made even *faster* than O_DIRECT, but as I said,
> you convinced me with your "error reporting" argument that reads must still
> block until entire buffer is read. Writes can avoid that - apps can do
> fdatasync/whatever to make sync writes & error checks if they want).


fdatasync() is not acceptable either because it flushes the entire file.
This does not allow the application to control the ordering of various
writes unless it limits itself to a single write/fdatasync pair at a
time. Further, fdatasync again blocks the application.

With aio, the application can keep several read/writes going in
parallel, thus keeping the pipeline full. Even if the io were not
O_DIRECT, and the kernel played vmsplice games to avoid the copy, it
would still have more overhead and complexity and, I think, very little gain
in most cases.


2007-01-25 17:40:48

by Denys Vlasenko

[permalink] [raw]
Subject: Re: O_DIRECT question

On Thursday 25 January 2007 16:44, Phillip Susi wrote:
> Denis Vlasenko wrote:
> > I will still disagree on this point (on the point "use O_DIRECT, it's faster").
> > There is no reason why O_DIRECT should be faster than "normal" read/write
> > to a large, aligned buffer. If O_DIRECT is faster on today's kernel,
> > then Linux' read()/write() can be optimized more.
>
> Ahh but there IS a reason for it to be faster: the application knows
> what data it will require, so it should tell the kernel rather than ask
> it to guess. Even if you had the kernel playing vmsplice games to get
> avoid the copy to user space ( which still has a fair amount of overhead
> ), then you still have the problem of the kernel having to guess what
> data the application will require next, and try to fetch it early. Then
> when the application requests the data, if it is not already in memory,
> the application blocks until it is, and blocking stalls the pipeline.
>
> > (I hoped that they can be made even *faster* than O_DIRECT, but as I said,
> > you convinced me with your "error reporting" argument that reads must still
> > block until entire buffer is read. Writes can avoid that - apps can do
> > fdatasync/whatever to make sync writes & error checks if they want).
>
>
> fdatasync() is not acceptable either because it flushes the entire file.

If you opened a file and are doing only O_DIRECT writes, you
*always* have your written data flushed, by each write().
How is it different from writes done using
"normal" write() + fdatasync() pairs?

> This does not allow the application to control the ordering of various
> writes unless it limits itself to a single write/fdatasync pair at a
> time. Further, fdatasync again blocks the application.

Ahhh shit, are you saying that fdatasync will wait until writes
*by all other processes* to this file hit the disk?
Is that true?

--
vda

2007-01-25 19:28:22

by Phillip Susi

[permalink] [raw]
Subject: Re: O_DIRECT question

Denis Vlasenko wrote:
> If you opened a file and are doing only O_DIRECT writes, you
> *always* have your written data flushed, by each write().
> How is it different from writes done using
> "normal" write() + fdatasync() pairs?

Because you can do writes async, but not fdatasync ( unless there is an
async version I don't know about ).

> Ahhh shit, are you saying that fdatasync will wait until writes
> *by all other processes* to this file hit the disk?
> Is that true?


I think all processes yes, but certainly all writes to this file by this
process. That means you have to sync for every write, which means you
block. Blocking stalls the pipeline.

2007-01-25 19:54:55

by Denys Vlasenko

[permalink] [raw]
Subject: Re: O_DIRECT question

On Thursday 25 January 2007 20:28, Phillip Susi wrote:
> > Ahhh shit, are you saying that fdatasync will wait until writes
> > *by all other processes* to this file hit the disk?
> > Is that true?
>
> I think all processes yes, but certainly all writes to this file by this
> process. That means you have to sync for every write, which means you
> block. Blocking stalls the pipeline.

I don't understand you here. Suppose fdatasync() is "do not return until
all cached writes to this file *done by the current process* hit the disk
(i.e. cached write data from other concurrent processes is not waited for),
report success or error code". Then

write(fd_O_DIRECT, buf, sz) - will wait until buf's data hit the disk

write(fd, buf, sz) - potentially will return sooner, but
fdatasync(fd) - will wait until buf's data hit the disk

Looks the same to me.

> > If you opened a file and are doing only O_DIRECT writes, you
> > *always* have your written data flushed, by each write().
> > How is it different from writes done using
> > "normal" write() + fdatasync() pairs?
>
> Because you can do writes async, but not fdatasync ( unless there is an
> async version I don't know about ).

You mean "You can use aio_write" ?
--
vda

2007-01-25 20:03:46

by Phillip Susi

[permalink] [raw]
Subject: Re: O_DIRECT question

Denis Vlasenko wrote:
> You mean "You can use aio_write" ?

Exactly. You generally don't use O_DIRECT without aio. Combining the
two is what gives the big win.

2007-01-25 20:45:47

by Michael Tokarev

[permalink] [raw]
Subject: Re: O_DIRECT question

Phillip Susi wrote:
> Denis Vlasenko wrote:
>> You mean "You can use aio_write" ?
>
> Exactly. You generally don't use O_DIRECT without aio. Combining the
> two is what gives the big win.

Well, it's not only aio. Multithreaded I/O also helps a lot -- all this,
say, to utilize a raid array with many spindles.

But even single-threaded I/O in large quantities benefits from O_DIRECT
significantly, and I pointed this out before. It's like enabling a write
cache on disk AND doing intensive random writes - the cache - surprisingly -
slows the whole thing down by 5..10%.

/mjt

2007-01-25 21:13:57

by Denys Vlasenko

[permalink] [raw]
Subject: Re: O_DIRECT question

On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> Phillip Susi wrote:
> > Denis Vlasenko wrote:
> >> You mean "You can use aio_write" ?
> >
> > Exactly. You generally don't use O_DIRECT without aio. Combining the
> > two is what gives the big win.
>
> Well, it's not only aio. Multithreaded I/O also helps alot -- all this,
> say, to utilize a raid array with many spindles.
>
> But even single-threaded I/O but in large quantities benefits from O_DIRECT
> significantly, and I pointed this out before.

Which shouldn't be true. There is no fundamental reason why
ordinary writes should be slower than O_DIRECT.
--
vda

2007-01-26 15:55:22

by Bill Davidsen

[permalink] [raw]
Subject: Re: O_DIRECT question

Denis Vlasenko wrote:

> Well, I too currently work with Oracle.
> Apparently the people who wrote the damn thing have a very, eh, Oracle-centric
> world-view. "We want direct writes to the disk. Period." Why? Does it
> make sense? Are there better ways? - nothing. They think they know better.
>
I fear you are taking the Windows approach, that the computer is there
to serve the o/s and applications have to do things the way the o/s
wants. As opposed to the UNIX way, where you can be either clever or
stupid, and the o/s is there to allow you to use the hardware, not to be your
mother.

Currently applications have the option of letting the o/s make decisions
via open/read/write, letting the o/s make decisions and tell the
application via aio, or using O_DIRECT and having full control over the
process. And that's exactly as it should be. It's not up to the o/s to
be your mother.

> (And let's not even start on why oracle ignores SIGTERM. Apparently Unix
> rules aren't for them. They're too big to play by rules.)

Any process can ignore SIGTERM, or do a significant amount of cleanup
before exit()ing. Complex operations need to be completed or unwound.
Why single out Oracle? Other applications may also do that, with more or
less valid reasons.

--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot

2007-01-26 16:02:53

by Mark Lord

[permalink] [raw]
Subject: Re: O_DIRECT question

You guys need to back up in this thread.

Every example of O_DIRECT here could be replaced with
calls to mmap(), msync(), and madvise() (or posix_fadvise).

In addition to being at least as fast as O_DIRECT,
these have the added benefit of using the page cache
(avoiding reads for data already present, handling multiple
users of the same data, etc..).

2007-01-26 16:53:33

by Viktor

[permalink] [raw]
Subject: Re: O_DIRECT question

Mark Lord wrote:
> You guys need to back up in this thread.
>
> Every example of O_DIRECT here could be replaced with
> calls to mmap(), msync(), and madvise() (or posix_fadvise).

No. How about handling IO errors? There is no practical way for it with
mmap().

> In addition to being at least as fast as O_DIRECT,
> these have the added benefit of using the page cache
> (avoiding reads for data already present, handling multiple
> users of the same data, etc..).
>
>

2007-01-26 16:58:57

by Phillip Susi

[permalink] [raw]
Subject: Re: O_DIRECT question

Mark Lord wrote:
> You guys need to back up in this thread.
>
> Every example of O_DIRECT here could be replaced with
> calls to mmap(), msync(), and madvise() (or posix_fadvise).
>
> In addition to being at least as fast as O_DIRECT,
> these have the added benefit of using the page cache (avoiding reads for
> data already present, handling multiple
> users of the same data, etc..).

Please actually _read_ the thread. In every one of my posts I have
shown why this is not the case.

To briefly rehash the core of the argument, there is no way to
asynchronously manage IO with mmap, msync, madvise -- instead you take
page faults or otherwise block, thus stalling the pipeline.

2007-01-26 17:05:58

by Phillip Susi

[permalink] [raw]
Subject: Re: O_DIRECT question

Denis Vlasenko wrote:
> Which shouldn't be true. There is no fundamental reason why
> ordinary writes should be slower than O_DIRECT.

Again, there IS a reason: O_DIRECT eliminates the cpu overhead of the
kernel-user copy, and when coupled with multithreading or aio, allows
the IO queues to be kept full with useful transfers at all times.
Normal read/write requires the kernel to buffer and guess access
patterns correctly to perform read ahead and write behind perfectly to
keep the queues full. In practice, this does not happen perfectly all
of the time, or even most of the time, so it slows things down.


2007-01-26 18:24:50

by Bill Davidsen

[permalink] [raw]
Subject: Re: O_DIRECT question

Denis Vlasenko wrote:
> On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
>> Phillip Susi wrote:
>>> Denis Vlasenko wrote:
>>>> You mean "You can use aio_write" ?
>>> Exactly. You generally don't use O_DIRECT without aio. Combining the
>>> two is what gives the big win.
>> Well, it's not only aio. Multithreaded I/O also helps alot -- all this,
>> say, to utilize a raid array with many spindles.
>>
>> But even single-threaded I/O but in large quantities benefits from O_DIRECT
>> significantly, and I pointed this out before.
>
> Which shouldn't be true. There is no fundamental reason why
> ordinary writes should be slower than O_DIRECT.
>
Other than the copy to buffer taking CPU and memory resources.

--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot

2007-01-26 23:19:07

by Denys Vlasenko

[permalink] [raw]
Subject: Re: O_DIRECT question

On Friday 26 January 2007 18:05, Phillip Susi wrote:
> Denis Vlasenko wrote:
> > Which shouldn't be true. There is no fundamental reason why
> > ordinary writes should be slower than O_DIRECT.
>
> Again, there IS a reason: O_DIRECT eliminates the cpu overhead of the
> kernel-user copy,

You assume that ordinary read()/write() is *required* to do the copying.
It isn't. The kernel is allowed to do direct DMAing in this case too.

> and when coupled with multithreading or aio, allows
> the IO queues to be kept full with useful transfers at all times.

Again, ordinary I/O is no different. Especially on fds opened with O_SYNC,
write() will behave very similarly to an O_DIRECT one - data is guaranteed
to hit the disk before write() returns.

> Normal read/write requires the kernel to buffer and guess access

No it doesn't *require* that.

> patterns correctly to perform read ahead and write behind perfectly to
> keep the queues full. In practice, this does not happen perfectly all
> of the time, or even most of the time, so it slows things down.

So let's fix the kernel for everyone's benefit instead of "give us
an API specifically for our needs".
--
vda

2007-01-26 23:37:48

by Denys Vlasenko

[permalink] [raw]
Subject: Re: O_DIRECT question

On Friday 26 January 2007 19:23, Bill Davidsen wrote:
> Denis Vlasenko wrote:
> > On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> >> Phillip Susi wrote:
> >>> Denis Vlasenko wrote:
> >>>> You mean "You can use aio_write" ?
> >>> Exactly. You generally don't use O_DIRECT without aio. Combining the
> >>> two is what gives the big win.
> >> Well, it's not only aio. Multithreaded I/O also helps alot -- all this,
> >> say, to utilize a raid array with many spindles.
> >>
> >> But even single-threaded I/O but in large quantities benefits from O_DIRECT
> >> significantly, and I pointed this out before.
> >
> > Which shouldn't be true. There is no fundamental reason why
> > ordinary writes should be slower than O_DIRECT.
> >
> Other than the copy to buffer taking CPU and memory resources.

It is not required by any standard that I know. Kernel can be smarter
and avoid that if it can.
--
vda

2007-01-27 14:01:37

by Bodo Eggert

[permalink] [raw]
Subject: Re: O_DIRECT question

Denis Vlasenko <[email protected]> wrote:
> On Friday 26 January 2007 19:23, Bill Davidsen wrote:
>> Denis Vlasenko wrote:
>> > On Thursday 25 January 2007 21:45, Michael Tokarev wrote:

>> >> But even single-threaded I/O but in large quantities benefits from
>> >> O_DIRECT significantly, and I pointed this out before.
>> >
>> > Which shouldn't be true. There is no fundamental reason why
>> > ordinary writes should be slower than O_DIRECT.
>> >
>> Other than the copy to buffer taking CPU and memory resources.
>
> It is not required by any standard that I know. Kernel can be smarter
> and avoid that if it can.

The kernel can also solve the halting problem if it can.

Do you really think entropy estimation code on all access patterns in the
system will be free as in beer, or be able to predict the access pattern of
random applications?
--
Top 100 things you don't want the sysadmin to say:
86. What do you mean that wasn't a copy?


2007-01-27 14:16:46

by Denys Vlasenko

[permalink] [raw]
Subject: Re: O_DIRECT question

On Saturday 27 January 2007 15:01, Bodo Eggert wrote:
> Denis Vlasenko <[email protected]> wrote:
> > On Friday 26 January 2007 19:23, Bill Davidsen wrote:
> >> Denis Vlasenko wrote:
> >> > On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
>
> >> >> But even single-threaded I/O but in large quantities benefits from
> >> >> O_DIRECT significantly, and I pointed this out before.
> >> >
> >> > Which shouldn't be true. There is no fundamental reason why
> >> > ordinary writes should be slower than O_DIRECT.
> >> >
> >> Other than the copy to buffer taking CPU and memory resources.
> >
> > It is not required by any standard that I know. Kernel can be smarter
> > and avoid that if it can.
>
> The kernel can also solve the halting problem if it can.
>
> Do you really think an entropy estamination code on all access patterns in the
> system will be free as in beer,

Actually I think we need this heuristic:

if (opened_with_O_STREAM && buffer_is_aligned
&& io_size_is_a_multiple_of_sectorsize)
do_IO_directly_to_user_buffer_without_memcpy

is not *that* complicated.

I think that we can get rid of O_DIRECT's peculiar requirements
"you *must* not cache me" + "you *must* write me directly to bare metal"
by replacing them with O_STREAM ("*advice* to not cache me") + O_SYNC
("write() should return only when data is written to storage, not sooner").

Why?

Because these O_DIRECT "musts" are rather unusual and overkill. Apps
should not have that much control over what the kernel does internally;
and also O_DIRECT was mixing shampoo and conditioner in one bottle
(no-cache and sync writes) - bad API.
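What this would look like from the application side can be approximated with what exists today (a sketch: O_STREAM is the hypothetical flag from this thread, so POSIX_FADV_DONTNEED after the write stands in for the "advice to not cache me" half; file name and buffer are made up):

/* Sketch: sync write plus "don't cache me" advice instead of O_DIRECT. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char buf[64 * 1024];             /* example payload */
    /* O_SYNC: write() returns only when the data has reached storage. */
    int fd = open("outfile.dat", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    memset(buf, 'y', sizeof(buf));

    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
        perror("write");
        return 1;
    }
    /* The "advice to not cache me" half: drop the now-clean pages. */
    posix_fadvise(fd, 0, sizeof(buf), POSIX_FADV_DONTNEED);

    close(fd);
    return 0;
}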
--
vda

2007-01-28 15:19:49

by Bill Davidsen

[permalink] [raw]
Subject: Re: O_DIRECT question

Denis Vlasenko wrote:
> On Friday 26 January 2007 19:23, Bill Davidsen wrote:
>> Denis Vlasenko wrote:
>>> On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
>>>> Phillip Susi wrote:

[...]

>>>> But even single-threaded I/O but in large quantities benefits from O_DIRECT
>>>> significantly, and I pointed this out before.
>>> Which shouldn't be true. There is no fundamental reason why
>>> ordinary writes should be slower than O_DIRECT.
>>>
>> Other than the copy to buffer taking CPU and memory resources.
>
> It is not required by any standard that I know. Kernel can be smarter
> and avoid that if it can.

Actually, no, the whole idea of the page cache is that overall system i/o
can be faster if data sits in the page cache for a while. But the real
problem is that the application write is now disconnected from the
physical write, both in time and order.

No standard says the kernel couldn't do direct DMA, but since requiring
it is what guarantees write order and error status linked
to the actual application i/o, what a kernel "might do" is irrelevant.

It's much easier to do O_DIRECT by actually doing the direct i/o than to
try to catch all the corner cases which arise in faking it.

--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot

2007-01-28 15:31:17

by Bill Davidsen

[permalink] [raw]
Subject: Re: O_DIRECT question

Denis Vlasenko wrote:
> On Saturday 27 January 2007 15:01, Bodo Eggert wrote:
>> Denis Vlasenko <[email protected]> wrote:
>>> On Friday 26 January 2007 19:23, Bill Davidsen wrote:
>>>> Denis Vlasenko wrote:
>>>>> On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
>>>>>> But even single-threaded I/O but in large quantities benefits from
>>>>>> O_DIRECT significantly, and I pointed this out before.
>>>>> Which shouldn't be true. There is no fundamental reason why
>>>>> ordinary writes should be slower than O_DIRECT.
>>>>>
>>>> Other than the copy to buffer taking CPU and memory resources.
>>> It is not required by any standard that I know. Kernel can be smarter
>>> and avoid that if it can.
>> The kernel can also solve the halting problem if it can.
>>
>> Do you really think an entropy estamination code on all access patterns in the
>> system will be free as in beer,
>
> Actually I think we need this heuristic:
>
> if (opened_with_O_STREAM && buffer_is_aligned
> && io_size_is_a_multiple_of_sectorsize)
> do_IO_directly_to_user_buffer_without_memcpy
>
> is not *that* complicated.
>
> I think that we can get rid of O_DIRECT peculiar requirements
> "you *must* not cache me" + "you *must* write me directly to bare metal"
> by replacing it with O_STREAM ("*advice* to not cache me") + O_SYNC
> ("write() should return only when data is written to storage, not sooner").
>
> Why?
>
> Because these O_DIRECT "musts" are rather unusual and overkill. Apps
> should not have that much control over what kernel does internally;
> and also O_DIRECT was mixing shampoo and conditioner on one bottle
> (no-cache and sync writes) - bad API.

What a shame that other operating systems can manage to really support
O_DIRECT, and that major application software can use this api to write
portable code that works even on Windows.

You overlooked the problem that applications using this api assume that
reads are on bare metal as well; how do you address the case where
thread A does a write, thread B does a read? If you give thread B data
from a buffer and it then does a write to another file (which completes
before the write from thread A), and then the system crashes, you have
just put the files out of sync. So you may have to block all i/o for all
threads of the application to be sure that doesn't happen. Or introduce
some complex way to assure that all writes are physically done in
order... that sounds like a lock infested mess to me, assuming that you
could ever do it right.

Oracle has their own version of Linux now; do you think that they would
fork the application or the kernel?

--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot

2007-01-28 17:05:22

by Denys Vlasenko

[permalink] [raw]
Subject: Re: O_DIRECT question

On Sunday 28 January 2007 16:18, Bill Davidsen wrote:
> Denis Vlasenko wrote:
> > On Friday 26 January 2007 19:23, Bill Davidsen wrote:
> >> Denis Vlasenko wrote:
> >>> On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> >>>> Phillip Susi wrote:
>
> [...]
>
> >>>> But even single-threaded I/O but in large quantities benefits from O_DIRECT
> >>>> significantly, and I pointed this out before.
> >>> Which shouldn't be true. There is no fundamental reason why
> >>> ordinary writes should be slower than O_DIRECT.
> >>>
> >> Other than the copy to buffer taking CPU and memory resources.
> >
> > It is not required by any standard that I know. Kernel can be smarter
> > and avoid that if it can.
>
> Actually, no, the whole idea of page cache is that overall system i/o
> can be faster if data sit in the page cache for a while. But the real
> problem is that the application write is now disconnected from the
> physical write, both in time and order.

Not in O_SYNC case.

> No standard says the kernel couldn't do direct DMA, but since having
> that required is needed to guarantee write order and error status linked
> to the actual application i/o, what a kernel "might do" is irrelevant.
>
> It's much easier to do O_DIRECT by actually doing the direct i/o than to
> try to catch all the corner cases which arise in faking it.

I still don't see much difference between O_SYNC and O_DIRECT write
semantic.
--
vda

2007-01-28 17:20:27

by Denys Vlasenko

[permalink] [raw]
Subject: Re: O_DIRECT question

On Sunday 28 January 2007 16:30, Bill Davidsen wrote:
> Denis Vlasenko wrote:
> > On Saturday 27 January 2007 15:01, Bodo Eggert wrote:
> >> Denis Vlasenko <[email protected]> wrote:
> >>> On Friday 26 January 2007 19:23, Bill Davidsen wrote:
> >>>> Denis Vlasenko wrote:
> >>>>> On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> >>>>>> But even single-threaded I/O but in large quantities benefits from
> >>>>>> O_DIRECT significantly, and I pointed this out before.
> >>>>> Which shouldn't be true. There is no fundamental reason why
> >>>>> ordinary writes should be slower than O_DIRECT.
> >>>>>
> >>>> Other than the copy to buffer taking CPU and memory resources.
> >>> It is not required by any standard that I know. Kernel can be smarter
> >>> and avoid that if it can.
> >> The kernel can also solve the halting problem if it can.
> >>
> >> Do you really think an entropy estamination code on all access patterns in the
> >> system will be free as in beer,
> >
> > Actually I think we need this heuristic:
> >
> > if (opened_with_O_STREAM && buffer_is_aligned
> > && io_size_is_a_multiple_of_sectorsize)
> > do_IO_directly_to_user_buffer_without_memcpy
> >
> > is not *that* complicated.
> >
> > I think that we can get rid of O_DIRECT peculiar requirements
> > "you *must* not cache me" + "you *must* write me directly to bare metal"
> > by replacing it with O_STREAM ("*advice* to not cache me") + O_SYNC
> > ("write() should return only when data is written to storage, not sooner").
> >
> > Why?
> >
> > Because these O_DIRECT "musts" are rather unusual and overkill. Apps
> > should not have that much control over what kernel does internally;
> > and also O_DIRECT was mixing shampoo and conditioner on one bottle
> > (no-cache and sync writes) - bad API.
>
> What a shame that other operating systems can manage to really support
> O_DIRECT, and that major application software can use this api to write
> portable code that works even on Windows.
>
> You overlooked the problem that applications using this api assume that
> reads are on bare metal as well, how do you address the case where
> thread A does a write, thread B does a read? If you give thread B data
> from a buffer and it then does a write to another file (which completes
> before the write from thread A), and then the system crashes, you have
> just put the files out of sync.

So, applications synchronize their data integrity
by keeping data on the hard drive and relying on
"read goes to bare metal, so it can't see written data
before it gets written to bare metal". Wow, this is slow.
Are you talking about this scenario:

Bad:
fd = open(..., O_SYNC);
fork()
write(fd, buf); [1]
.... read(fd, buf2); [starts after write 1 started]
.... write(somewhere_else, buf2);
.... (write returns)
.... <---- crash point
(write returns)

This will be *very* slow - if you use O_DIRECT and do what
is depicted above, you write data, then you read it back,
which is slow. Why do you want that? Isn't it
much faster to just wait for write to complete, and allow
read to fetch (potentially) cached data?

Better:
fd = open(..., O_SYNC);
fork()
write(fd, buf); [1]
.... (wait for write to finish)
....
....
.... <---- crash point
(write returns)
.... read(fd, buf2); [starts after write 1 started]
.... write(somewhere_else, buf2);
.... (write returns)

> So you may have to block all i/o for all
> threads of the application to be sure that doesn't happen.

Not all, only related i/o.
--
vda

2007-01-29 15:43:35

by Phillip Susi

[permalink] [raw]
Subject: Re: O_DIRECT question

Denis Vlasenko wrote:
> I still don't see much difference between O_SYNC and O_DIRECT write
> semantic.

Yes, if you change the normal io paths to properly support playing
vmsplice games ( which have a number of corner cases ) to get the zero
copy, and support madvise() and O_SYNC to control caching behavior, and
fix all the error handling corner cases, then you may be able to do away
with O_DIRECT.

I believe that doing all that will be much more complex than O_DIRECT
however.

2007-01-29 16:58:54

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: O_DIRECT question

On Sun, Jan 28, 2007 at 06:03:08PM +0100, Denis Vlasenko wrote:
> I still don't see much difference between O_SYNC and O_DIRECT write
> semantic.

O_DIRECT is about avoiding the copy_user between cache and userland,
when working with devices that run faster than ram (think >=100M/sec,
quite standard hardware unless you've got only a desktop or you cannot
afford raid).

O_SYNC is about working around a buggy or underperforming VM growing the
dirty levels beyond optimal levels, or about opening logfiles that you want
to save to disk ASAP (most other journaling usages are better done
with fsync instead). Or you can mount the fs in sync mode when you
deal with users not capable of unmounting devices before unplugging
them. Ideally you should never need O_SYNC; when you need O_SYNC it's
usually a very bad sign. If you need O_DIRECT it's not a bad sign
(needing O_DIRECT is mostly a sign that you've got very fast storage).

The only case where I ever used O_SYNC myself is during backups (when
run on standard or mainline kernels that dirty half of ram during
the backup). For the logfiles I don't find it very useful; if anything I
log them remotely (when the system crashes, usually the logs won't hit the
disk anyway, so it's just slower).

I use "tar | dd oflag=sync" and that generates a huge speedup to the
rest of the system (not necessairly to the backup itself). Yes I could
use even oflag=direct, but I'm fine to pass through the cache (the
backup device runs at 10M/sec through USB, so the copy_user is _sure_
worth it, if something it will help, it will never be a measurable
slowdown), what is not fine is to see half of the ram dirty the whole
time... (hence the need of o_sync).

O_SYNC and O_DIRECT are useful for different scenarios.

2007-01-30 00:07:57

by Denys Vlasenko

[permalink] [raw]
Subject: Re: O_DIRECT question

On Monday 29 January 2007 18:00, Andrea Arcangeli wrote:
> On Sun, Jan 28, 2007 at 06:03:08PM +0100, Denis Vlasenko wrote:
> > I still don't see much difference between O_SYNC and O_DIRECT write
> > semantic.
>
> O_DIRECT is about avoiding the copy_user between cache and userland,
> when working with devices that run faster than ram (think >=100M/sec,
> quite standard hardware unless you've got only a desktop or you cannot
> afford raid).

Yes, I know that, but O_DIRECT is also "overloaded" with
O_SYNC-like semantics ("write doesn't return until data hits
physical media"). To have two orthogonal things "mixed together"
in one flag feels "not Unixy" to me. So I am trying to formulate
saner semantics. So far I think that this looks good:

O_SYNC - usual meaning
O_STREAM - do not try hard to cache me. This includes "if you can
(buffer is sufficiently aligned, yadda, yadda), do not
copy_user into pagecache but just DMA from userspace
pages" - exactly because user told us that he is not
interested in caching!

Then O_DIRECT is approximately = O_SYNC + O_STREAM, and I think
maybe Linus will not hate this "new" O_DIRECT - it doesn't
bypass pagecache.

> O_SYNC is about working around buggy or underperforming VM growing the
> dirty levels beyond optimal levels, or to open logfiles that you want
> to save to disk ASAP (most other journaling usages are better done
> with fsync instead).

I've got a feeling that db people use O_DIRECT (its O_SYNCy behaviour)
as a poor man's write barrier when they must be sure that their redo
logs have hit storage before they start to modify datafiles.
Another reason why they want sync writes is write error detection.
They cannot afford delaying it.
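Spelled out with plain syscalls, that ordering requirement is just write-ahead logging (a sketch with made-up file names - the only point is that the log write is known to be durable before the datafile write starts):

/* Sketch: make the redo record durable before touching the datafile. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    static const char redo[] = "redo record for block 12345\n";
    static const char data[] = "new contents of block 12345\n";
    int redo_fd = open("redo.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    int data_fd = open("datafile.dbf", O_WRONLY | O_CREAT, 0644);

    if (redo_fd < 0 || data_fd < 0) {
        perror("open");
        return 1;
    }
    /* 1. Write the redo record and wait until it is on storage. */
    if (write(redo_fd, redo, sizeof(redo) - 1) < 0 || fdatasync(redo_fd) != 0) {
        perror("redo write");
        return 1;
    }
    /* 2. Only now modify the datafile. */
    if (write(data_fd, data, sizeof(data) - 1) < 0) {
        perror("datafile write");
        return 1;
    }
    close(redo_fd);
    close(data_fd);
    return 0;
}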
--
vda

2007-01-30 18:50:52

by Phillip Susi

[permalink] [raw]
Subject: Re: O_DIRECT question

Andrea Arcangeli wrote:
> On Tue, Jan 30, 2007 at 10:36:03AM -0500, Phillip Susi wrote:
>> Did you intentionally drop this reply off list?
>
> No.

Then I'll restore the lkml to the cc list.

>> No, it doesn't... or at least can't report WHERE the error is.
>
> O_SYNC doesn't report where the error is either, try a write(fd, buf,
> 10*1024*1024).

It should return the number of bytes successfully written before the
error, giving you the location of the first error. Using smaller
individual writes ( preferably issued in parallel ) also allows the
problem spot to be isolated.

>> Typically you only want one sector of data to be written before you
>> continue. In the cases where you don't, this might be nice, but as I
>> said above, you can't handle errors properly.
>
> Sorry but you're dreaming if you're thinking anything in real life
> writes 512 bytes at a time with O_SYNC. Try that with any modern
> harddisk.

When you are writing a transaction log, you do; you don't need much
data, but you do need to be sure it has hit the disk before continuing.
You certainly aren't writing many MB across a dozen write() calls and
only then caring to make sure it is all flushed in an unknown order. When
order matters, you can not use fsync, which is one of the reasons why
databases use O_DIRECT; they care about the ordering.

>>> Just grep for fsync in the db code of your choice (try postgresql) and
>>> then explain me why they ever call fsync in their code, if you know
>>> how to do better with O_SYNC ;).
>> Doesn't sound like a very good idea to me.
>
> Why not a good idea to check any real life app?

I meant it is not a good idea to use fsync as you can't properly handle
errors.

>> The stalling is caused by cache pollution. Since you did not specify a
>> block size dd uses the base block size of the output disk. When
>> combined with sync, only one block is written at a time, and no more
>> until the first block has been flushed. Only then does dd send down
>> another block to write. Without dd the kernel is likely allowing many
>> mb to be queued in the buffer cache. Limiting output to one block at a
>> time is not good for throughput, but allowing half of ram to be used by
>> dirty pages is not good either.
>
> Throughput is perfect. I forgot to tell I combine it with ibs=4k
> obs=16M. Like it would be perfect with odirect too for the same
> reason. Stalling the I/O pipeline once every 16M isn't measurable in

Throughput is nowhere near perfect, as the pipeline is stalled for quite
some time. The pipe fills up quickly while dd is blocked on the sync
write, which then blocks tar until all 16 MB have hit the disk. Only
then does dd go back to reading from the tar pipe, allowing it to
continue. During the time it takes tar to archive another 16 MB of
data, the write queue is empty. The only time that the tar process gets
to continue running while data is written to disk is in the small time
it takes for the pipe ( 4 KB isn't it? ) to fill up.

>> The semantics of the two are very much the same; they only differ in the
>> internal implementation. As far as the caller is concerned, in both
>> cases, he is sure that writes are safe on the disk when they return, and
>> reads semantically are no different with either flag. The internal
>> implementations lead to different performance characteristics, and the
>> other post was simply commenting that the performance characteristics of
>> O_SYNC + madvise() is almost the same as O_DIRECT, or even better in
>> some cases ( since the data read may already be in cache ).
>
> The semantics mandates the implementation because the semantics make
> up for the performance expectations. For the same reason you shouldn't
> write 512bytes at time with O_SYNC you also shouldn't use O_SYNC if
> your device risks to create a bottleneck in the CPU and memory.

No, semantics have nothing to do with performance. Semantics deals with
the state of the machine after the call, not how quickly it got there.
Semantics is a question of correct operation, not optimal.

With both O_DIRECT and O_SYNC, the machine state is essentially the same
after the call: the data has hit the disk. Aside from the performance
difference, the application can not tell the difference between O_DIRECT
and O_SYNC, so if that performance difference can be resolved by
changing the implementation, Linus can be happy and get rid of O_DIRECT.


2007-01-30 19:54:50

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: O_DIRECT question

On Tue, Jan 30, 2007 at 01:50:41PM -0500, Phillip Susi wrote:
> It should return the number of bytes successfully written before the
> error, giving you the location of the first error. Using smaller
> individual writes (preferably issued in parallel) also allows the
> problem spot to be isolated.

When you have I/O errors during _writes_ (not reads!) the raid must
kick the disk out of the array before the OS ever notices. And if it's
software raid that you're using, the OS should kick out the disk
before your app ever notices any I/O error. When a write I/O error
happens, it's not a problem for the application to solve.

When the I/O error reaches the filesystem, you're lucky if the OS
doesn't crash (ext3 claims to handle it); if your app receives the I/O
error, all you should be doing is shutting things down gracefully,
sending all the errors you can to the admin.

It doesn't matter much where the error happened; all that matters is
that you didn't have a fault-tolerant raid setup (your fault), your
primary disk just died and you're now screwed(tm). If you could trust
that part of the disk is still sane you could perhaps attempt to avoid
a restore from the last backup; otherwise all you can do is the
equivalent of an e2fsck -f on the db metadata after copying what you
can still read to the new device.

The only time I got an I/O error on writes, about 1G of the disk was
gone; it was not very useful to know the first 512-byte region that
failed... unreadable and unwriteable. Every other time, writing to the
disk actually solved the read I/O error (they weren't write I/O errors
of course).

Now if you're careful enough you can track down which data generated
the I/O error by queuing the blocks that you write in between every
fsync. So you can still know if perhaps only the journal has generated
write I/O errors; in such a case you could tell the user that he can
copy the data files and let the journal be regenerated on the new
disk. I doubt it will help much in practice though (in such a case, I
would always restore from the last backup just in case).
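
To make that idea concrete, a rough sketch (the data structure and the
names are invented for illustration) of recording the ranges written
since the last fsync, so that a failing fsync can at least be
attributed to that set of blocks:

#include <stdio.h>
#include <unistd.h>

#define MAX_PENDING 128

/* One entry per write issued since the last successful fsync. */
struct pending_range {
        off_t  off;
        size_t len;
};

static struct pending_range pending[MAX_PENDING];
static int npending;

static ssize_t tracked_pwrite(int fd, const void *buf, size_t len, off_t off)
{
        ssize_t n = pwrite(fd, buf, len, off);

        if (n > 0 && npending < MAX_PENDING) {
                pending[npending].off = off;
                pending[npending].len = (size_t)n;
                npending++;
        }
        return n;
}

static int tracked_fsync(int fd)
{
        int i;

        if (fsync(fd) < 0) {
                /* The error belongs to (at most) the ranges queued above. */
                for (i = 0; i < npending; i++)
                        fprintf(stderr, "suspect range: offset %lld, %zu bytes\n",
                                (long long)pending[i].off, pending[i].len);
                return -1;
        }
        npending = 0;   /* everything written so far is on stable storage */
        return 0;
}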

> >>Typically you only want one sector of data to be written before you
> >>continue. In the cases where you don't, this might be nice, but as I
> >>said above, you can't handle errors properly.
> >
> >Sorry but you're dreaming if you're thinking anything in real life
> >writes at 512bytes at time with O_SYNC. Try that with any modern
> >harddisk.
>
> When you are writing a transaction log, you do; you don't need much
> data, but you do need to be sure it has hit the disk before continuing.
> You certainly aren't writing many mb across a dozen write() calls and
> only then care to make sure it is all flushed in an unknown order. When
> order matters, you can not use fsync, which is one of the reasons why
> databases use O_DIRECT; they care about the ordering.

Sorry, but as far as ordering is concerned, O_DIRECT, fsync and O_SYNC
offer exactly the same guarantees. Feel free to check the real-life
db code. Even bdb uses fsync.

> >>>Just grep for fsync in the db code of your choice (try postgresql) and
> >>>then explain me why they ever call fsync in their code, if you know
> >>>how to do better with O_SYNC ;).
> >>Doesn't sound like a very good idea to me.
> >
> >Why not a good idea to check any real life app?
>
> I meant it is not a good idea to use fsync as you can't properly handle
> errors.

See above.

> >>The stalling is caused by cache pollution. Since you did not specify a
> >>block size dd uses the base block size of the output disk. When
> >>combined with sync, only one block is written at a time, and no more
> >>until the first block has been flushed. Only then does dd send down
> >>another block to write. Without dd the kernel is likely allowing many
> >>mb to be queued in the buffer cache. Limiting output to one block at a
> >>time is not good for throughput, but allowing half of ram to be used by
> >>dirty pages is not good either.
> >
> >Throughput is perfect. I forgot to tell I combine it with ibs=4k
> >obs=16M. Like it would be perfect with odirect too for the same
> >reason. Stalling the I/O pipeline once every 16M isn't measurable in
>
> Throughput is nowhere near perfect, as the pipeline is stalled for quite
> some time. The pipe fills up quickly while dd is blocked on the sync
> write, which then blocks tar until all 16 MB have hit the disk. Only
> then does dd go back to reading from the tar pipe, allowing it to
> continue. During the time it takes tar to archive another 16 MB of
> data, the write queue is empty. The only time that the tar process gets
> to continue running while data is written to disk is in the small time
> it takes for the pipe ( 4 KB isn't it? ) to fill up.

Please try yourself, it's simple enough:

time dd if=/dev/hda of=/dev/null bs=16M count=100
time dd if=/dev/hda of=/dev/null bs=16M count=100 iflag=sync
time dd if=/dev/hda of=/dev/null bs=16M count=100 iflag=direct

if you can measure any slowdown in the sync/direct case you're welcome
(it runs faster here... as it should). The pipeline stall is not
measurable when it's so infrequent, and actually the pipeline stall is
not a big issue when the I/O is contiguous and the DMA commands are
always large.

aio is mandatory only while dealing with small buffers, especially
while seeking to take advantage of the elevator.

> No, semantics have nothing to do with performance. Semantics deals with
> the state of the machine after the call, not how quickly it got there.
> Semantics is a question of correct operation, not optimal.

This whole thing is about performance; if you remove performance
factors from the equation, you can stick to your O_SYNC
512-bytes-at-a-time journal design. You're perfectly right that when
you remove performance from the equation you can claim that O_DIRECT
is much the same as O_SYNC.

> With both O_DIRECT and O_SYNC, the machine state is essentially the same
> after the call: the data has hit the disk. Aside from the performance
> difference, the application can not tell the difference between O_DIRECT
> and O_SYNC, so if that performance difference can be resolved by
> changing the implementation, Linus can be happy and get rid of O_DIRECT.

Guess what, if O_SYNC could run as fast as O_DIRECT while still passing
through the pagecache, O_DIRECT wouldn't exist. You can't pretend to
describe the semantics of any kernel API if you remove performance
considerations from it. It must be some not-very-useful university
theory if they taught you that performance evaluation must not be
present in the semantics. If that's the case, it's best you stop
talking about semantics when you discuss any kernel APIs. A ton of
kernel APIs are all about improving performance, so they'll all be the
same if you only look at your performance-agnostic semantics; it's not
just O_DIRECT that would become the same as O_SYNC.

The thing I could imagine to avoid bypassing the pagecache would be
to make MAP_SHARED asynchronous with a new readahead AIO method,
combined with a trick to fill pagetables scattered all over the
address space of the task with a single entry in the kernel (like
mlock does but with SG and w/o pte pinning), and then you would need
to find a way to map ext3/4 pagecache using largepages but without
triggering 2M or 1G reads at every largepte fill. If you think about
it, you'll probably realize O_DIRECT+hugetlbfs is much cleaner and
probably still faster ;). And still, MAP_SHARED of ext4 wouldn't be the
same as MAP_SHARED of tmpfs, as it forbids building logical indexes in
place (you'd need to COW before you can modify the cache).

2007-01-30 20:03:35

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: O_DIRECT question

On Tue, Jan 30, 2007 at 08:57:20PM +0100, Andrea Arcangeli wrote:
> Please try yourself, it's simple enough:
>
> time dd if=/dev/hda of=/dev/null bs=16M count=100
> time dd if=/dev/hda of=/dev/null bs=16M count=100 iflag=sync

sorry, reading won't help much to exercise sync ;). But the direct
line is enough to show the effect of I/O pipeline stall. To
effectively test sync of course you want to write to a file instead
(unless you want to wipe out /dev/hda ;)

2007-01-30 23:07:24

by Phillip Susi

[permalink] [raw]
Subject: Re: O_DIRECT question

Andrea Arcangeli wrote:
> When you have I/O errors during _writes_ (not reads!) the raid must
> kick the disk out of the array before the OS ever notices. And if it's
> software raid that you're using, the OS should kick out the disk
> before your app ever notices any I/O error. When a write I/O error
> happens, it's not a problem for the application to solve.

I thought it obvious that we were talking about non-recoverable errors
that then DO make it to the application. And any kind of
mission-critical app most definitely does care about write errors. You
don't need your db completing the transaction when it was only half
recorded. It needs to know it failed so it can back out and/or recover
the data and record it elsewhere. You certainly don't want the users to
think everything is fine, walk away, and have the system continue to
limp on, making things worse by the second.

> When the I/O error reaches the filesystem, you're lucky if the OS
> doesn't crash (ext3 claims to handle it); if your app receives the I/O
> error, all you should be doing is shutting things down gracefully,
> sending all the errors you can to the admin.

If the OS crashes due to an IO error reading user data, then there is
something seriously wrong and beyond the scope of this discussion. It
suffices to say that due to the semantics of write() and sound
engineering practice, the application expects to be notified of errors
so it can try to recover, or fail gracefully. Whether it chooses to
fail gracefully as you say it should, or recovers from the error, it
needs to know that an error happened, and where it was.

> It doesn't matter much where the error happened; all that matters is
> that you didn't have a fault-tolerant raid setup (your fault), your
> primary disk just died and you're now screwed(tm). If you could trust
> that part of the disk is still sane you could perhaps attempt to avoid
> a restore from the last backup; otherwise all you can do is the
> equivalent of an e2fsck -f on the db metadata after copying what you
> can still read to the new device.

It most certainly matters where the error happened because "you are
screwed" is not an acceptable outcome in a mission-critical application.
A well-engineered solution will deal with errors as well as possible,
not simply give up and tell the user they are screwed because the
designer was lazy. There is a reason that read and write return the
number of bytes _actually_ transferred, and the application is supposed
to check that result to verify proper operation.

> Sorry, but as far as ordering is concerned, O_DIRECT, fsync and O_SYNC
> offer exactly the same guarantees. Feel free to check the real-life
> db code. Even bdb uses fsync.

No, there is a slight difference. An fsync() flushes all dirty buffers
in an undefined order. Using O_DIRECT or O_SYNC, you can control the
flush order because you can simply wait for one set of writes to
complete before starting another set that must not be written until
after the first are on the disk. You can emulate that by placing an
fsync between both sets of writes, but that will flush any other dirty
buffers whose ordering you do not care about. Also there is no aio
version of fsync.
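
To make the fsync-as-a-barrier emulation concrete, a minimal sketch
(the two write sets and their names are placeholders):

#include <unistd.h>

/*
 * Write set A, force it to disk, and only then write set B, so that B
 * can never reach the platter before A.  The fsync() in the middle is
 * the ordering barrier, but it also flushes every other dirty page of
 * this file, wanted or not.
 */
static int ordered_update(int fd, const void *a, size_t alen, off_t aoff,
                          const void *b, size_t blen, off_t boff)
{
        if (pwrite(fd, a, alen, aoff) != (ssize_t)alen)
                return -1;
        if (fsync(fd) < 0)
                return -1;
        return pwrite(fd, b, blen, boff) == (ssize_t)blen ? 0 : -1;
}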

> Please try yourself, it's simple enough:
>
> time dd if=/dev/hda of=/dev/null bs=16M count=100
> time dd if=/dev/hda of=/dev/null bs=16M count=100 iflag=sync
> time dd if=/dev/hda of=/dev/null bs=16M count=100 iflag=direct
>
> if you can measure any slowdown in the sync/direct case you're welcome
> (it runs faster here... as it should). The pipeline stall is not
> measurable when it's so infrequent, and actually the pipeline stall is
> not a big issue when the I/O is contiguous and the DMA commands are
> always large.
>
> aio is mandatory only while dealing with small buffers, especially
> while seeking to take advantage of the elevator.
>


sync has no effect on reading, so that test is pointless. direct saves
the cpu overhead of the buffer copy, but isn't good if the cache isn't
entirely cold. The large buffer size really has little to do with it,
rather it is the fact that the writes to null do not block dd from
making the next read for any length of time. If dd were blocking on an
actual output device, that would leave the input device idle for the
portion of the time that dd were blocked.

In any case, this is a totally different example from your previous one,
which had dd _writing_ to a disk, where it would block for long periods
of time due to O_SYNC, thereby preventing it from reading from the input
buffer in a timely manner. Because dd does not read the input pipe
frequently, the pipe becomes full and thus tar blocks. In that case the
large buffer size is actually a detriment, because with a smaller buffer
size dd would not be blocked as long, so it could empty the pipe more
frequently, allowing tar to block less.

> This whole thing is about performance; if you remove performance
> factors from the equation, you can stick to your O_SYNC
> 512-bytes-at-a-time journal design. You're perfectly right that when
> you remove performance from the equation you can claim that O_DIRECT
> is much the same as O_SYNC.

> Guess what, if O_SYNC could run as fast as O_DIRECT while still passing
> through the pagecache, O_DIRECT wouldn't exist. You can't pretend to
> describe the semantics of any kernel API if you remove performance
> considerations from it. It must be some not-very-useful university
> theory if they taught you that performance evaluation must not be
> present in the semantics. If that's the case, it's best you stop
> talking about semantics when you discuss any kernel APIs. A ton of
> kernel APIs are all about improving performance, so they'll all be the
> same if you only look at your performance-agnostic semantics; it's not
> just O_DIRECT that would become the same as O_SYNC.

You seem to have missed the point of this thread. Denis Vlasenko's
message that you replied to simply pointed out that they are
semantically equivalent, so O_DIRECT can be dropped provided that O_SYNC
+ madvise could be fixed to perform as well. Several people including
Linus seem to like this idea and think it is quite possible.

2007-01-31 02:26:06

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: O_DIRECT question

On Tue, Jan 30, 2007 at 06:07:14PM -0500, Phillip Susi wrote:
> It most certainly matters where the error happened because "you are
> screwed" is not an acceptable outcome in a mission-critical application.

An I/O error is not an acceptable outcome in a mission-critical app;
all mission-critical setups should be fault tolerant, so if raid
cannot recover at the first sign of error the whole system should
instantly go down and let the secondary take over from it. See slony
etc...

Trying to recover the recoverable by mucking around with the data,
making even _more_ writes on a failing disk before taking a physical
mirror image of the disk (the readable part), isn't a good idea IMHO.
At best you could retry writing on the same sector, hoping somebody
disconnected the scsi cable by mistake.

> A well-engineered solution will deal with errors as well as possible,
> not simply give up and tell the user they are screwed because the
> designer was lazy. There is a reason that read and write return the
> number of bytes _actually_ transferred, and the application is supposed
> to check that result to verify proper operation.

You can track the range where it happened with fsync too, as I said in
the previous email, and you can take the big database lock and then
read and rewrite every single block in that range until you find
the failing place if you really want to. A read-write in place should
be safe.
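
Roughly like this sketch (the block size and the helper name are
invented for illustration):

#include <unistd.h>

/*
 * Read each block in [start, end) back and rewrite it in place,
 * forcing it out with fsync; the first block whose read, rewrite or
 * flush fails is the one sitting on the bad area.  Returns that
 * offset, or -1 if the whole range is fine.
 */
static off_t find_bad_block(int fd, off_t start, off_t end, size_t blksz)
{
        char buf[4096];
        off_t off;

        if (blksz > sizeof(buf))
                return -1;

        for (off = start; off < end; off += (off_t)blksz) {
                if (pread(fd, buf, blksz, off) != (ssize_t)blksz)
                        return off;
                if (pwrite(fd, buf, blksz, off) != (ssize_t)blksz)
                        return off;
                if (fsync(fd) < 0)
                        return off;
        }
        return -1;
}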

> No, there is a slight difference. An fsync() flushes all dirty buffers
> in an undefined order. Using O_DIRECT or O_SYNC, you can control the
> flush order because you can simply wait for one set of writes to
> complete before starting another set that must not be written until
> after the first are on the disk. You can emulate that by placing an
> fsync between both sets of writes, but that will flush any other
> dirty

Doing fsync after every write will provide the same ordering
guarantee as O_SYNC; I thought it was obvious that this is what I meant
here.

The whole point is that most of the time you don't need it; you need
an fsync only after a couple of writes. All smtp servers use fsync for
the same reason: they also have to journal their writes to avoid losing
email when there is a power loss.

If you use writev or aio pwrite you can do well with O_SYNC too though.
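
For instance, a record header and its payload can be gathered into a
single synchronous write with writev (a sketch; the names are made up):

#include <sys/uio.h>
#include <unistd.h>

/*
 * With O_SYNC on logfd, header and payload reach stable storage
 * through one write call instead of two separate sync writes.
 */
static int log_record(int logfd, const void *hdr, size_t hlen,
                      const void *payload, size_t plen)
{
        struct iovec iov[2] = {
                { .iov_base = (void *)hdr,     .iov_len = hlen },
                { .iov_base = (void *)payload, .iov_len = plen },
        };

        return writev(logfd, iov, 2) == (ssize_t)(hlen + plen) ? 0 : -1;
}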

> buffers whose ordering you do not care about. Also there is no aio
> version of fsync.

please have a second look at aio_abi.h:

IOCB_CMD_FSYNC = 2,
IOCB_CMD_FDSYNC = 3,

there must be a reason why they exist, right?
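
For reference, those opcodes are reachable from userspace through the
libaio wrapper, roughly like this (a sketch with error paths trimmed;
link with -laio, and whether the underlying filesystem actually
implements the asynchronous fsync path is a separate question):

#include <libaio.h>

/*
 * Submit an asynchronous fsync (IOCB_CMD_FSYNC under the hood) and
 * wait for its completion event.
 */
static int aio_fsync_once(int fd)
{
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        int ret = -1;

        if (io_setup(1, &ctx) < 0)
                return -1;

        io_prep_fsync(&cb, fd);         /* io_prep_fdsync() maps to IOCB_CMD_FDSYNC */
        if (io_submit(ctx, 1, cbs) == 1 &&
            io_getevents(ctx, 1, 1, &ev, NULL) == 1 &&
            ev.res == 0)
                ret = 0;

        io_destroy(ctx);
        return ret;
}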

> sync has no effect on reading, so that test is pointless. direct saves
> the cpu overhead of the buffer copy, but isn't good if the cache isn't
> entirely cold. The large buffer size really has little to do with it,

direct bypasses the cache, so the cache is freezing, not just cold.

> rather it is the fact that the writes to null do not block dd from
> making the next read for any length of time. If dd were blocking on an
> actual output device, that would leave the input device idle for the
> portion of the time that dd were blocked.

The objective was to measure the pipeline stall; if you stall it for
another reason anyway, what's the point?

> In any case, this is a totally different example from your previous one,
> which had dd _writing_ to a disk, where it would block for long periods
> of time due to O_SYNC, thereby preventing it from reading from the input
> buffer in a timely manner. Because dd does not read the input pipe
> frequently, the pipe becomes full and thus tar blocks. In that case the
> large buffer size is actually a detriment, because with a smaller buffer
> size dd would not be blocked as long, so it could empty the pipe more
> frequently, allowing tar to block less.

It would run slower with a smaller buffer size because it would still
block, and it would read and write slower too. For my backup usage,
keeping tar blocked is actually a feature, since the load of the backup
decreases. What matters to me is the MB/sec of the writes and the
MB/sec of the reads (to lower the load); I don't care too much about
how long it takes as long as things run as efficiently as possible
while they run. The rate-limiting effect of the blocking isn't a
problem for me.

> You seem to have missed the point of this thread. Denis Vlasenko's
> message that you replied to simply pointed out that they are
> semantically equivalent, so O_DIRECT can be dropped provided that O_SYNC
> + madvise could be fixed to perform as well. Several people including
> Linus seem to like this idea and think it is quite possible.

I replied to that email to point out the fundamental differences
between O_SYNC and O_DIRECT. If you don't like what I said, I'm sorry,
but that's how things are running today, and I don't see it as quite
possible to change (unless of course we remove performance from the
equation, in which case indeed they'll be much the same).

Perhaps an IOCB_CMD_PREADAHEAD plus MAP_SHARED backed by largepages,
loaded with a new syscall that reads a piece at a time into the large
pagecache, could be an alternative design, or perhaps splice could
obsolete O_DIRECT. I just have a very hard time seeing how
O_SYNC+madvise could ever obsolete O_DIRECT.

2007-01-31 09:37:28

by Michael Tokarev

[permalink] [raw]
Subject: Re: O_DIRECT question

Phillip Susi wrote:
[]
> You seem to have missed the point of this thread. Denis Vlasenko's
> message that you replied to simply pointed out that they are
> semantically equivalent, so O_DIRECT can be dropped provided that O_SYNC
> + madvise could be fixed to perform as well. Several people including
> Linus seem to like this idea and think it is quite possible.

By the way, IF O_SYNC+madvise could be "fixed", can't O_DIRECT be implemented
internally using them?

I mean, during open(O_DIRECT), do open(O_SYNC) instead and call madvise()
appropriately....
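
A userspace approximation of that idea, as a sketch only: it has to use
posix_fadvise() rather than madvise(), since madvise() wants a mapping,
and a real in-kernel implementation would of course look different:

#include <fcntl.h>
#include <unistd.h>

/*
 * "Pseudo O_DIRECT": writes are synchronous thanks to O_SYNC, and the
 * pages are dropped from the pagecache afterwards so they don't
 * pollute it.  Unlike real O_DIRECT there is still one copy through
 * the pagecache.
 */
static int open_pseudo_direct(const char *path, int flags)
{
        return open(path, flags | O_SYNC);
}

static ssize_t pwrite_pseudo_direct(int fd, const void *buf, size_t len,
                                    off_t off)
{
        ssize_t n = pwrite(fd, buf, len, off);

        if (n > 0)
                posix_fadvise(fd, off, n, POSIX_FADV_DONTNEED);
        return n;
}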

/mjt

2007-02-06 20:39:31

by Pavel Machek

[permalink] [raw]
Subject: Re: O_DIRECT question

Hi!

> > > Which shouldn't be true. There is no fundamental reason why
> > > ordinary writes should be slower than O_DIRECT.
> >
> > Again, there IS a reason: O_DIRECT eliminates the cpu overhead of the
> > kernel-user copy,
>
> You assume that ordinary read()/write() is *required* to do the copying.
> It doesn't. Kernel is allowed to do direct DMAing in this case too.

The kernel is allowed to, but it is practically impossible to code. It
would require slow MMU magic.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html