2008-03-19 19:45:12

by Daniel Phillips

Subject: [ANNOUNCE] ddtree: A git kernel tree for storage servers

Hi all,

I have set up a new git tree to better serve the needs of those
interested in advanced storage applications and development.

This ddtree git repository aims to provide a congenial forum for
development of forward looking storage features such as replication and
clustering; and to provide improved kernels for those who consider it
important that their storage servers run efficiently under heavy load
without deadlocking.

What will be in this ddtree?

* Block layer deadlock fixes (Status: production)
* bio allocation optimizations (Status: functional)
* bio support for stacking block devices (Status: functional)
* vm dirty limit eradication (Status: prototype)
* vm dirty rate balancing (Status: prototype)
* ddlink generic device driver control library (Status: functional)
* ddsetup device mapper frontend rewrite (Status: incomplete)
* ddman kernel cluster harness (Status: upcoming)
* ddraid distributed raid (Status: prototype)
* ddsnap replicating snapshot device (Status: alpha)

Patch set tracking

One task that git does not support well at all is maintaining the
identity of patches and patch sets. This is no doubt due to the fact
that Graydon Hoare[1] never implemented the second of my two
suggestions for improving Monotone's database schema[2], which is to
say that patches and patch sets should be first class, versioned
objects in the revision control database. One could fairly say that
git caters more to maintainers than submitters, the latter being
largely left to their own devices when it comes to splitting deltas up
into the modular form wanted for peer review. My partial solution to
this deficiency is to embed the interesting patches in a directory
called "patches", each named in such a way that:

ls patches/* | sort -r | xargs cat | patch -p1 -RE

will reverse them, and:

cat patches/* | patch -p1

will re-apply them. This is similar to the way Quilt works, less its
series file, which is replaced by a naming convention that is obvious
from inspection. No doubt I should really begin using Quilt, but I can
always learn that art later. For now, the important thing is to carry
along patch identities in a way that makes life easier for me.

At present, ddtree only carries six patches, all in its main "dd"
branch. These patches are based off of the "ddbase" branch, which is
in turn derived from either a two-dot or a three-dot stable kernel
release. Thus, I intend to track Linus's tree at coarse intervals and
selected stable releases at finer intervals, which will most likely
coincide with significant distributor branch points such as that of
Ubuntu Hardy (long term stable server release).

So for today:

$ tree patches
patches
|-- bio-alloc
|-- bio-alloc-hide-endio
|-- bio-alloc-stack
|-- bio-alloc-stack-reduce-dm-allocs
|-- ddlink
`-- ddlink-ddsetup

Other patches expected to land here over the next few days:

* bio.throttle (avoid bio write deadlock)
* ddsnap (snapshots and replication)
* ramback (backing store for ramdisks)

I am still learning git and developing my workflow, so it will take a
few days for that to settle down, during which period I will tear down
and rebase the content several times. Currently I have very limited
bandwidth available, so please be gentle and avoid cloning: just clone
Linus's tree and pull into that instead. (hpa...?)

To browse ddtree:

http://phunq.net/ddtree

To pull from ddtree:

http://phunq.net/git/ddtree (please do not clone for now)

For now there is no git:// protocol access because git-daemon manifests
some strange issue I have not yet had time to track down. The symptom
is this:

$ tail /var/log/git-daemon/current
2008-03-19_03:49:26.67922 [1068] Request upload-pack for '/ddtree'
2008-03-19_03:49:26.68142 fatal: packfile ./objects/pack/pack-d37a0c64e9ce1c8b29ad9c02a39636ca9c609c31.pack cannot be mapped.

Invitation

Anybody who wants to participate in the ongoing design, development and
debugging of lvm3, among other things: we hang out on irc.oftc.net
#zumastor. Everybody welcome, and see http://zumastor.org.

Regards,

Daniel

[1] Graydon Hoare: see quicksort, then see grandfather.

[2] Fortunately, Graydon did implement the first suggestion, that
directories should become first class versioned objects, thus setting
the stage for the development of git[3].

[3] Monotone, http://en.wikipedia.org/wiki/Monotone_%28software%29


2008-03-19 22:00:56

by Mike Snitzer

Subject: Re: [ANNOUNCE] ddtree: A git kernel tree for storage servers

On 3/19/08, Daniel Phillips wrote:
> Hi all,
>
> I have set up a new git tree to better serve the needs of those
> interested in advanced storage applications and development.

Thanks for formalizing a repo for all the interesting work you do.

> This ddtree git repository aims to provide a congenial forum for
> development of forward looking storage features such as replication and
> clustering; and to provide improved kernels for those who consider it
> important that their storage servers run efficiently under heavy load
> without deadlocking.
>
> What will be in this ddtree?
>
> * Block layer deadlock fixes (Status: production)

Do you happen to have a pointer to where these block layer deadlock
fixes are? Or will you be committing them shortly?

regards,
Mike

2008-03-19 22:17:56

by Daniel Phillips

Subject: Re: [ANNOUNCE] ddtree: A git kernel tree for storage servers

On Wednesday 19 March 2008 13:23, Mike Snitzer wrote:
> > * Block layer deadlock fixes (Status: production)
>
> Do you happen to have a pointer to where these block layer deadlock
> fixes are? Or will you be committing them shortly?

Hi Mike,

Committing shortly. They also live here:

http://code.google.com/p/zumastor/source/browse/trunk/ddsnap/patches/2.6.24.2/bio.throttle.patch
...etc...

Regards,

Daniel

2008-03-20 00:15:27

by Daniel Phillips

Subject: Re: [ANNOUNCE] ddtree: A git kernel tree for storage servers

On Wednesday 19 March 2008 13:23, Mike Snitzer wrote:
> > * Block layer deadlock fixes (Status: production)
>
> Do you happen to have a pointer to where these block layer deadlock
> fixes are? Or will you be committing them shortly?

Hi Mike,

OK, this is committed now, but caveat: improved, untested except for
booting. But what could possibly go wrong? :-/

http://phunq.net/ddtree?p=ddtree/.git;a=blob;f=patches/bio-throttle

The production version is sitting in the code.google.com svn repository
in ddsnap/patches/2.6.23.8. That one has a known bug that has somehow
escaped being stomped with a new commit, since it only manifests if you
stack one stacking block device on top of another one. I will post here
when we have an official, torture tested version of the patch.

The patch above is improved from the most recently posted version by
using the ->bi_max_vecs field for throttle accounting instead of
calling out to a per-driver metric. This works nicely because the
max_vecs field cannot change during the life of the bio, and it gives
a decent upper bound on the resource consumption of the bio, better
than simply counting bios in flight. The queue->metric() method is
still in there as a stub; there is more cleanup to do (and further
shrinking of the patch). The stub does no harm.

This improvement shrinks the throttled version of struct bio by 4
bytes.
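
The accounting described above can be modelled in a few lines of
userspace C. This is only a sketch under stated assumptions, not the
code in the patch: struct model_bio, struct model_queue, the limit
field and both function names are hypothetical, and max_vecs merely
plays the role of ->bi_max_vecs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins, not the kernel's struct bio or queue. */
struct model_bio {
    unsigned short max_vecs;  /* role of bi_max_vecs: fixed at allocation */
};

struct model_queue {
    unsigned int limit;       /* assumed cap on units in flight */
    unsigned int inflight;    /* units currently charged to the queue */
};

/*
 * Charge a bio against the queue before submission.
 * Returns false when the submitter would have to wait for completions.
 */
static bool throttle_charge(struct model_queue *q, const struct model_bio *bio)
{
    if (q->inflight + bio->max_vecs > q->limit)
        return false;
    q->inflight += bio->max_vecs;
    return true;
}

/*
 * Release the charge when the bio completes. The amount is recomputed
 * from max_vecs, which does not change during the life of the bio.
 */
static void throttle_release(struct model_queue *q, const struct model_bio *bio)
{
    assert(q->inflight >= bio->max_vecs);
    q->inflight -= bio->max_vecs;
}
```

Because the charge is derived from a value that stays fixed for the
life of the bio, the release side can recompute it rather than read
back a stored copy, which may be where the 4 byte saving comes from.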

Regards,

Daniel

2008-03-20 00:33:44

by Mike Snitzer

Subject: Re: [ANNOUNCE] ddtree: A git kernel tree for storage servers

On Wed, Mar 19, 2008 at 7:33 PM, Daniel Phillips wrote:
> On Wednesday 19 March 2008 13:23, Mike Snitzer wrote:
>
> > > * Block layer deadlock fixes (Status: production)
> >
> > Do you happen to have a pointer to where these block layer deadlock
> > fixes are? Or will you be committing them shortly?
>
> Hi Mike,
>
> OK, this is committed now, but caveat: improved, untested except for
> booting. But what could possibly go wrong? :-/
>
> http://phunq.net/ddtree?p=ddtree/.git;a=blob;f=patches/bio-throttle
>
> The production version is sitting in the code.google.com svn repository
> in ddsnap/patches/2.6.23.8. That one has a known bug that has somehow
> escaped being stomped with a new commit, since it only manifests if you
> stack one stacking block device on top of another one. I will post here
> when we have an official, torture tested version of the patch.

You mean like an LVM2 LV on top of MD? Or stacking purely DM-based
devices (maybe an LVM2 LV on top of mpath, or dm-crypt on LVM2)?

> The patch above is improved from the most recently posted version by
> using the ->bi_max_vecs field for throttle accounting instead of
> calling out to a per-driver metric. This works nicely because the
> max_vecs field cannot change during the life of the bio, and it gives
> a decent upper bound on the resource consumption of the bio, better
> than simply counting bios in flight. The queue->metric() method is
> still in there as a stub, some more cleanup to do there (and further
> shrinking of the patch). It does no harm.
>
> This improvement shrinks the throttled version of struct bio by 4
> bytes.

Cool, so I looked briefly at the ddsnap DM target some time ago and
saw that it needed to take special care to leverage this particular
throttle (I think this was the per-driver metric?). My memory is
fuzzy on that but what I'm wondering is how "general" is this new
patch? Do additional steps need to be taken to be able to _really_
guarantee devices won't deadlock?

I typically use dm-linear devices built on MD (raid1 w/ one member
being remote via nbd). The per-bdi dirty writeback accounting has
proven useful but I've recently hit a nasty livelock when the bdi
accounting for a device no longer enables writeback progress to be
made, e.g.:

BdiWriteback: 0 kB
BdiReclaimable: 321408 kB
BdiDirtyThresh: 316364 kB
DirtyThresh: 381284 kB
BackgroundThresh: 190640 kB

With an all too familiar trace like the following:
..
[<ffffffff8044cda6>] io_schedule_timeout+0x4b/0x79
[<ffffffff80271371>] congestion_wait+0x66/0x80
[<ffffffff802457bd>] autoremove_wake_function+0x0/0x2e
[<ffffffff8026c64d>] balance_dirty_pages_ratelimited_nr+0x21d/0x2b1
[<ffffffff80268191>] generic_file_buffered_write+0x5f3/0x711
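
The figures above can be read against the per-bdi dirty limit. Below
is a minimal sketch in C, assuming (as a simplification, not the
kernel's actual code) that a writer is held back while reclaimable plus
writeback pages exceed the per-bdi threshold. The struct and function
names are hypothetical; the values are copied from the report.

```c
#include <assert.h>
#include <stdbool.h>

/* Per-bdi counters in kB, as printed in the report above. */
struct bdi_sample {
    unsigned long writeback_kb;
    unsigned long reclaimable_kb;
    unsigned long dirty_thresh_kb;
};

static const struct bdi_sample reported = {
    .writeback_kb    = 0,
    .reclaimable_kb  = 321408,
    .dirty_thresh_kb = 316364,
};

/* How far the bdi is over its threshold, or 0 if it is under. */
static unsigned long excess_kb(const struct bdi_sample *s)
{
    assert(s);
    unsigned long used = s->reclaimable_kb + s->writeback_kb;
    return used > s->dirty_thresh_kb ? used - s->dirty_thresh_kb : 0;
}

/* Assumed rule: the writer waits while the bdi is over its threshold. */
static bool writer_must_wait(const struct bdi_sample *s)
{
    return excess_kb(s) > 0;
}

/* With nothing under writeback, no completion will lower the total. */
static bool has_writeback_in_flight(const struct bdi_sample *s)
{
    assert(s);
    return s->writeback_kb > 0;
}
```

With the reported values, reclaimable plus writeback is 321408 kB
against a threshold of 316364 kB, an excess of 5044 kB, and with
BdiWriteback at 0 kB nothing is in flight whose completion would bring
the total back under the threshold.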

I'm _hoping_ your simple/elegant patch can enable me to drop my 2.6.22
per-bdi backport and all will be right with the world.

What do you think?

Mike

2008-03-20 00:54:37

by Mike Snitzer

Subject: Re: [ANNOUNCE] ddtree: A git kernel tree for storage servers

On Wed, Mar 19, 2008 at 8:07 PM, Mike Snitzer wrote:

> I typically use dm-linear devices built on MD (raid1 w/ one member
> being remote via nbd). The per-bdi dirty writeback accounting has
> proven useful but I've recently hit a nasty livelock when the bdi
> accounting for a device no longer enables writeback progress to be
> made, e.g.:
>
> BdiWriteback: 0 kB
> BdiReclaimable: 321408 kB
> BdiDirtyThresh: 316364 kB
> DirtyThresh: 381284 kB
> BackgroundThresh: 190640 kB
>
> With an all too familiar trace like the following:
> ..
> [<ffffffff8044cda6>] io_schedule_timeout+0x4b/0x79
> [<ffffffff80271371>] congestion_wait+0x66/0x80
> [<ffffffff802457bd>] autoremove_wake_function+0x0/0x2e
> [<ffffffff8026c64d>] balance_dirty_pages_ratelimited_nr+0x21d/0x2b1
> [<ffffffff80268191>] generic_file_buffered_write+0x5f3/0x711
>
> I'm _hoping_ your simple/elegant patch can enable me to drop my 2.6.22
> per-bdi backport and all will be right with the world.

Turns out my writeback woes were actually in... writeback ;)

I had backported Fengguang Wu's writeback work from 2.6.24 to 2.6.22
with the following commits/patches:

fix time ordering of the per superblock inode lists:
6610a0bc8dcc120daa1d93807d470d5cbf777c39
9852a0e76cd9c89e71f84e784212fdd7a97ae93a
f57b9b7b4f68e1723ca99381dc10c8bc07d6df14
c986d1e2a460cbce79d631c51519ae82c778c6c5
1b43ef91d40190b16ba10218e66d5c2c4ba11de3
c6945e77e477103057b4a639b4b01596f5257861
65cb9b47e0ea568a7a38cce7773052a6ea093629
670e4def6ef5f44315d62748134e535b479c784f
2c1365791048e8aff42138ed5f6040b3c7824a69

fix periodic superblock dirty inode flushing:
0e0f4fc22ece8e593167eccbb1a4154565c11faa

remove pages_skipped accounting in __block_write_full_page():
1f7decf6d9f06dac008b8d66935c0c3b18e564f9

speed up writeback of big dirty files:
http://lkml.org/lkml/2008/1/17/49

Given that all but the last lkml.org patch reference went into 2.6.24,
I'll revisit the impact these changes have on my writeback progress
when I move to 2.6.24+.

Mike

2008-03-21 14:05:46

by Pavel Machek

Subject: Re: [ANNOUNCE] ddtree: A git kernel tree for storage servers

On Wed 2008-03-19 00:02:24, Daniel Phillips wrote:
> Hi all,
>
> I have set up a new git tree to better serve the needs of those
> interested in advanced storage applications and development.
>
> This ddtree git repository aims to provide a congenial forum for
> development of forward looking storage features such as replication and
> clustering; and to provide improved kernels for those who consider it
> important that their storage servers run efficiently under heavy load
> without deadlocking.
>
> What will be in this ddtree?
>
> * Block layer deadlock fixes (Status: production)

Sounds like something that should be sent to Linus, not hidden in a git
tree somewhere?
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-03-21 18:14:44

by Daniel Phillips

Subject: Re: [ANNOUNCE] ddtree: A git kernel tree for storage servers

On Friday 21 March 2008 05:49, Pavel Machek wrote:
> > What will be in this ddtree?
> >
> > * Block layer deadlock fixes (Status: production)
>
> Sounds like something that should be sent to Linus, not hidden in a git
> tree somewhere?

Various other artists think they have the One True Solution and I got
tired of arguing, so I thought I would just make it available in my
tree and when folks realize they need it they know where to get it.

Trying not to hide it...

Daniel