LinuxLists.cc - Re: [PATCH 0/6] cpuset aware writeback

2007-09-12 01:32:59

Subject: Re: [PATCH 0/6] cpuset aware writeback

Perform writeback and dirty throttling with awareness of cpuset mem_allowed.

The theory of operation has two primary elements:

1. Add a nodemask per mapping which indicates the nodes
which have set PageDirty on any page of the mappings.

2. Add a nodemask argument to wakeup_pdflush() which is
propagated down to sync_sb_inodes.

This leaves sync_sb_inodes() with two nodemasks. One is passed to it and
specifies the nodes the caller is interested in syncing, and will either
be null (i.e. all nodes) or will be cpuset_current_mems_allowed in the
caller's context.

The second nodemask is attached to the inode's mapping and shows who has
modified data in the inode. sync_sb_inodes() will then skip syncing of
inodes if the nodemask argument does not intersect with the mapping
nodemask.
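The skip decision described above can be sketched in userspace C. This is only a model under stated assumptions: nodemask_model_t, struct mapping_model and should_sync_inode() are hypothetical names, and a single 64-bit word stands in for the kernel's nodemask_t. It is not the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: one bit per node, at most 64 nodes. */
typedef uint64_t nodemask_model_t;

/* Hypothetical stand-in for the per-mapping state added by the patchset. */
struct mapping_model {
	nodemask_model_t dirty_nodes;	/* nodes that have dirtied a page here */
};

/*
 * Decide whether writeback should visit this mapping.
 * wanted == NULL models the "all nodes" case; otherwise the mapping is
 * skipped unless its dirty nodes intersect the caller's nodes.
 */
static bool should_sync_inode(const struct mapping_model *mapping,
			      const nodemask_model_t *wanted)
{
	assert(mapping != NULL);

	if (wanted == NULL)
		return true;

	return (mapping->dirty_nodes & *wanted) != 0;
}
```

A caller restricted to node 0 would skip a mapping dirtied only from nodes 1 and 2, while a caller passing NULL would always sync it.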

cpuset_current_mems_allowed will be passed in to pdflush
background_writeout by try_to_free_pages and balance_dirty_pages.
balance_dirty_pages also passes the nodemask in to writeback_inodes
directly when doing active reclaim.

Other callers do not limit inode writeback, passing in a NULL nodemask
pointer.

A final change is to get_dirty_limits. It takes a nodemask argument, and
when it is null there is no change in behavior. If the nodemask is set,
page statistics are accumulated only for specified nodes, and the
background and throttle dirty ratios will be read from a new per-cpuset
ratio feature.
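A rough userspace sketch of that behavior follows. It assumes the limits are a simple percentage of the accumulated page count, which is a simplification of the real calculation, and every name in it (get_dirty_limits_model, struct ratios_model and so on) is hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_NODES 8	/* hypothetical small node count for the sketch */

struct node_stats_model {
	unsigned long dirtyable_pages;	/* pages on this node that may be dirtied */
};

struct ratios_model {
	unsigned int background_ratio;	/* percent */
	unsigned int throttle_ratio;	/* percent */
};

struct dirty_limits_model {
	unsigned long background;	/* pages */
	unsigned long throttle;		/* pages */
};

/*
 * nodes == NULL: sum every node and use the global ratios (unchanged
 * behavior). Otherwise sum only the nodes in the mask and use the
 * per-cpuset ratios.
 */
static struct dirty_limits_model
get_dirty_limits_model(const struct node_stats_model stats[MODEL_MAX_NODES],
		       const uint64_t *nodes,
		       struct ratios_model global,
		       struct ratios_model cpuset)
{
	struct ratios_model r = nodes ? cpuset : global;
	struct dirty_limits_model out;
	unsigned long pages = 0;
	int n;

	assert(stats != NULL);

	for (n = 0; n < MODEL_MAX_NODES; n++)
		if (nodes == NULL || (*nodes & (UINT64_C(1) << n)))
			pages += stats[n].dirtyable_pages;

	out.background = pages * r.background_ratio / 100;
	out.throttle = pages * r.throttle_ratio / 100;
	return out;
}
```

With four nodes of 1000 pages each, a NULL mask yields limits computed over 4000 pages, while a mask covering two nodes yields limits computed over 2000 pages with the cpuset's own ratios.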

For testing I did a variety of basic tests, verifying individual
features of the patchset. To verify that it fixes the core problem, I
created a stress test which involved using cpusets and mems_allowed
to split memory so that all daemons had memory set aside for them, and
my memory stress test had a separate set of memory. The stress test
mmaps 7GB of a very large file on disk. It then scans the entire 7GB
of memory, reading and modifying each byte. 7GB is more than the amount
of physical memory made available to the stress test.
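The stress test program itself is not included in this posting. A minimal sketch of the access pattern it describes, assuming a shared file mapping and a hypothetical helper named dirty_mapped_file(), might look like this:

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map the first len bytes of path with MAP_SHARED, then read and modify
 * every byte so that each touched page becomes dirty and must eventually
 * be written back. Returns 0 on success, -1 on failure.
 */
static int dirty_mapped_file(const char *path, size_t len)
{
	unsigned char *p;
	size_t i;
	int fd;

	assert(len > 0);	/* mmap() rejects zero-length mappings */

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);		/* the mapping stays valid after close() */
	if (p == MAP_FAILED)
		return -1;

	for (i = 0; i < len; i++)
		p[i] = (unsigned char)(p[i] + 1);

	return munmap(p, len);
}
```

To reproduce the scenario above, len would be chosen larger than the memory available to the cpuset running the program.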

Using iostat I can see the initial period of reading from disk, followed
by a period of simultaneous reads and writes as dirty bytes are pushed
to make room for new reads.

In a separate log-in, in the other cpuset, I am running:

while `true`; do date | tee -a date.txt; sleep 5; done

date.txt resides on the same disk as the large file mentioned above. The
above while-loop serves the dual purpose of providing me visual clues of
progress along with the opportunity for the "tee" command to become
throttled writing to the disk.

The effect of this patchset is straightforward. Without it there are
long hangs between appearances of the date. With it the dates are all 5
(or sometimes 6) seconds apart.

I also added printks to the kernel to verify that, without these
patches, the tee was being throttled (along with lots of other things),
and with the patch only pdflush is being throttled.

These patches are mostly unchanged from Chris Lameter's original
changelist posted previously to linux-mm.

2007-09-12 01:37:23

On Tue, 18 Sep 2007, Ethan Solomita wrote:

> > Does it have to be atomic? atomic is weak and can fail.
> >
> > If some callers can do GFP_KERNEL and some can only do GFP_ATOMIC then we
> > should at least pass the gfp_t into this function so it can do the stronger
> > allocation when possible.
>
> I was going to say that sanity would be improved by just allocing the
> nodemask at inode alloc time. A failure here could be a problem because
> below cpuset_intersects_dirty_nodes() assumes that a NULL nodemask
> pointer means that there are no dirty nodes, thus preventing dirty pages
> from getting written to disk. i.e. This must never fail.

Hmmm. Shouldn't it assume that there is no tracking, and thus that any
node can be dirty? Match by default?

> Given that we allocate it always at the beginning, I'm leaning towards
> just allocating it within mapping no matter its size. It will make the
> code much much simpler, and save me writing all the comments we've been
> discussing. 8-)
>
> How disastrous would this be? Is the need to support a 1024 node system
> with 1,000,000 open mostly-read-only files thus needing to spend 120MB
> of extra memory on my nodemasks a real scenario and a showstopper?

Consider that a 1024 node system has more than 4TB of memory. If that
system is running as a fileserver then you get into some issues. But then
120MB is not that big of a deal. It's more the cache footprint issue I
would think. Having a NULL there avoids touching a 128 byte nodemask. I
think your approach should be fine.
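The numbers in this exchange can be checked with a few lines of C. The sketch assumes the mask is a bitmap of unsigned long words with one bit per node; nodemask_bytes() and nodemask_overhead() are hypothetical helper names.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Bytes for a one-bit-per-node bitmap, rounded up to whole unsigned longs. */
static size_t nodemask_bytes(size_t max_nodes)
{
	size_t bits_per_long = sizeof(unsigned long) * CHAR_BIT;

	assert(max_nodes > 0);
	return (max_nodes + bits_per_long - 1) / bits_per_long
		* sizeof(unsigned long);
}

/* Total bytes if every one of `mappings` address spaces embeds a mask. */
static unsigned long long nodemask_overhead(size_t max_nodes,
					    unsigned long long mappings)
{
	return (unsigned long long)nodemask_bytes(max_nodes) * mappings;
}
```

For 1024 nodes the mask is 128 bytes, and 1,000,000 mappings cost 128,000,000 bytes, which is about 122 MiB and matches the rough 120MB figure quoted above.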

> >> +void cpuset_clear_dirty_nodes(struct address_space *mapping)
> >> +{
> >> +	nodemask_t *nodes = mapping->dirty_nodes;
> >> +
> >> +	if (nodes) {
> >> +		mapping->dirty_nodes = NULL;
> >> +		kfree(nodes);
> >> +	}
> >> +}
> >
> > Can this race with cpuset_update_dirty_nodes()? And with itself? If not,
> > a comment which describes the locking requirements would be good.
>
> I'll add a comment. Such a race should not be possible. It is called
> only from clear_inode() which is used when the inode is being freed
> "with extreme prejudice" (from its comments). I can add a check that
> i_state I_FREEING is set. Would that do?

There is already a comment saying that it cannot happen.

2007-09-19 17:08:42

by Christoph Lameter

Subject: Re: [PATCH 1/6] cpuset write dirty map

On Tue, 18 Sep 2007, Andrew Morton wrote:

> How hard would it be to handle the allocation failure in a more friendly
> manner? Say, if the allocation failed then point mapping->dirty_nodes at
> some global all-ones nodemask, and then special-case that nodemask in the
> freeing code?

Ack. However, the situation dirty_nodes == NULL && inode dirty then means
that unknown nodes are dirty. If we are later successful with the
alloc and we know that the pages are dirty in the mapping then the initial
dirty_nodes must be all ones. If this is the first page to be dirtied then
we can start with a dirty_nodes mask of all zeros like now.
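The rules in this reply can be modeled in userspace as follows. This is a sketch under assumptions, not the eventual patch: the struct, the function names and the simulate_alloc_failure parameter are all hypothetical, and malloc() stands in for the kernel allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t nodemask_model_t;	/* hypothetical: one bit per node */

struct mapping_model {
	nodemask_model_t *dirty_nodes;	/* NULL: dirty nodes are not tracked */
	bool has_dirty_pages;
};

/* Record that a page of this mapping was dirtied from `node`. */
static void note_page_dirtied(struct mapping_model *m, unsigned int node,
			      bool simulate_alloc_failure)
{
	assert(node < 64);

	if (m->dirty_nodes == NULL && !simulate_alloc_failure) {
		nodemask_model_t *mask = malloc(sizeof(*mask));

		if (mask != NULL) {
			/*
			 * Pages dirtied while untracked could be on any
			 * node, so start from all ones in that case and
			 * from all zeros for the first dirty page.
			 */
			*mask = m->has_dirty_pages ? ~UINT64_C(0) : 0;
			m->dirty_nodes = mask;
		}
	}

	if (m->dirty_nodes != NULL)
		*m->dirty_nodes |= UINT64_C(1) << node;
	m->has_dirty_pages = true;
}

/* NULL mask on a dirty mapping means unknown nodes: match by default. */
static bool intersects_dirty_nodes(const struct mapping_model *m,
				   nodemask_model_t wanted)
{
	if (m->dirty_nodes == NULL)
		return m->has_dirty_pages;

	return (*m->dirty_nodes & wanted) != 0;
}
```

The point of the model is that a failed allocation never hides dirty pages from writeback: the mapping either matches every caller or carries a mask that is a superset of the nodes actually dirtied.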