Hi,
I'm wondering if someone can tell me why sync_all_inodes() is called in
prune_icache().
sync_all_inodes() can cause problems in some situations when memory is
short and shrink_icache_memory() is called.
For instance, when the system is really short of memory,
do_try_to_free_pages() is invoked (either by an application or by kswapd) and
shrink_icache_memory() is also invoked, but when prune_icache() is called,
the first thing it does is sync_all_inodes(). If the inode block is not
in memory, it may have to bread() the inode block in, so kswapd() can
block until the inode block is brought into memory. Not only that, since
the system is short of memory, there may not even be memory available for
the inode block. Even if there is, given that there is only a single kswapd
thread doing sync_all_inodes(), if the dirty inode list is relatively long
(tens of thousands of inodes, as in something like SPEC SFS), it'll take
practically forever for sync_all_inodes() to finish. To the user, this looks
like the system is hung (although it isn't really). It's just taking a
looooooong time to do shrink_icache_memory()!
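For reference, the relevant part of prune_icache() in my test9 tree looks
roughly like this (paraphrased from memory, not a literal copy of fs/inode.c):

static void prune_icache(int goal)
{
        struct inode *inode;
        /* ... */

        /*
         * This is the call in question: before we even start looking for
         * unused inodes to free, every dirty inode in the system gets
         * written back, which may mean a bread() per inode.
         */
        sync_all_inodes();

        /*
         * ... only then do we scan the unused list, collect up to 'goal'
         * freeable inodes, and dispose of them ...
         */
}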
One solution to this is not to call sync_all_inodes() at all in
prune_icache(), since other parts of the kernel, like kupdate() will also
try to sync_inodes periodically anyway, but I don't know if this has other
implications or not. I don't see a problem with this myself. In fact, I
have been using this fix in my own test9 kernel, and I get much smoother
kernel behavior when running high-load SPEC SFS than with the default
prune_icache(). Actually if sync_all_inodes() is called, SPEC SFS sometimes
simply fails due to the long response time on the I/O requests.
The same theory goes for the kupdate() daemon. That is, since there is only
a single thread that does the inode and buffer flushing, under high load,
kupdate() would not get a chance to call flush_dirty_buffers() until after
sync_inodes() is completed. But sync_inodes() can take forever since inodes
are flushed serially to disk. Imagine how long it might take if each inode
flush causes a read from disk! In my experience with SPEC SFS, sometimes,
if kupdate() is invoked during the SPEC SFS run, it simply cannot finish
sync_inodes() until the entire benchmark run is finished! So, all the dirty
buffers that flush_dirty_buffers(1) is supposed to flush never get flushed
during the benchmark run, and the system constantly runs in bdflush() mode,
which is really supposed to kick in only in a panic situation!
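To make the ordering concrete, each kupdate() wakeup effectively does
something like the following (paraphrased; if I remember right the real code
is in sync_old_buffers() in fs/buffer.c, so details may differ):

/* roughly what one kupdate() wakeup does, in order; the name is made up */
static void kupdate_one_pass(void)
{
        sync_supers(0);                 /* write out dirty superblocks */
        sync_inodes(0);                 /* write back ALL dirty inodes, one by one */
        flush_dirty_buffers(1);         /* only reached after the above complete */
        run_task_queue(&tq_disk);       /* kick the queued I/O */
}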
Again, the solution can be simple: one can create multiple
dirty-buffer-flushing daemon threads that call flush_dirty_buffers()
without the sync_supers()/sync_inodes() stuff. I have done so in my own test9
kernel, and the results with SPEC SFS are much more pleasant.
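A minimal sketch of what I mean (names like flushd() and NR_FLUSH_THREADS
are made up for illustration; my actual patch is organized differently):

#define NR_FLUSH_THREADS 4      /* made-up number, tune per machine */

static int flushd(void *unused)
{
        for (;;) {
                /*
                 * Flush old dirty buffers, but skip the expensive
                 * sync_supers()/sync_inodes() pass entirely.
                 */
                flush_dirty_buffers(1);
                run_task_queue(&tq_disk);

                /* sleep for roughly the usual kupdate interval */
                current->state = TASK_INTERRUPTIBLE;
                schedule_timeout(5 * HZ);
        }
        return 0;
}

static void start_flush_threads(void)
{
        int i;

        for (i = 0; i < NR_FLUSH_THREADS; i++)
                kernel_thread(flushd, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGHAND);
}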
Ying
On Sat, Nov 11, 2000 at 11:01:25AM -0800, Ying Chen/Almaden/IBM wrote:
> try to sync_inodes periodically anyway, but I don't know if this has other
> implications or not. I don't see a problem with this myself. In fact, I
The only implication of not running sync_all_inodes() from prune_icache() is
that we swap out more stuff instead of freeing the icache (no risk of
crashing, I mean).
However I'm wondering what this SPEC SFS benchmark is doing to trigger
the icache shrink often.
> prune_icache(). Actually if sync_all_inodes() is called, SPEC SFS sometimes
> simply fails due to the long response time on the I/O requests.
Hmm, what do you mean by "fails"?
> The same theory goes for the kupdate() daemon. That is, since there is only
> a single thread that does the inode and buffer flushing, under high load,
> kupdate() would not get a chance to call flush_dirty_buffers() until after
> sync_inodes() is completed. But sync_inodes() can take forever since inodes
> are flushed serially to disk. Imagine how long it might take if each inode
Inodes are not flushed serially to disk. Inodes are flushed to the dirty
buffer cache.
The problem here isn't the flush, but the read that we need to do before we
can write to the buffer. The machine is probably writing heavily to disk
when that stall happens, so the read latency is huge (even with the right
elevator latency numbers) while the I/O queue is full of writes.
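To make it concrete, the ext2 write_inode path boils down to something like
this (simplified sketch, the function name is invented and the block/offset
math is omitted):

/* simplified shape of ext2_update_inode() in fs/ext2/inode.c */
static void write_one_inode(struct inode *inode, int block, int offset)
{
        struct buffer_head *bh;
        struct ext2_inode *raw_inode;

        /*
         * This bread() is where the stall happens: if the inode table
         * block is not cached, the read has to queue up behind all the
         * writes already sitting in the I/O queue.
         */
        bh = bread(inode->i_dev, block, inode->i_sb->s_blocksize);
        if (!bh)
                return;
        raw_inode = (struct ext2_inode *)(bh->b_data + offset);

        /* ... copy the in-core inode fields into raw_inode ... */

        /* the "flush" itself only dirties the buffer cache */
        mark_buffer_dirty(bh);
        brelse(bh);
}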
> Again, the solution can be simple: one can create multiple
> dirty-buffer-flushing daemon threads that call flush_dirty_buffers()
> without the sync_supers()/sync_inodes() stuff. I have done so in my own test9
> kernel, and the results with SPEC SFS are much more pleasant.
kflushd should take care of the other work (flushing buffers) while the
machine is under write load, so I'm a little surprised it makes much
difference to move sync_inodes() into a separate kernel thread (because from
a certain perspective it is _already_ just in a separate kthread :).
That leads me to think kflushd may not be working properly, but I've not yet
checked whether something has changed in that area recently.
Or maybe you're writing to more than one disk at the same time, and having
two threads flushing buffers in parallel all the time allows you to keep both
disks busy (in that case that's surely not the right fix; one possible right
fix is instead to have a per-blockdevice list of dirty buffers and one
kflushd per queue, cloned at blkdev registration. I was discussing this
problem with Jens a few weeks ago).
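Just to give the idea, it could look vaguely like this (completely made-up
sketch, none of these names exist in the tree):

/* one flushing context per block device, set up at blkdev registration */
struct blkdev_flush {
        kdev_t                  dev;
        struct list_head        dirty_buffers;  /* per-device dirty list */
};

static int blkdev_kflushd(void *__bf)
{
        struct blkdev_flush *bf = __bf;

        for (;;) {
                /*
                 * Flush only this device's dirty buffers, so one slow or
                 * overloaded disk cannot stall flushing for the others.
                 */
                flush_device_dirty_buffers(bf); /* invented helper */
                run_task_queue(&tq_disk);

                current->state = TASK_INTERRUPTIBLE;
                schedule_timeout(5 * HZ);
        }
        return 0;
}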
Andrea