Reduce log I/O latency
To ensure that log I/O is issued as the highest priority I/O, set
the I/O priority of the log I/O to the highest possible. This will
ensure that log I/O is not held up behind bulk data or other
metadata I/O as delaying log I/O can pause the entire transaction
subsystem. Introduce a new buffer flag to allow us to tag the log
buffers so we can discriminate when issuing the I/O.
Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/linux-2.6/xfs_buf.c | 3 +++
fs/xfs/linux-2.6/xfs_buf.h | 5 ++++-
fs/xfs/xfs_log.c | 2 ++
3 files changed, 9 insertions(+), 1 deletion(-)
Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.c 2007-11-22 10:47:21.937396362 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c 2007-11-22 10:53:11.556186722 +1100
@@ -1255,6 +1255,9 @@ next_chunk:
submit_io:
if (likely(bio->bi_size)) {
+ /* log I/O should not be delayed by anything. */
+ if (bp->b_flags & XBF_LOG_BUFFER)
+ bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0));
submit_bio(rw, bio);
if (size)
goto next_chunk;
Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.h
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.h 2007-11-22 10:47:21.945395328 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.h 2007-11-22 10:53:11.556186722 +1100
@@ -53,7 +53,8 @@ typedef enum {
XBF_DELWRI = (1 << 6), /* buffer has dirty pages */
XBF_STALE = (1 << 7), /* buffer has been staled, do not find it */
XBF_FS_MANAGED = (1 << 8), /* filesystem controls freeing memory */
- XBF_ORDERED = (1 << 11), /* use ordered writes */
+ XBF_LOG_BUFFER = (1 << 9), /* Buffer issued by the log */
+ XBF_ORDERED = (1 << 11), /* use ordered writes */
XBF_READ_AHEAD = (1 << 12), /* asynchronous read-ahead */
/* flags used only as arguments to access routines */
@@ -340,6 +341,8 @@ extern void xfs_buf_trace(xfs_buf_t *, c
#define XFS_BUF_TARGET(bp) ((bp)->b_target)
#define XFS_BUFTARG_NAME(target) xfs_buf_target_name(target)
+#define XFS_BUF_SET_LOGBUF(bp) ((bp)->b_flags |= XBF_LOG_BUFFER)
+
static inline int xfs_bawrite(void *mp, xfs_buf_t *bp)
{
bp->b_fspriv3 = mp;
Index: 2.6.x-xfs-new/fs/xfs/xfs_log.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_log.c 2007-11-22 10:47:21.945395328 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_log.c 2007-11-22 10:53:11.556186722 +1100
@@ -1443,6 +1443,8 @@ xlog_sync(xlog_t *log,
XFS_BUF_ZEROFLAGS(bp);
XFS_BUF_BUSY(bp);
XFS_BUF_ASYNC(bp);
+ XFS_BUF_SET_LOGBUF(bp);
+
/*
* Do an ordered write for the log block.
* Its unnecessary to flush the first split block in the log wrap case.
David Chinner <[email protected]> writes:
> To ensure that log I/O is issued as the highest priority I/O, set
> the I/O priority of the log I/O to the highest possible. This will
> ensure that log I/O is not held up behind bulk data or other
> metadata I/O as delaying log I/O can pause the entire transaction
> subsystem. Introduce a new buffer flag to allow us to tag the log
> buffers so we can discriminate when issuing the I/O.
Won't that possibly disturb other RT priority users that do not need
log IO (e.g. working on preallocated files)? Seems a little
dangerous.
I suspect you really want a "higher than bulk but lower than RT" priority
for this, unless there is a block RT priority task waiting
for log IO (but keeping track of the latter might be tricky).
-Andi
On Thu, Nov 22, 2007 at 01:49:25AM +0100, Andi Kleen wrote:
> David Chinner <[email protected]> writes:
>
> > To ensure that log I/O is issued as the highest priority I/O, set
> > the I/O priority of the log I/O to the highest possible. This will
> > ensure that log I/O is not held up behind bulk data or other
> > metadata I/O as delaying log I/O can pause the entire transaction
> > subsystem. Introduce a new buffer flag to allow us to tag the log
> > buffers so we can discriminate when issuing the I/O.
>
> Won't that possibly disturb other RT priority users that do not need
> log IO (e.g. working on preallocated files)? Seems a little
> dangerous.
In all the cases that I know of where ppl are using what could
be considered real-time I/O (e.g. media environments where they
do real-time ingest and playout from the same filesystem) the
real-time ingest processes create the files and do pre-allocation
before doing their I/O. This I/O can get held up behind another
process that is not real time that has issued log I/O.
Given there is no I/O priority inheritance and having log I/O stall
will stall the entire filesystem, we cannot allow log I/O to
stall in real-time environments. Hence it must have the highest
possible priority to prevent this.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Thu, Nov 22, 2007 at 12:12:14PM +1100, David Chinner wrote:
> On Thu, Nov 22, 2007 at 01:49:25AM +0100, Andi Kleen wrote:
> > David Chinner <[email protected]> writes:
> >
> > > To ensure that log I/O is issued as the highest priority I/O, set
> > > the I/O priority of the log I/O to the highest possible. This will
> > > ensure that log I/O is not held up behind bulk data or other
> > > metadata I/O as delaying log I/O can pause the entire transaction
> > > subsystem. Introduce a new buffer flag to allow us to tag the log
> > > buffers so we can discriminate when issuing the I/O.
> >
> > Won't that possibly disturb other RT priority users that do not need
> > log IO (e.g. working on preallocated files)? Seems a little
> > dangerous.
>
> In all the cases that I know of where ppl are using what could
> be considered real-time I/O (e.g. media environments where they
> do real-time ingest and playout from the same filesystem) the
> real-time ingest processes create the files and do pre-allocation
> before doing their I/O. This I/O can get held up behind another
> process that is not real time that has issued log I/O.
>
> Given there is no I/O priority inheritance and having log I/O stall
> will stall the entire filesystem, we cannot allow log I/O to
> stall in real-time environments. Hence it must have the highest
> possible priority to prevent this.
I've seen PVRs that would be upset by this. They put media on one
filesystem and database/apps/swap/etc. on another, but have everything
on a single spindle. Stalling a media filesystem read for a write
anywhere else = fail.
--
Mathematics is the supreme nostalgia of our time.
On Wed, Nov 21, 2007 at 08:57:27PM -0600, Matt Mackall wrote:
> On Thu, Nov 22, 2007 at 12:12:14PM +1100, David Chinner wrote:
> > In all the cases that I know of where ppl are using what could
> > be considered real-time I/O (e.g. media environments where they
> > do real-time ingest and playout from the same filesystem) the
> > real-time ingest processes create the files and do pre-allocation
> > before doing their I/O. This I/O can get held up behind another
> > process that is not real time that has issued log I/O.
> >
> > Given there is no I/O priority inheritance and having log I/O stall
> > will stall the entire filesystem, we cannot allow log I/O to
> > stall in real-time environments. Hence it must have the highest
> > possible priority to prevent this.
>
> I've seen PVRs that would be upset by this. They put media on one
> filesystem and database/apps/swap/etc. on another, but have everything
> on a single spindle. Stalling a media filesystem read for a write
> anywhere else = fail.
Sounds like the PVR is badly designed to me. If a write can cause a
read to miss a playback deadline, then you haven't built enough
buffering into your playback application.
Cheers,
Dave.
On Thu, 2007-11-22 at 12:12 +1100, David Chinner wrote:
> In all the cases that I know of where ppl are using what could
> be considered real-time I/O (e.g. media environments where they
> do real-time ingest and playout from the same filesystem) the
> real-time ingest processes create the files and do pre-allocation
> before doing their I/O. This I/O can get held up behind another
> process that is not real time that has issued log I/O.
>
> Given there is no I/O priority inheritance and having log I/O stall
> will stall the entire filesystem, we cannot allow log I/O to
> stall in real-time environments. Hence it must have the highest
> possible priority to prevent this.
FWIW from a "real time" database POV this seems to make sense to me...
in fact, we probably rely on filesystem metadata way too much
(historically it's just "worked".... although we do seem to get issues
on ext3).
I have a (casually stupid) simulation program... although I've observed
little to no problems on all my XFS tests using it.
--
Stewart Smith, Senior Software Engineer (MySQL Cluster)
MySQL AB, http://www.mysql.com
On Thu, Nov 22, 2007 at 02:41:06PM +1100, David Chinner wrote:
> On Wed, Nov 21, 2007 at 08:57:27PM -0600, Matt Mackall wrote:
> > On Thu, Nov 22, 2007 at 12:12:14PM +1100, David Chinner wrote:
> > > In all the cases that I know of where ppl are using what could
> > > be considered real-time I/O (e.g. media environments where they
> > > do real-time ingest and playout from the same filesystem) the
> > > real-time ingest processes create the files and do pre-allocation
> > > before doing their I/O. This I/O can get held up behind another
> > > process that is not real time that has issued log I/O.
> > >
> > > Given there is no I/O priority inheritance and having log I/O stall
> > > will stall the entire filesystem, we cannot allow log I/O to
> > > stall in real-time environments. Hence it must have the highest
> > > possible priority to prevent this.
> >
> > I've seen PVRs that would be upset by this. They put media on one
> > filesystem and database/apps/swap/etc. on another, but have everything
> > on a single spindle. Stalling a media filesystem read for a write
> > anywhere else = fail.
>
> Sounds like the PVR is badly designed to me. If a write can cause a
> read to miss a playback deadline, then you haven't built enough
> buffering into your playback application.
Normally it's not a problem. But your proposed change can push a
working system into a non-working system by making non-critical I/O on
an unrelated filesystem have higher priority than the thing that -actually
has real-time constraints-.
In other words, I/O priority is per-spindle and not per-filesystem and
thus this change has consequences that leak outside the filesystem in
question. That's bad.
I'd further add that the kernel internals probably shouldn't wander
into RT priority levels unless it's actually doing priority
inheritance, otherwise it's quite likely to upset the careful
considerations of the RT system designer's priority schemes. For
instance, a log-heavy but otherwise non-RT load with this patch could
possibly completely starve direct I/O to another partition even though
it's marked RT, thus livelocking the system.
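The starvation scenario can be illustrated with a toy model. The sketch below assumes a strict-priority dispatcher that always drains the more urgent source first; that is a simplification for illustration, not the real I/O scheduler logic, and every name in it (`struct toy_source`, `simulate`, `app_served_under_log_load`) is hypothetical:

```c
#include <assert.h>

/* Toy model: two request sources sharing one spindle.
 * A lower level number means a higher priority, as with I/O priority levels. */
struct toy_source {
    int level;     /* priority level within the RT class (0 is highest) */
    int arrivals;  /* requests arriving per tick */
    int pending;   /* requests waiting to be dispatched */
    int served;    /* requests dispatched so far */
};

/* Run the model for `ticks` ticks. The disk services `capacity` requests
 * per tick and always takes from the more urgent source first. */
void simulate(struct toy_source *a, struct toy_source *b,
              int capacity, int ticks)
{
    assert(capacity >= 0 && ticks >= 0);

    struct toy_source *hi = (a->level <= b->level) ? a : b;
    struct toy_source *lo = (hi == a) ? b : a;

    for (int t = 0; t < ticks; t++) {
        hi->pending += hi->arrivals;
        lo->pending += lo->arrivals;

        int budget = capacity;

        /* Strict priority: the urgent source gets the whole budget first. */
        int n = hi->pending < budget ? hi->pending : budget;
        hi->pending -= n;
        hi->served += n;
        budget -= n;

        /* Whatever is left over goes to the less urgent source. */
        n = lo->pending < budget ? lo->pending : budget;
        lo->pending -= n;
        lo->served += n;
    }
}

/* Log I/O promoted to RT level 0 competes with an RT application at
 * level 4 that issues 2 requests per tick. Returns how many of the
 * application's requests were served in 100 ticks on a disk that can
 * service 10 requests per tick. */
int app_served_under_log_load(int log_arrivals)
{
    struct toy_source log = { .level = 0, .arrivals = log_arrivals };
    struct toy_source app = { .level = 4, .arrivals = 2 };

    simulate(&log, &app, 10, 100);
    return app.served;
}
```

In this model a log load that saturates the disk (10 requests per tick) leaves the RT application with nothing, while a light log load (3 per tick) lets all of its requests through.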
To the general PVR problem: they typically want to work with a minimum
of buffering to maximize responsiveness to user commands (fast
forward, jump 30 seconds, play in reverse). Now consider that you're
recording and playing back multiple HD streams on low-margin set-top
hardware and you'll see that making this work -at all- means lots of
I/O tuning.
--
Mathematics is the supreme nostalgia of our time.
On Thu, Nov 22, 2007 at 01:25:49AM -0600, Matt Mackall wrote:
> On Thu, Nov 22, 2007 at 02:41:06PM +1100, David Chinner wrote:
> > On Wed, Nov 21, 2007 at 08:57:27PM -0600, Matt Mackall wrote:
> > > On Thu, Nov 22, 2007 at 12:12:14PM +1100, David Chinner wrote:
> > > > In all the cases that I know of where ppl are using what could
> > > > be considered real-time I/O (e.g. media environments where they
> > > > do real-time ingest and playout from the same filesystem) the
> > > > real-time ingest processes create the files and do pre-allocation
> > > > before doing their I/O. This I/O can get held up behind another
> > > > process that is not real time that has issued log I/O.
> > > >
> > > > Given there is no I/O priority inheritance and having log I/O stall
> > > > will stall the entire filesystem, we cannot allow log I/O to
> > > > stall in real-time environments. Hence it must have the highest
> > > > possible priority to prevent this.
> > >
> > > I've seen PVRs that would be upset by this. They put media on one
> > > filesystem and database/apps/swap/etc. on another, but have everything
> > > on a single spindle. Stalling a media filesystem read for a write
> > > anywhere else = fail.
> >
> > Sounds like the PVR is badly designed to me. If a write can cause a
> > read to miss a playback deadline, then you haven't built enough
> > buffering into your playback application.
>
> Normally it's not a problem. But your proposed change can push a
> working system into a non-working system by making non-critical I/O on
> an unrelated filesystem have higher priority than the thing that -actually
> has real-time constraints-.
>
> In other words, I/O priority is per-spindle and not per-filesystem and
> thus this change has consequences that leak outside the filesystem in
> question. That's bad.
This has nothing to do with this patch - it's a problem with sharing
a single resource in a RT system between two non-deterministic
constructs. e.g. I can put two ext3 filesystems on the one spindle,
run two completely independent RT workloads on the different
filesystems and have one workload DOS the other due to differences
in priority at the spindle.
That's not a bug in ext3 or the I/O priority mechanism - that's bad
system design. Put the filesystems on different spindles and the
problem goes away.
> I'd further add that the kernel internals probably shouldn't wander
> into RT priority levels unless it's actually doing priority
> inheritance, otherwise it's quite likely to upset the careful
> considerations of the RT system designer's priority schemes.
Even if issuing RT I/O will guarantee problems in a RT system?
We've put this cool RT I/O prioritisation mechanism in the I/O layer
without any consideration of what it means for the filesystems that
the I/O must pass through first. The design defines I/O
prioritisation from a *process* POV and it ignores the fact that the
filesystem might not work effectively under such a prioritisation
mechanism.
An example, perhaps.
If you're smart about the way your application does its multi-stream
RT write I/O you preallocate the space and use direct I/O. But even
though you've preallocated the space, in XFS you still need
transactions to work because you have to mark the extent you just
wrote to as written.
This conversion happens during I/O completion (i.e. in a workqueue)
so it doesn't have the *process* priority to force out log I/O at the same
priority as the RT thread. Hence once all the log buffers are queued
for I/O, the transaction system blocks all the I/O completion workqueues
and all the RT write I/O stops completing and your application, which
is doing synchronous direct I/O into preallocated regions hangs.....
Hence the only way to give the log I/O enough priority to be issued
is to give the I/O a higher priority than anything that is running
at the time.
This is not a problem the I/O scheduler can solve - it is a result
of the mechanism used to transfer priority from process context to
I/O context. The needs of the filesystem are the key thing that is
missing here - you can't do RT I/O if the filesystem backs up....
> Now consider that you're
> recording and playing back multiple HD streams on low-margin set-top
> hardware and you'll see that making this work -at all- means lots of
> I/O tuning.
Yes, it does. But along the same lines, sustaining multiple
uncompressed 2k and 4k streams (i.e. multiple GB/s of throughput)
takes a lot of I/O tuning. We had to design a whole new allocator to
tune the I/O patterns to make it work....
Basically, we're not optimising XFS for small, embedded systems. We
are at the other end of the scale - XFS is optimised for very large,
very expensive storage subsystems and hence we often do things that
don't make sense for embedded systems...
Cheers,
Dave.
> FWIW from a "real time" database POV this seems to make sense to me...
> in fact, we probably rely on filesystem metadata way too much
> (historically it's just "worked".... although we do seem to get issues
> on ext3).
For that case you really would need priority inheritance: any metadata
IO on behalf of, or blocking, a process needs to use the process's block IO
priority.
David's change just fixes a limited set of cases, but breaks others.
-Andi
On Thu, Nov 22, 2007 at 01:06:11PM +0100, Andi Kleen wrote:
> > FWIW from a "real time" database POV this seems to make sense to me...
> > in fact, we probably rely on filesystem metadata way too much
> > (historically it's just "worked".... although we do seem to get issues
> > on ext3).
>
> For that case you really would need priority inheritance: any metadata
> IO on behalf of, or blocking, a process needs to use the process's block IO
> priority.
How do you do that when the processes are blocking on semaphores,
mutexes or rw-semaphores in the filesystem three layers removed from
the I/O in progress?
e.g. a low priority process transaction is holding the AGF buffer
locked but the transaction is blocked waiting for some other
metadata I/O it has issued needed in the transaction. That metadata
I/O is being held out by a higher priority process doing lots of
I/O.
Another process at the same priority creates a file, requiring
inodes to be allocated so it locks the directory into the
transaction and later blocks on the AGF buffer semaphore trying to
allocate space for the new inode.
A very high priority process now comes along and tries to read the
directory locked in the create transaction, and blocks on the
directory inode ilock because it's already held in write mode.
That's three processes all blocked on locks unrelated to the I/O
that is being held out, and there is no direct connection that can
be used to pass the priority down to the blocked I/O that is causing
all the problems.....
It's a Bad Idea.
Cheers,
Dave.
On Thu, Nov 22, 2007 at 09:31:59PM +1100, David Chinner wrote:
[...]
> > In other words, I/O priority is per-spindle and not per-filesystem and
> > thus this change has consequences that leak outside the filesystem in
> > question. That's bad.
>
> This has nothing to do with this patch - it's a problem with sharing
> a single resource in a RT system between two non-deterministic
> constructs. e.g. I can put two ext3 filesystems on the one spindle,
> run two completely independent RT workloads on the different
> filesystems and have one workload DOS the other due to differences
> in priority at the spindle.
Sure. And it's up to the RT system designer not to do something stupid
like that. The problem is that your patch potentially promotes a
non-RT I/O activity to an RT one without regard to the rest of the
system.
Stop for a moment and look at all the kernel threads on the system. If
your argument was sensible, we'd have raised various of these threads
to (CPU) RT ages ago. But if we do, we actually royally foul up things
that have tried to carefully isolate themselves from SCHED_NORMAL
tasks. If a kernel thread preempted our watchdog or our data
collection process because it was trying to service a lower priority
task, that would be fatally broken.
As kernel engineers, we -do not know- the absolute importance of a
given subsystem in the wider scheme of things. Thus we have no
business promoting anything outside the "normal" range into RT unless
explicitly asked to (eg chrt) or if we're actually doing real deadlock
avoidance.
> That's not a bug in ext3 or the I/O priority mechanism - that's bad
> system design. Put the filesystems on different spindles and the
> problem goes away.
And so do all your PVR sales. Two spindles is economically impossible
on a set-top PVR.
And I rather expect this all also applies to having XFS volumes on top
of LVM + RAID5 along with other filesystems but I haven't looked
closely.
> > I'd further add that the kernel internals probably shouldn't wander
> > into RT priority levels unless it's actually doing priority
> > inheritance, otherwise it's quite likely to upset the careful
> > considerations of the RT system designer's priority schemes.
>
> Even if issuing RT I/O will guarantee problems in a RT system?
Absolutely (unless we're actually going to do priority inheritance).
The only person who can know the real RT requirements of a system is
the system's designer. If he wants to boost the priority of XFS I/O
threads into RT, he should be allowed to, but it shouldn't happen
automatically.
Consider someone concurrently running a database on a filesystem and
an RT data collection task direct to a separate partition. RT I/O may
currently allow them to successfully
> We've put this cool RT I/O prioritisation mechanism in the I/O layer
> without any consideration of what it means for the filesystems that
> the I/O must pass through first. The design defines I/O
> prioritisation from a *process* POV and it ignores the fact that the
> filesystem might not work effectively under such a prioritisation
> mechanism.
>
> An example, perhaps.
>
> If you're smart about the way your application does its multi-stream
> RT write I/O you preallocate the space and use direct I/O. But even
> though you've preallocated the space, in XFS you still need
> transactions to work because you have to mark the extent you just
> wrote to as written.
>
> This conversion happens during I/O completion (i.e. in a workqueue)
> so it doesn't have the *process* priority to force out log I/O at the same
> priority as the RT thread. Hence once all the log buffers are queued
> for I/O, the transaction system blocks all the I/O completion workqueues
> and all the RT write I/O stops completing and your application, which
> is doing synchronous direct I/O into preallocated regions hangs.....
Perfectly understood. And that's fine. A system designer is allowed to
shoot himself in the foot.
> Hence the only way to give the log I/O enough priority to be issued
> is to give the I/O a higher priority than anything that is running
> at the time.
>
> This is not a problem the I/O scheduler can solve - it is a result
> of the mechanism used to transfer priority from process context to
> I/O context. The needs of the filesystem are the key thing that is
> missing here - you can't do RT I/O if the filesystem backs up....
I don't think there's any fundamental reason the I/O subsystem or
filesystems can't be taught to handle priority inversion, which is
a much more acceptable and general fix.
> > Now consider that you're
> > recording and playing back multiple HD streams on low-margin set-top
> > hardware and you'll see that making this work -at all- means lots of
> > I/O tuning.
>
> Yes, it does. But along the same lines, sustaining multiple
> uncompressed 2k and 4k streams (i.e. multiple GB/s of throughput)
> takes a lot of I/O tuning. We had to design a whole new allocator to
> tune the I/O patterns to make it work....
..which makes it fairly attractive to PVR folks until you go mucking
with RT behind their backs.
If I've got XFS on filesystems A and B on the same spindle (or volume
group?) and my real RT I/O takes place only on B, then I want log
flushing to happen in RT on B. But -never on A-. If I can do this with
a tunable, I'm perfectly happy.
--
Mathematics is the supreme nostalgia of our time.
On Thu, Nov 22, 2007 at 12:10:29PM -0600, Matt Mackall wrote:
> On Thu, Nov 22, 2007 at 09:31:59PM +1100, David Chinner wrote:
> [...]
> > > In other words, I/O priority is per-spindle and not per-filesystem and
> > > thus this change has consequences that leak outside the filesystem in
> > > question. That's bad.
> >
> > This has nothing to do with this patch - it's a problem with sharing
> > a single resource in a RT system between two non-deterministic
> > constructs. e.g. I can put two ext3 filesystems on the one spindle,
> > run two completely independent RT workloads on the different
> > filesystems and have one workload DOS the other due to differences
> > in priority at the spindle.
>
> Sure. And it's up to the RT system designer not to do something stupid
> like that. The problem is that your patch potentially promotes a
> non-RT I/O activity to an RT one without regard to the rest of the
> system.
So this:
http://marc.info/?l=linux-kernel&m=119247074517414&w=2
shouldn't be allowed, either? (rt kjournald for ext3)
> Perfectly understood. And that's fine. A system designer is allowed to
> shoot himself in the foot.
Ok. I'll point anyone that complains at you, Matt ;)
> I don't think there's any fundamental reason the I/O subsystem or
> filesystems can't be taught to handle priority inversion, which is
> a much more acceptable and general fix.
See my reply to Andi.
> If I've got XFS on filesystems A and B on the same spindle (or volume
> group?) and my real RT I/O takes place only on B, then I want log
> flushing to happen in RT on B. But -never on A-. If I can do this with
> a tunable, I'm perfectly happy.
No, not another mount option. I'm just going to drop this one for
now...
Cheers,
Dave.
On Fri, Nov 23, 2007 at 09:29:09AM +1100, David Chinner wrote:
> On Thu, Nov 22, 2007 at 12:10:29PM -0600, Matt Mackall wrote:
> > If I've got XFS on filesystems A and B on the same spindle (or volume
> > group?) and my real RT I/O takes place only on B, then I want log
> > flushing to happen in RT on B. But -never on A-. If I can do this with
> > a tunable, I'm perfectly happy.
>
> No, not another mount option. I'm just going to drop this one for
> now...
Actually, I might change it to use the highest non-rt priority, which
would solve the latency issues in the normal cases and still leave
the RT rope dangling for those that want to use it.
Is that an acceptable compromise, Matt?
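A minimal sketch of that compromise, assuming the priority encoding used by `linux/ioprio.h` (class in the bits above a 13-bit shift, level in the low bits, level 0 being the highest within a class). The macros are redefined locally so the fragment builds in user space, and `log_buffer_ioprio()` and `ioprio_beats()` are hypothetical helpers that are not part of the posted patch:

```c
#include <assert.h>

/* Local copies of the I/O priority encoding, following linux/ioprio.h. */
#define IOPRIO_CLASS_SHIFT 13
#define IOPRIO_PRIO_VALUE(class, data) (((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_PRIO_CLASS(mask) ((mask) >> IOPRIO_CLASS_SHIFT)
#define IOPRIO_PRIO_DATA(mask) ((mask) & ((1 << IOPRIO_CLASS_SHIFT) - 1))

enum {
    IOPRIO_CLASS_NONE,
    IOPRIO_CLASS_RT,
    IOPRIO_CLASS_BE,
    IOPRIO_CLASS_IDLE,
};

/* Flag value taken from the patch. */
#define XBF_LOG_BUFFER (1 << 9)

/* Hypothetical helper: pick the priority for a buffer's I/O. Tagged log
 * buffers get the highest best-effort level instead of RT, so they still
 * rank below anything a system designer has explicitly placed in RT. */
int log_buffer_ioprio(unsigned int b_flags, int default_prio)
{
    if (b_flags & XBF_LOG_BUFFER)
        return IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 0);
    return default_prio;
}

/* Returns nonzero if priority `a` ranks ahead of priority `b` in the
 * nominal ordering: RT before BE before IDLE, then lower level first. */
int ioprio_beats(int a, int b)
{
    assert(IOPRIO_PRIO_CLASS(a) != IOPRIO_CLASS_NONE);
    assert(IOPRIO_PRIO_CLASS(b) != IOPRIO_CLASS_NONE);

    if (IOPRIO_PRIO_CLASS(a) != IOPRIO_PRIO_CLASS(b))
        return IOPRIO_PRIO_CLASS(a) < IOPRIO_PRIO_CLASS(b);
    return IOPRIO_PRIO_DATA(a) < IOPRIO_PRIO_DATA(b);
}
```

With this choice the log I/O ranks ahead of ordinary best-effort I/O, but even the lowest RT level still ranks ahead of it.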
Cheers,
Dave.
On Fri, Nov 23, 2007 at 09:29:09AM +1100, David Chinner wrote:
> On Thu, Nov 22, 2007 at 12:10:29PM -0600, Matt Mackall wrote:
> > On Thu, Nov 22, 2007 at 09:31:59PM +1100, David Chinner wrote:
> > [...]
> > > > In other words, I/O priority is per-spindle and not per-filesystem and
> > > > thus this change has consequences that leak outside the filesystem in
> > > > question. That's bad.
> > >
> > > This has nothing to do with this patch - it's a problem with sharing
> > > a single resource in a RT system between two non-deterministic
> > > constructs. e.g. I can put two ext3 filesystems on the one spindle,
> > > run two completely independent RT workloads on the different
> > > filesystems and have one workload DOS the other due to differences
> > > in priority at the spindle.
> >
> > Sure. And it's up to the RT system designer not to do something stupid
> > like that. The problem is that your patch potentially promotes a
> > non-RT I/O activity to an RT one without regard to the rest of the
> > system.
>
> So this:
>
> http://marc.info/?l=linux-kernel&m=119247074517414&w=2
>
> shouldn't be allowed, either? (rt kjournald for ext3)
No, I think not. If a user wants to manually promote kjournald, that's fine.
> > Perfectly understood. And that's fine. A system designer is allowed to
> > shoot himself in the foot.
>
> Ok. I'll point anyone that complains at you, Matt ;)
>
> > I don't think there's any fundamental reason the I/O subsystem or
> > filesystems can't be taught to handle priority inversion, which is
> > a much more acceptable and general fix.
>
> See my reply to Andi.
I did. And I'll admit it's pretty thorny and I certainly don't know
enough about XFS internals to comment further.
> > If I've got XFS on filesystems A and B on the same spindle (or volume
> > group?) and my real RT I/O takes place only on B, then I want log
> > flushing to happen in RT on B. But -never on A-. If I can do this with
> > a tunable, I'm perfectly happy.
>
> No, not another mount option. I'm just going to drop this one for
> now...
I was actually just suggesting allowing a user to do ioprio_set on the
appropriate kernel threads.
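A minimal sketch of what that could look like from user space, assuming the PID of the relevant kernel thread is already known (finding it is not shown). glibc provides no wrapper for `ioprio_set`, so the raw syscall is used, and the constants are defined locally following `linux/ioprio.h`. `make_ioprio()` and `set_thread_ioprio()` are hypothetical helper names:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Local copies of the I/O priority encoding, following linux/ioprio.h. */
#define IOPRIO_CLASS_SHIFT 13
#define IOPRIO_PRIO_VALUE(class, data) (((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_WHO_PROCESS 1

enum {
    IOPRIO_CLASS_NONE,
    IOPRIO_CLASS_RT,
    IOPRIO_CLASS_BE,
    IOPRIO_CLASS_IDLE,
};

/* Build a priority value from a class and a level (0 is the highest). */
int make_ioprio(int class, int level)
{
    assert(class >= IOPRIO_CLASS_NONE && class <= IOPRIO_CLASS_IDLE);
    assert(level >= 0 && level < 8);
    return IOPRIO_PRIO_VALUE(class, level);
}

/* Set the I/O priority of one thread, identified by its PID/TID.
 * Returns 0 on success, or -1 with errno set. Moving a thread into
 * the RT class normally requires elevated privileges (CAP_SYS_ADMIN). */
int set_thread_ioprio(pid_t pid, int class, int level)
{
    return (int)syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, pid,
                        make_ioprio(class, level));
}
```

A system designer who wants RT log flushing on one filesystem only would call something like `set_thread_ioprio(pid, IOPRIO_CLASS_RT, 0)` on the threads serving that filesystem and leave the others alone. Whether the priority of those threads actually carries through to the log I/O they issue is an assumption here.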
--
Mathematics is the supreme nostalgia of our time.
On Fri, Nov 23, 2007 at 10:09:22AM +1100, David Chinner wrote:
> On Fri, Nov 23, 2007 at 09:29:09AM +1100, David Chinner wrote:
> > On Thu, Nov 22, 2007 at 12:10:29PM -0600, Matt Mackall wrote:
> > > If I've got XFS on filesystems A and B on the same spindle (or volume
> > > group?) and my real RT I/O takes place only on B, then I want log
> > > flushing to happen in RT on B. But -never on A-. If I can do this with
> > > a tunable, I'm perfectly happy.
> >
> > No, not another mount option. I'm just going to drop this one for
> > now...
>
> Actually, I might change it to use the highest non-rt priority, which
> would solve the latency issues in the normal cases and still leave
> the RT rope dangling for those that want to use it.
>
> Is that an acceptable compromise, Matt?
Yes, that's perfectly fine.
--
Mathematics is the supreme nostalgia of our time.
On Fri, Nov 23, 2007 at 12:15:39AM +1100, David Chinner wrote:
> On Thu, Nov 22, 2007 at 01:06:11PM +0100, Andi Kleen wrote:
> > > FWIW from a "real time" database POV this seems to make sense to me...
> > > in fact, we probably rely on filesystem metadata way too much
> > > (historically it's just "worked".... although we do seem to get issues
> > > on ext3).
> >
> > For that case you really would need priority inheritance: any metadata
> > IO on behalf of, or blocking, a process needs to use the process's block IO
> > priority.
>
> How do you do that when the processes are blocking on semaphores,
> mutexes or rw-semaphores in the filesystem three layers removed from
> the I/O in progress?
[...] I didn't say it was easy (or rather explicitly said it would be tricky).
Probably it would be possible to fold it somehow into rt mutexes PI,
but it's not easy and semaphores would need to be handled too.
My point was just that unconditionally increasing the priority to solve the
metadata RT problem is a bad idea and not really a replacement for a "full"
solution. Short term, a user can just increase the priority of all the XFS
threads anyway.
-Andi
On Fri, Nov 23, 2007 at 03:53:17AM +0100, Andi Kleen wrote:
> On Fri, Nov 23, 2007 at 12:15:39AM +1100, David Chinner wrote:
> > On Thu, Nov 22, 2007 at 01:06:11PM +0100, Andi Kleen wrote:
> > > > FWIW from a "real time" database POV this seems to make sense to me...
> > > > in fact, we probably rely on filesystem metadata way too much
> > > > (historically it's just "worked".... although we do seem to get issues
> > > > on ext3).
> > >
> > > For that case you really would need priority inheritance: any metadata
> > > IO on behalf of, or blocking, a process needs to use the process's block IO
> > > priority.
> >
> > How do you do that when the processes are blocking on semaphores,
> > mutexes or rw-semaphores in the filesystem three layers removed from
> > the I/O in progress?
>
> [...] I didn't say it was easy (or rather explicitly said it would be tricky).
> Probably it would be possible to fold it somehow into rt mutexes PI,
> but it's not easy and semaphores would need to be handled too.
>
> My point was just that unconditionally increasing the priority to solve the
> metadata RT problem is a bad idea and not really a replacement for a "full"
> solution. Short term, a user can just increase the priority of all the XFS
> threads anyway.
The point is that it's not actually a thread-based problem - the priority
can't be inherited via the traditional mutex-like manner. There is no
connection between a thread and an I/O it has already issued and so you
can't transfer a priority from a blocked thread to an issued-but-blocked
I/O....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Fri, Nov 23, 2007 at 03:03:29PM +1100, David Chinner wrote:
> On Fri, Nov 23, 2007 at 03:53:17AM +0100, Andi Kleen wrote:
> > On Fri, Nov 23, 2007 at 12:15:39AM +1100, David Chinner wrote:
> > > On Thu, Nov 22, 2007 at 01:06:11PM +0100, Andi Kleen wrote:
> > > > > FWIW from a "real time" database POV this seems to make sense to me...
> > > > > in fact, we probably rely on filesystem metadata way too much
> > > > > (historically it's just "worked".... although we do seem to get issues
> > > > > on ext3).
> > > >
> > > > For that case you really would need priority inheritance: any metadata
> > > > IO on behalf or blocking a process needs to use the process' block IO
> > > > priority.
> > >
> > > How do you do that when the processes are blocking on semaphores,
> > > mutexes or rw-semaphores in the filesystem three layers removed from
> > > the I/O in progress?
> >
> > [...] I didn't say it was easy (or rather explicitly said it would be tricky).
> > Probably it would be possible to fold it somehow into rt mutexes PI,
> > but it's not easy and semaphores would need to be handled too.
> >
> > Just my point was to solve the metadata RT problem unconditionally increasing
> > the priority is a bad idea and not really a replacement to a "full"
> > solution. Short term a user can just increase the priority of all the XFS
> > threads anyways.
>
> The point is that it's not actually a thread-based problem - the priority
> can't be inherited via the traditional mutex-like manner. There is no
> connection between a thread and an I/o it has already issued and so you
> can't transfer a priority from a blocked thread to an issued-but-blocked
> i/o....
It could be handled in theory similar to standard CPU priority inheritance:
keep track of the IO priority of all threads you block and always boost your
IO priority to that level. But it would probably not be very easy to do.
-Andi
On Fri, Nov 23, 2007 at 01:01:15PM +0100, Andi Kleen wrote:
> On Fri, Nov 23, 2007 at 03:03:29PM +1100, David Chinner wrote:
> > On Fri, Nov 23, 2007 at 03:53:17AM +0100, Andi Kleen wrote:
> > > On Fri, Nov 23, 2007 at 12:15:39AM +1100, David Chinner wrote:
> > > > On Thu, Nov 22, 2007 at 01:06:11PM +0100, Andi Kleen wrote:
> > > > > > FWIW from a "real time" database POV this seems to make sense to me...
> > > > > > in fact, we probably rely on filesystem metadata way too much
> > > > > > (historically it's just "worked".... although we do seem to get issues
> > > > > > on ext3).
> > > > >
> > > > > For that case you really would need priority inheritance: any metadata
> > > > > IO on behalf or blocking a process needs to use the process' block IO
> > > > > priority.
> > > >
> > > > How do you do that when the processes are blocking on semaphores,
> > > > mutexes or rw-semaphores in the filesystem three layers removed from
> > > > the I/O in progress?
> > >
> > > [...] I didn't say it was easy (or rather explicitly said it would be tricky).
> > > Probably it would be possible to fold it somehow into rt mutexes PI,
> > > but it's not easy and semaphores would need to be handled too.
> > >
> > > Just my point was to solve the metadata RT problem unconditionally increasing
> > > the priority is a bad idea and not really a replacement to a "full"
> > > solution. Short term a user can just increase the priority of all the XFS
> > > threads anyways.
> >
> > The point is that it's not actually a thread-based problem - the priority
> > can't be inherited via the traditional mutex-like manner. There is no
> > connection between a thread and an I/o it has already issued and so you
> > can't transfer a priority from a blocked thread to an issued-but-blocked
> > i/o....
>
> It could be handled in theory similar to standard CPU priority inheritance --
> keep track of IO priority of all threads you block and boost your IO priority
> always to that level. But it would be probably not very easy to do.
Well, I think what Dave is saying is that we can't find the related
process. The submitter process may have even exited before the flush
happens. You'd instead have to keep track of (the max of) all the
submitted I/O segment priorities related to the transaction.
But I'm sure there are complications.
--
Mathematics is the supreme nostalgia of our time.
> Index: 2.6.x-xfs-new/fs/xfs/xfs_log.c
> ===================================================================
> --- 2.6.x-xfs-new.orig/fs/xfs/xfs_log.c 2007-11-22 10:47:21.945395328 +1100
> +++ 2.6.x-xfs-new/fs/xfs/xfs_log.c 2007-11-22 10:53:11.556186722 +1100
> @@ -1443,6 +1443,8 @@ xlog_sync(xlog_t *log,
> XFS_BUF_ZEROFLAGS(bp);
> XFS_BUF_BUSY(bp);
> XFS_BUF_ASYNC(bp);
> + XFS_BUF_SET_LOGBUF(bp);
> +
> /*
> * Do an ordered write for the log block.
> * Its unnecessary to flush the first split block in the log wrap case.
Whichever way you go with this one, Dave, you should probably add another
XFS_BUF_SET_LOGBUF() call for the buffer split case further down in the
same function.