My laptop drive seems to be waking up more often today and I suspect it's
somehow ext3/kjournald that's to blame. Does it obey the timings in
/proc/sys/vm/bdflush or does it have its own flush timer?
There's a more general problem with VM on laptops which is that the system
doesn't have any notion of spun-down disks. Flush intervals should be
short when the disk is running and long when it isn't and decisions about
which pages to discard or swap might be improvable. Pre-emptive swap when
the disk is spun down is a loss..
--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."
Oliver Xymoron wrote:
>
> My laptop drive seems to be waking up more often today and I suspect it's
> somehow ext3/kjournald that's to blame. Does it obey the timings in
> /proc/sys/vm/bdflush or does it have its own flush timer?
It has its own flush timer. This is something we need to crunch
on and think about.
There's an untested patch here which may suffice.
> There's a more general problem with VM on laptops which is that the system
> doesn't have any notion of spun-down disks. Flush intervals should be
> short when the disk is running and long when it isn't and decisions about
> which pages to discard or swap might be improvable. Pre-emptive swap when
> the disk is spun down is a loss..
Yup. The current VM is a bit too swap-happy, IMO. In try_to_free_pages(),
replace `priority = DEF_PRIORITY' with `priority = DEF_PRIORITY + 2'.
Also, if we had appropriate hooks into the request layer, we could detect
when the disk was being spun up for a read, and opportunistically flush
out any pending writes.
Tell me if this is joyful:
--- linux-2.4.15/fs/buffer.c Thu Nov 22 23:02:58 2001
+++ linux-akpm/fs/buffer.c Fri Nov 23 17:21:04 2001
@@ -119,6 +119,12 @@ union bdflush_param {
 int bdflush_min[N_PARAM] = { 0, 10, 5, 25, 0, 1*HZ, 0, 0, 0};
 int bdflush_max[N_PARAM] = {100,50000, 20000, 20000,10000*HZ, 6000*HZ, 100, 0, 0};
 
+int dirty_buffer_flush_interval(void)
+{
+	return bdf_prm.b_un.interval;
+}
+EXPORT_SYMBOL(dirty_buffer_flush_interval);
+
 void unlock_buffer(struct buffer_head *bh)
 {
 	clear_bit(BH_Wait_IO, &bh->b_state);
--- linux-2.4.15/fs/jbd/transaction.c Thu Nov 22 23:02:59 2001
+++ linux-akpm/fs/jbd/transaction.c Fri Nov 23 17:21:37 2001
@@ -43,6 +43,8 @@ extern spinlock_t journal_datalist_lock;
  * processes trying to touch the journal while it is in transition.
  */
 
+extern int dirty_buffer_flush_interval(void);
+
 static transaction_t * get_transaction (journal_t * journal, int is_try)
 {
 	transaction_t * transaction;
@@ -56,7 +58,7 @@ static transaction_t * get_transaction (
 	transaction->t_journal = journal;
 	transaction->t_state = T_RUNNING;
 	transaction->t_tid = journal->j_transaction_sequence++;
-	transaction->t_expires = jiffies + journal->j_commit_interval;
+	transaction->t_expires = jiffies + dirty_buffer_flush_interval();
 
 	/* Set up the commit timer for the new transaction. */
 	J_ASSERT (!journal->j_commit_timer_active);
On Fri, 23 Nov 2001, Andrew Morton wrote:
> Oliver Xymoron wrote:
> >
> > My laptop drive seems to be waking up more often today and I suspect it's
> > somehow ext3/kjournald that's to blame. Does it obey the timings in
> > /proc/sys/vm/bdflush or does it have its own flush timer?
>
> It has its own flush timer. This is something we need to crunch
> on and think about.
Ok. I think we'll probably end up needing per-device flush timers. Flushes
to jffs should work differently than flushes to disk, or to network
attached storage (iSCSI, nbd).
> > There's a more general problem with VM on laptops which is that the system
> > doesn't have any notion of spun-down disks. Flush intervals should be
> > short when the disk is running and long when it isn't and decisions about
> > which pages to discard or swap might be improvable. Pre-emptive swap when
> > the disk is spun down is a loss..
>
> Yup. The current VM is a bit too swap-happy, IMO. In try_to_free_pages(),
> replace `priority = DEF_PRIORITY' with `priority = DEF_PRIORITY + 2'.
>
> Also, if we had appropriate hooks into the request layer, we could detect
> when the disk was being spun up for a read, and opportunistically flush
> out any pending writes.
I think if the disk wakes up, then the time to next flush gets shortened
from long_interval to short_interval. If short_interval makes the next
flush in the past, it happens now. But if we sleep the disk and wake it up
immediately, we don't necessarily want to trigger a flush.
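A minimal sketch of that rule, in plain userspace C rather than kernel code. Every name here (struct flush_timer, next_flush_due, flush_now_on_wakeup) is hypothetical, and the tick counter merely stands in for jiffies:

```c
#include <stdbool.h>

/* Hypothetical state; nothing here is an existing kernel interface. */
struct flush_timer {
    unsigned long last_flush;     /* tick of the most recent flush */
    unsigned long short_interval; /* interval while the disk is spinning */
    unsigned long long_interval;  /* interval while the disk is spun down */
};

/* Wrap-safe "a is at or after b", in the style of time_after_eq(). */
static bool tick_after_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}

/* Tick at which the next flush is due for the given disk state. */
static unsigned long next_flush_due(const struct flush_timer *t,
                                    bool disk_spinning)
{
    return t->last_flush +
           (disk_spinning ? t->short_interval : t->long_interval);
}

/*
 * On wakeup the deadline shrinks from the long to the short interval.
 * Flush immediately only if that shortened deadline is already past.
 */
static bool flush_now_on_wakeup(const struct flush_timer *t,
                                unsigned long now)
{
    return tick_after_eq(now, next_flush_due(t, true));
}
```

With this rule, a disk that is put to sleep and woken again shortly after a flush does not trigger another one, because the short deadline has not yet passed.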
> Tell me if this is joyful:
Haven't tried it yet, but I'm afraid I don't see what makes it actually
sync with the dirty buffer flush. Wouldn't it be better to export a chain
of flush funcs hung off a timer?
--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."
Oliver Xymoron wrote:
>
> > Tell me if this is joyful:
>
> Haven't tried it yet, but I'm afraid I don't see what makes it actually
> sync with the dirty buffer flush. Wouldn't it be better to export a chain
> of flush funcs hung off a timer?
It doesn't sync with kupdate.
If you want to do that, just defeat the journal timer altogether. So:
transaction->t_expires = jiffies + 1000000000;
in get_transaction(). That way, kupdate's write_super() will
run a commit every bdf_prm.b_un.interval jiffies.
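For a sense of scale, here is a rough check of how long that offset is. It assumes HZ is 100 (ASSUMED_HZ is a stand-in, not the kernel's macro), and it assumes the expiry is compared with a wrap-safe test in the style of time_after(), which stays valid only while the offset is below half the counter range. Both helper names are made up for the example:

```c
#include <limits.h>

#define ASSUMED_HZ 100UL /* assumption: 100 timer ticks per second */

/* Whole days covered by a given number of ticks. */
static unsigned long ticks_to_days(unsigned long ticks)
{
    return ticks / ASSUMED_HZ / (60UL * 60UL * 24UL);
}

/*
 * A signed-difference comparison of tick counts only works while the
 * offset fits in a signed long, i.e. is below half the counter range.
 */
static int offset_is_wrap_safe(unsigned long offset)
{
    return offset <= (unsigned long)LONG_MAX;
}
```

Under those assumptions, 1000000000 ticks is roughly 115 days, and the value still fits in a signed 32-bit long.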
On Fri, 23 Nov 2001, Andrew Morton wrote:
> Oliver Xymoron wrote:
> >
> > > Tell me if this is joyful:
> >
> > Haven't tried it yet, but I'm afraid I don't see what makes it actually
> > sync with the dirty buffer flush. Wouldn't it be better to export a chain
> > of flush funcs hung off a timer?
>
> It doesn't sync with kupdate.
>
> If you want to do that, just defeat the journal timer altogether. So:
>
> transaction->t_expires = jiffies + 1000000000;
>
> in get_transaction(). That way, kupdate's write_super() will
> run a commit every bdf_prm.b_un.interval jiffies.
Ok, so what's the theory behind the journal timer? Why would we want
ext3 journal flushed more or less often than ext2 metadata given that
they're of equivalent importance?
--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."
Oliver Xymoron wrote:
>
> Ok, so what's the theory behind the journal timer? Why would we want
> ext3 journal flushed more or less often than ext2 metadata given that
> they're of equivalent importance?
umm, err.. If your machine crashes, ext3 will restore its state
to that which pertained between zero and five seconds before the crash.
With ext2+fsck, things are not as clear. Your data will be restored
to that which pertained from zero to thirty seconds prior to crash.
Inodes and the superblock will be restored to that which pertained from
zero to thirty-five seconds before the crash, stuff like that.
A five second window is short enough for you to be confident that
everything you want is still there. With thirty seconds, uncertainty
creeps in.
Yes, it needs to be configurable.
On Sun, 25 Nov 2001, Andrew Morton wrote:
> Oliver Xymoron wrote:
> >
> > Ok, so what's the theory behind the journal timer? Why would we want
> > ext3 journal flushed more or less often than ext2 metadata given that
> > they're of equivalent importance?
>
> umm, err.. If your machine crashes, ext3 will restore its state
> to that which pertained between zero and five seconds before the crash.
>
> With ext2+fsck, things are not as clear. Your data will be restored
> to that which pertained from zero to thirty seconds prior to crash.
And that's my point exactly. In terms of integrity, each timer serves the
same purpose - get the filesystem on disk in sync with what's in memory.
Obviously ext3 does a better job of this than ext2 in terms of recovering
from partial transactions, but in both cases the flush is accomplishing
the same thing. I can see no a priori reason why the ext3 journal flush
would be timed differently than the ext2 metadata flush. If the flush time for
ext3 should be shorter, then so should the time for everything else. See?
--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."
On Fri, Nov 23, 2001 at 05:25:46PM -0800, Andrew Morton wrote:
> Also, if we had appropriate hooks into the request layer, we could detect
> when the disk was being spun up for a read, and opportunistically flush
> out any pending writes.
Actually you can't. SCSI spinup code isn't very useful anyway, and IDE disks
mostly handle spinup themselves. The kernel has to issue a reset to get a
disk back alive from sleep mode, but revival from standby doesn't involve
the kernel at all. When using the disk's internal timer, it isn't involved in
spindown either. Teaching the request layer about disk state might therefore
turn out to become rather messy, I suspect.
> Tell me if this is joyful:
[...]
> - transaction->t_expires = jiffies + journal->j_commit_interval;
> + transaction->t_expires = jiffies + dirty_buffer_flush_interval();
This change doesn't take care of kupdated's most interesting feature, i.e.
that you can entirely stop it (with a flush interval of zero and/or a
SIGSTOP). Now, if kjournald honoured SIGSTOP/SIGCONT, I could teach noflushd
to handle the spindown issue in userland. Uh, at least for one small detail:
Is there a way to tell which kjournald process is associated to which
partition? A fake cmdline, or an fd to the partition's device node that
shows up in /proc/<pid>/fd would indeed be quite helpful.
Regards,
Daniel.
Daniel Kobras wrote:
>
> On Fri, Nov 23, 2001 at 05:25:46PM -0800, Andrew Morton wrote:
> > Also, if we had appropriate hooks into the request layer, we could detect
> > when the disk was being spun up for a read, and opportunistically flush
> > out any pending writes.
>
> Actually you can't. SCSI spinup code isn't very useful anyway, and IDE disks
> mostly handle spinup themselves. The kernel has to issue a reset to get a
> disk back alive from sleep mode, but revival from standby doesn't involve
> the kernel at all. When using the disk's internal timer, it isn't involved in
> spindown either. Teaching the request layer about disk state might therefore
> turn out to become rather messy, I suspect.
Much simpler approach:
if (we're about to read from the disk) {
if (we have dirty data which is > 10 seconds old) {
write_it_now();
}
}
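The same check as compilable C, purely as an illustration. The names should_flush_before_read, ASSUMED_HZ and MAX_DIRTY_AGE are invented for this sketch, HZ is assumed to be 100, and the tick arguments stand in for jiffies:

```c
#include <stdbool.h>

#define ASSUMED_HZ    100UL               /* assumption: ticks per second */
#define MAX_DIRTY_AGE (10UL * ASSUMED_HZ) /* the 10 seconds from above */

/*
 * Decide, just before issuing a read, whether pending dirty data
 * should be written out as well. The unsigned subtraction keeps the
 * age correct even if the tick counter has wrapped since the data
 * was dirtied.
 */
static bool should_flush_before_read(bool have_dirty,
                                     unsigned long now,
                                     unsigned long oldest_dirty)
{
    if (!have_dirty)
        return false;
    return (now - oldest_dirty) > MAX_DIRTY_AGE;
}
```

The caller would invoke whatever does the real writeout (write_it_now() in the pseudocode) when this returns true.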
> > Tell me if this is joyful:
> [...]
> > - transaction->t_expires = jiffies + journal->j_commit_interval;
> > + transaction->t_expires = jiffies + dirty_buffer_flush_interval();
>
> This change doesn't take care of kupdated's most interesting feature, i.e.
> that you can entirely stop it (with a flush interval of zero and/or a
> SIGSTOP).
yup.
> Now, if kjournald honoured SIGSTOP/SIGCONT, I could teach noflushd
> to handle the spindown issue in userland. Uh, at least for one small detail:
> Is there a way to tell which kjournald process is associated to which
> partition? A fake cmdline, or an fd to the partition's device node that
> shows up in /proc/<pid>/fd would indeed be quite helpful.
Andreas has a patch which puts the device major/minor into kjournald's
process name.
Simply setting the journal timer to infinity happens to work out OK.
Commits are triggered by kupdate.
This is because kupdate's superblock writeout runs a commit, since
ext3 is unable to distinguish it from a sys_sync(). Sigh.
On Tue, 27 Nov 2001, Daniel Kobras wrote:
> On Fri, Nov 23, 2001 at 05:25:46PM -0800, Andrew Morton wrote:
> > Also, if we had appropriate hooks into the request layer, we could detect
> > when the disk was being spun up for a read, and opportunistically flush
> > out any pending writes.
>
> Actually you can't. SCSI spinup code isn't very useful anyway, and IDE disks
> mostly handle spinup themselves. The kernel has to issue a reset to get a
> disk back alive from sleep mode, but revival from standby doesn't involve
> the kernel at all. When using the disk's internal timer, it isn't involved in
> spindown either. Teaching the request layer about disk state might therefore
> turn out to become rather messy, I suspect.
No messier than corrupted data --
> > Tell me if this is joyful:
> [...]
> > - transaction->t_expires = jiffies + journal->j_commit_interval;
> > + transaction->t_expires = jiffies + dirty_buffer_flush_interval();
>
> This change doesn't take care of kupdated's most interesting feature, i.e.
> that you can entirely stop it (with a flush interval of zero and/or a
> SIGSTOP). Now, if kjournald honoured SIGSTOP/SIGCONT, I could teach noflushd
> to handle the spindown issue in userland. Uh, at least for one small detail:
> Is there a way to tell which kjournald process is associated to which
> partition? A fake cmdline, or an fd to the partition's device node that
> shows up in /proc/<pid>/fd would indeed be quite helpful.
LOL
The low-level spindles can not walk backwards to find a partition because
of the bogus aliased/virtual LBA(0)s that litter a spindle. The LBA(0)
count == Number of Partitions + 1;
This is utter crap but it is scheduled to be fixed in 2.5, now that it has
started.
Solution: do not partition; use the entire raw device. But that will not
work because of the real LBA 0 -- EEK
Cheers,
Andre Hedrick
CEO/President, LAD Storage Consulting Group
Linux ATA Development
Linux Disk Certification Project
On Tue, 27 Nov 2001, Daniel Kobras wrote:
> On Fri, Nov 23, 2001 at 05:25:46PM -0800, Andrew Morton wrote:
> > Also, if we had appropriate hooks into the request layer, we could detect
> > when the disk was being spun up for a read, and opportunistically flush
> > out any pending writes.
>
> Actually you can't. SCSI spinup code isn't very useful anyway, and IDE disks
> mostly handle spinup themselves. The kernel has to issue a reset to get a
> disk back alive from sleep mode, but revival from standby doesn't involve
> the kernel at all. When using the disk's internal timer, it isn't involved in
> spindown either. Teaching the request layer about disk state might therefore
> turn out to become rather messy, I suspect.
Depends on how far you want to take it. The kernel can of course query to
discover whether a device is on standby and delay writes if possible
before actually initiating a flush.
--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."
On Nov 26, 2001 15:40 -0800, Andrew Morton wrote:
> Daniel Kobras wrote:
> > Is there a way to tell which kjournald process is associated to which
> > partition? A fake cmdline, or an fd to the partition's device node that
> > shows up in /proc/<pid>/fd would indeed be quite helpful.
>
> Andreas has a patch which puts the device major/minor into kjournald's
> process name.
It is in CVS HEAD, but appears not to be in the branches. It is below.
This should not have a problem with the 16-byte command length, because
kdevname() only returns strings of the form mm:nn, so my system has:
root 8 1 0 08:58 ? 00:00:11 [kjournald-03:07]
root 39 1 0 08:58 ? 00:00:00 [kjournald-03:05]
root 40 1 0 08:58 ? 00:00:00 [kjournald-03:09]
root 41 1 0 08:58 ? 00:00:00 [kjournald-03:0a]
root 1219 1 0 09:23 ? 00:00:02 [kjournald-3a:01]
Which are all within 16 bytes (including NUL), until we get larger
major/minor numbers.
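A small userspace illustration of that length argument. It assumes the mm:nn string is two lowercase hex digits on each side of the colon, as in the listing above; format_kjournald_name and COMM_LEN are names made up for this example. It uses snprintf() so an oversized name is truncated and reported rather than overrunning the buffer:

```c
#include <stdio.h>

#define COMM_LEN 16 /* buffer size, including the trailing NUL */

/*
 * Build "kjournald-mm:nn" into out.
 * Returns 1 if the whole name fit, 0 if it had to be truncated.
 */
static int format_kjournald_name(char out[COMM_LEN],
                                 unsigned int major, unsigned int minor)
{
    int n = snprintf(out, COMM_LEN, "kjournald-%02x:%02x", major, minor);
    return n >= 0 && n < COMM_LEN;
}
```

"kjournald-03:07" is 15 characters plus the NUL, so it just fits; a three-digit major would make the name 16 characters and it would be cut short.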
Cheers, Andreas
===========================================================================
diff -u -u -r1.11.2.2 -r1.52
--- fs/jbd/journal.c 2001/11/11 05:11:06 1.11.2.2
+++ fs/jbd/journal.c 2001/11/27 00:10:39 1.52
@@ -210,7 +176,7 @@
 	recalc_sigpending(current);
 	spin_unlock_irq(&current->sigmask_lock);
 
-	sprintf(current->comm, "kjournald");
+	sprintf(current->comm, "kjournald-%s", kdevname(journal->j_dev));
 
 	/* Set up an interval timer which can be used to trigger a
 	   commit wakeup after the commit interval expires */
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/