Date: Tue, 16 Nov 2004 18:49:44 -0800
From: Nishanth Aravamudan <nacc@us.ibm.com>
To: linux-kernel@vger.kernel.org
Cc: domen@coderock.org
Subject: schedule_timeout() issues / questions
Message-ID: <20041117024944.GB4218@us.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.6+20040722i
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 6434
Lines: 157

Hi,

After some pretty heavy discussion on IRC, I felt that it may be
important / useful to bring the discussion of schedule_timeout() to
LKML. There are two issues being considered:

1) msleep_interruptible()

For reference, here is the code for this function:

/**
 * msleep_interruptible - sleep waiting for waitqueue interruptions
 * @msecs: Time in milliseconds to sleep for
 */
unsigned long msleep_interruptible(unsigned int msecs)
{
	unsigned long timeout = msecs_to_jiffies(msecs);

	while (timeout && !signal_pending(current)) {
		set_current_state(TASK_INTERRUPTIBLE);
		timeout = schedule_timeout(timeout);
	}
	return jiffies_to_msecs(timeout);
}

The first issue deals with whether the while() loop is at all necessary.
>From my understanding (primarily from how the code "should" behave, but
also backed up by code itself), I think the following code:

	set_current_state(TASK_INTERRUPTIBLE);
	timeout = schedule_timeout(timeout);

should be interpreted as:

	a) I wish to sleep for timeout jiffies; however
	b) If a signal occurs before timeout jiffies have gone by, I
	would also like to wake up.
	
With this interpretation, though, the while()-conditional becomes
questionable. I can see two cases (think inclusive OR not exclusive) for
schedule_timeout() returning:

	a) A signal was received and thus signal_pending(current) will
	be true, exiting the loop. In this case, timeout will be
	some non-negative value (0 is a corner case, I believe, where
	both the timer fires and a signal is received in the last jiffy).
	b) The timer in schedule_timeout() has expired and thus it will
	return 0. This indicates the function has delayed the requested
	time (at least) and timeout will be set to 0, again exiting the
	loop.
	
Clearly, then, if my interpretion is correct, schedule_timeout() will
always return to a state in msleep_interruptible() which causes the loop
to only iterate the one time. Does this make sense? Is my interpretation
of schedule_timeout()s functioning somehow flawed? If not, we probably
can go ahead and change the msleep_interruptible() code, yes?

This is closely tied to a separate issue. I have often seen code check
two separate conditions in response to calling schedule_timeout():
whether signals_pending() is true and whether the return value is 0 (in
separate if statements). Depending on the correctness of my interpretation
again, these are sequentially dependent. Meaning the following code should
be a correct and equivalent way to use schedule_timeout():

	set_current_state(TASK_INTERRUPTIBLE);
	timeout = schedule_timeout(timeout);
	if (signals_pending(current))
		/* respond appropriately to signals */
	else
		/* this must mean that timeout == 0, respond accordingly */

I think this should result in smaller, more uniform code and perhaps
more well-structured looping (as most schedule_timeout()s occur in some
loop). Does this if/else make sense? Since it is contingent on the
previous interpretation, maybe you've already thrown it out... But if
not, this may result in a good number of standardizing patches (which I
think is generally a good thing). This also would make it clear to me
how exactly the kernel interprets TASK_INTERRUPTIBLE /
TASK_UNINTERRUPTIBLE. <rant> Obviously, one's intuition that
TASK_UNINTERRUPTIBLE means don't wake up until the time as gone by,
at least, is wrong, or we wouldn't need the similar while-loop in
msleep()...

/**
 * msleep - sleep safely even with waitqueue interruptions
 * @msecs: Time in milliseconds to sleep for
 */
void msleep(unsigned int msecs)
{
	unsigned long timeout = msecs_to_jiffies(msecs);

	while (timeout) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		timeout = schedule_timeout(timeout);
	}
}

</rant>
							   
2)  schedule_timeout(1)

A second issue arises when considering the prevalence in the kernel of
calling

	set_current_state(TASK_{,UN}INTERRUPTIBLE);
	schedule_timeout(1);

The 1 is rather arbitrary (but most common); any small enough number
will do. I think these code segments are now "unintentional" (in the
majority, at least; in certain cases, they may certainly be intended).
The reason I think this, is that when HZ==100 this is a pretty long
delay in human-time (~10ms, assuming no signals...). So maybe code
authors actually intended for that length of a delay, but since msleep()
and family did not exist, just went with schedule_timeout(). Of course,
since 1 jiffy of delay was equivalent to 10ms, it makes the dependency
on HZ very hard to see. This is just a theory, though, and I could be
completely wrong.

The main point, though, is that this code now introduces excessive
overhead. Upon examining the code of schedule_timeout(), one sees the
gist of the function to be:

	a) Set up a timer to go off after a certain number of jiffies
	b) Call schedule()
	c) Upon returning, delete the timer and return the number of
	jiffies left (if positive) or 0

Now, with architectures where HZ==1000 or more, I think the overhead of
setting up the timer and then calling schedule() may result in the timer
going off in the middle of the schedule() call. This, of course, is not
great, because it could mean that another task will start up (for 1
jiffy!), thrash the cache perhaps, and then be preempted by my task
again. It would be better, I think to call schedule() directly in these
cases where the intention is: give up the CPU. It results in similar
behavior (excepting that I won't necessarily be able to run in a jiffy
or close to that time), but unless I misunderstand things, that is the
same situation as with the timer in schedule_timeout() (i.e., I might be
completely wrong, though, if somehow when the timer goes off, my task
gets put to the front of it's priority queue). 

It just seems to me that we are putting in excessive code to just give
up the CPU (the intention of some of the calls, as I see it). I would like
to hear other opinions. If I am close to right, though, this will result
in a good number of (small) patches which will help clean up the kernel
a bit.

Please excuse my lack of brevity. There were a lot of ideas, and I
wanted to make myself pretty clear. Any comments would be greatly
appreciated.

Thanks,
Nish
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/