Date: Sat, 18 Sep 2004 22:35:29 +0200
From: Petr Vandrovec <vandrove@vc.cvut.cz>
To: Stas Sergeev <stsp@aknet.ru>
Cc: linux-kernel@vger.kernel.org
Subject: Re: ESP corruption bug - what CPUs are affected?
Message-ID: <20040918203529.GA4447@vana.vc.cvut.cz>
References: <3BFF2F87096@vcnet.vc.cvut.cz> <414C662D.5090607@aknet.ru> <20040918165932.GA15570@vana.vc.cvut.cz> <414C8924.1070701@aknet.ru>
In-Reply-To: <414C8924.1070701@aknet.ru>

On Sat, Sep 18, 2004 at 11:14:44PM +0400, Stas Sergeev wrote:
> Hi,
> 
> Petr Vandrovec wrote:
> >>Does this look reasonable? If it does, I think I
> >>should just start implementing that.
> >Do not forget that you have to implement also return to CPL1, as
> >NMI may arrive while you are running on CPL1.  So it may not be
> >as trivial as it seemed. 
> I am not sure what special actions have to be
> taken here compared to returning to ring-3 from NMI.
> Is there anywhere in the sources an example to take
> a look at? (sorry for the newbie questions)

It means that you cannot blindly create a CPL1 trampoline stack
in some static per-cpu area.  But if we can assume that there
is no other CPL1 code in the system, something like the code below
could work:

/* + 20 [word 5]  SS
   + 16 [word 4]  ESP
   + 12 [word 3]  EFLAGS
   +  8 [word 2]  CS
   +  4 [word 1]  EIP
   +  0 [word 0]  ESP for popl %esp
*/
u_int32_t cpl1stacks[NUM_CPUS][6];

curCPU = smp_processor_id();
minSP = curCPU * sizeof(cpl1stacks[0]);
maxSP = minSP + sizeof(cpl1stacks[0]);
cpl1stack = cpl1stacks[curCPU];

if ((cpl0stack[retCS] & 3) == 1) {
	/* Going back to our trampoline */
	/* There is no other place in the kernel running on CPL1
	   except our trampoline, so the interrupt could occur either
	   on popl %esp or on iret.  If it occurred on popl %esp,
	   just return; the code will do the right thing.  If the
	   interrupt occurred on iret, we have to perform popl %esp
	   again, so that the upper bits of %esp are correctly
	   restored for CPL3 code */
	ASSERT(cpl0stack[retCS] == FLAT_4G_CPL1_CS);
	ASSERT(cpl0stack[retSS] == SMALL_CPL1_SS);
	if (cpl0stack[retEIP] == fixup_proc) {
		ASSERT(cpl0stack[retESP] == minSP);
	} else if (cpl0stack[retEIP] == fixup_proc_iret) {
		ASSERT((cpl0stack[retESP] & 0xFFFF) == minSP + 4);
		/* Undo popl %esp - copy value from ESP we were
                   using on CPL1 back to stack */
		cpl1stack[0] = cpl0stack[retESP];
		cpl0stack[retEIP] = fixup_proc;
		cpl0stack[retESP] = minSP;
	} else {
		/* unexpected code running on CPL1 */
		/* Probably do simple IRET and hope for the best? */
		ASSERT(0);
	}
	iret;
} else {
	cpl1stack[5] = cpl0stack[retSS];
	cpl1stack[4] = cpl0stack[retESP];
	cpl1stack[3] = cpl0stack[retEFLAGS];
	cpl1stack[2] = cpl0stack[retCS];
	cpl1stack[1] = cpl0stack[retEIP];
	cpl1stack[0] = (cpl0stack[retESP] & 0xFFFF0000) | (minSP + 4);
	cpl0stack[retSS] = SMALL_CPL1_SS;
	cpl0stack[retESP] = minSP;
	/*
	   Do NOT clear IF... The IF flag is affected only if IOPL >= CPL,
	   so with IOPL=0 an IRET on CPL1 won't reenable interrupts.  This
	   is the reason why we cannot use RETF to return from CPL0 to CPL1
	   (retf is much faster than iret on P4) (and we cannot use retf
	   for CPL1->CPL3 due to TF/RF).

	   Clear TF so we do not start tracing on CPL1 if we trace
	   userspace, and clear RF, so if somebody intentionally pointed
	   a hardware breakpoint into the CPL1 handler, it will be
	   triggered (we must use this path for returns from INT1 too,
	   so it is possible that RF is set in EFLAGS on stack).
	*/
	cpl0stack[retEFLAGS] &= ~(EFLAGS_TF | EFLAGS_RF);
	cpl0stack[retCS] = FLAT_4G_CPL1_CS;
	cpl0stack[retEIP] = fixup_proc;
	iret;
}

fixup_proc:
	popl %esp
fixup_proc_iret:
	iret


It assumes that there is one new 32bit CPL1 flat CS descriptor in the GDT,
and one 16bit (small) CPL1 SS descriptor (grows up, with limit 24*<num_cpus>
and base of cpl1stacks), plus <num_cpus> 24byte CPL1 stacks (max. 64KB,
so for kernels with more than 2730 CPUs you need more than one stack
descriptor).
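
The 2730 figure is just 64KB divided by the 24-byte stack size, rounded
down.  A small sketch of that arithmetic, with made-up helper names
(stacks_per_segment() and descriptors_needed() are not from any existing
code):

```c
#include <assert.h>
#include <stdint.h>

#define CPL1_STACK_BYTES 24u      /* 6 words of 4 bytes per CPU */
#define SEG16_MAX_BYTES  0x10000u /* 64KB reachable through a 16-bit SP */

/* Number of whole per-CPU stacks that fit in one small stack segment. */
static unsigned stacks_per_segment(void)
{
	return SEG16_MAX_BYTES / CPL1_STACK_BYTES;
}

/* Stack descriptors needed for a given CPU count, rounding up. */
static unsigned descriptors_needed(unsigned num_cpus)
{
	unsigned per_seg = stacks_per_segment();

	return (num_cpus + per_seg - 1) / per_seg;
}
```

With these numbers 2730 CPUs still fit behind a single descriptor, and
the 2731st CPU is the first one that needs a second descriptor.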


> >Maybe all these programs survive that
> >their CPL3 stack changes,
> Most likely they will, I am just not sure. What
> if they disabled interrupts and are switching the
> stack by loading the SS and ESP separately? If we
> interrupt it there, there may be the problems, which
> would be almost impossible to track down later.
> It just looks a bit unsafe to me. Or maybe exploit
> a sigaltstack for that? Hmm, is implementing the
> CPL1 trampoline really that difficult after all?
> I think it is somewhat cleaner and maybe safer.

Since the pseudocode above handles even NMIs with just 24 bytes
of stack on CPL1, it is definitely the preferred solution.

> >and AFAIK LAR is microcoded on P4.
> Where does this lead us to? Some other problems I
> am not aware about?

It is slow.  No other problems, except that doing
(((ss & 4) ? ldt[ss >> 3] : gdt[ss >> 3]) & 0x00????00) == 0x00????00;
may be faster than doing (lar(ss) & 0x00????00) == 0x00????00.
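
A minimal sketch of the table-lookup variant, assuming for simplicity
that the tables hold only the high dword of each descriptor.  The names
sel_rpl(), sel_is_ldt(), sel_index() and desc_has_bits() are
hypothetical, and the attribute mask is a caller-supplied parameter
because the exact bits are left open above.  There is no limit check
here, which real code would need.

```c
#include <assert.h>
#include <stdint.h>

/* x86 selector fields: bits 0-1 RPL, bit 2 table indicator (set means
   LDT), bits 3-15 descriptor index. */
static unsigned sel_rpl(uint16_t sel)    { return sel & 3; }
static int      sel_is_ldt(uint16_t sel) { return (sel & 4) != 0; }
static unsigned sel_index(uint16_t sel)  { return sel >> 3; }

/* Return 1 if every bit of want_mask is set in the high dword of the
   descriptor named by sel.  gdt_hi and ldt_hi are simplified tables
   with one high dword per descriptor (an assumption of this sketch). */
static int desc_has_bits(uint16_t sel, const uint32_t *gdt_hi,
			 const uint32_t *ldt_hi, uint32_t want_mask)
{
	const uint32_t *table = sel_is_ldt(sel) ? ldt_hi : gdt_hi;

	return (table[sel_index(sel)] & want_mask) == want_mask;
}
```

This replaces one microcoded instruction with a shift, a test and a
memory load, which is where any speedup would come from.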
						Petr
