Now that it seems everyone agrees that the sys_call_table
symbol shall not be exported to modules, is there any work in progress
to allow modules to get an event/notification whenever a specific
syscall is called?
We have a specific need to trace mmap() and sbrk() calls.
--
_________________________________________________________________________
Terje Eggestad mailto:[email protected]
Scali Scalable Linux Systems http://www.scali.com
Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal +47 975 31 574 (MOBILE)
N-0619 Oslo fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________
On Mon, May 05, 2003 at 10:19:45AM +0200, Terje Eggestad wrote:
> Now that it seem that all are in agreement that the sys_call_table
> symbol shall not be exported to modules, are there any work in progress
> to allow modules to get an event/notification whenever a specific
> syscall is being called?
No.
> We have a specific need to trace mmap() and sbrk() calls.
Well, you get mmap events for your driver and I can't imagine a sane
reason for intercepting sbrk(). Do you have a pointer to the driver
source doing such strange things?
On Mon, 2003-05-05 at 10:19, Terje Eggestad wrote:
> Now that it seem that all are in agreement that the sys_call_table
> symbol shall not be exported to modules, are there any work in progress
> to allow modules to get an event/notification whenever a specific
> syscall is being called?
>
> We have a specific need to trace mmap() and sbrk() calls.
Such trace hooks can surely be put in the mmap and sbrk calls themselves
by means of a patch for your systems?
On Mon, May 05, 2003 at 04:01:25PM +0700, Dmitry A. Fedorov wrote:
> Almost all of my third-party drivers are broken by this.
> What is worse, redhat "backported" this "feature" to their 2.4
> patched kernels and now I should distinguish 2.4 and "redhat 2.4"
> in my compatibility headers.
What about just fixing your drivers instead of moaning? If you submit
a pointer to your driver source and explain what you want to do someone
might even help you..
On 5 May 2003, Terje Eggestad wrote:
> Now that it seem that all are in agreement that the sys_call_table
> symbol shall not be exported to modules, are there any work in progress
No, I disagree.
> to allow modules to get an event/notification whenever a specific
> syscall is being called?
I need this table to _call_ any of the system calls that are available to
the process, nothing else. sys_call_table can be placed in the .rodata section
(there was a patch a few days ago) to prevent modification from modules.
But why should a module not have the ability to call any function which is
available from user space?
Almost all of my third-party drivers are broken by this.
What is worse, redhat "backported" this "feature" to their 2.4
patched kernels and now I have to distinguish 2.4 and "redhat 2.4"
in my compatibility headers.
On Mon, 2003-05-05 at 11:01, Dmitry A. Fedorov wrote:
> But why module should not have ability to call any function which is
> available from user space?
That's why you can just call sys_read() and co directly.
Unfortunately we live in an insane world.
First of all, in the Changelog where the export was removed for 2.5.41
http://www.kernel.org/pub/linux/kernel/v2.5/ChangeLog-2.5.41
Arjan lists 4 reasons for having the export in the first place, and I'm
on point 3. Here Arjan pretty much acknowledges that there is a
legitimate need for an event/hook system to be informed of a syscall.
The exact quote is: "Eg the use of the export in this just a bandaid due
to lack of a proper mechanism".
My argument for *why* there should be a mechanism stops here.
Since you're bright and inquisitive: the exact problem I'm facing is pretty
complex:
1. performance is everything.
2. We're making an MPI library, and as such we don't have any control
over the application.
3a. The various hardware for cluster interconnect all work with DMA.
3b. the performance loss from copying from a receive area to the
userspace buffer is unacceptable.
3c. It's therefore necessary for HW to access user pages.
4. In order to do 3, the user pages must be pinned down.
5. the way MPI is written, it's not using a special malloc() to allocate
the send/receive buffers. It can't, since it would break language binding
to Fortran. Thus ANY writeable user page may be used.
6. point 4: pinning is VERY expensive (point 1), so I can't pin the
buffers every time they're used.
7. The only way to cache buffers (to see if they've been used before and
hence pinned) is by the user space virtual address. A syscall, thus an ioctl
to a device file, is prohibitively expensive under point 1.
8a. if the app (glibc in practice, but you never know) uses sbrk() with a
negative arg, and then a positive argument, I can get a different set
of user pages with the same address.
8b. ditto with a munmap()/mmap() pair.
9. since the number of times any 'realloc' may happen is << the
number of times any buffer may be used, it's necessary under point 1
to trace changes of virtual address to phys page mappings, rather than test
every time an address is being used.
10. kernel patches are impractical, I must be able to do this with std
stock, redhat, AND suse kernels.
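Points 6 through 9 describe, in effect, a registration cache keyed by user
virtual address that must be invalidated whenever a mapping goes away. A
minimal user-space sketch of that idea follows, assuming a small fixed-size
table. The names pin_cache, pin_cache_lookup, pin_cache_insert and
pin_cache_invalidate are made up for illustration and are not taken from any
driver discussed here; the pinning itself (done inside the interconnect
drivers) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PIN_CACHE_SLOTS 64   /* assumed size, for illustration only */

/* One cached registration: [start, start+len) is believed to be pinned. */
struct pin_entry {
    uintptr_t start;
    size_t    len;
    int       valid;
};

struct pin_cache {
    struct pin_entry slot[PIN_CACHE_SLOTS];
};

/* Returns 1 if [addr, addr+len) lies entirely inside a valid cached range. */
static int pin_cache_lookup(const struct pin_cache *c, uintptr_t addr, size_t len)
{
    for (size_t i = 0; i < PIN_CACHE_SLOTS; i++) {
        const struct pin_entry *e = &c->slot[i];
        if (e->valid && addr >= e->start &&
            addr - e->start <= e->len && len <= e->len - (addr - e->start))
            return 1;
    }
    return 0;
}

/* Records a freshly pinned range; returns 0 on success, -1 if the table is full. */
static int pin_cache_insert(struct pin_cache *c, uintptr_t start, size_t len)
{
    assert(len > 0);
    for (size_t i = 0; i < PIN_CACHE_SLOTS; i++) {
        struct pin_entry *e = &c->slot[i];
        if (!e->valid) {
            e->start = start;
            e->len = len;
            e->valid = 1;
            return 0;
        }
    }
    return -1;
}

/* Drops every entry overlapping [start, start+len); returns how many were dropped. */
static int pin_cache_invalidate(struct pin_cache *c, uintptr_t start, size_t len)
{
    int dropped = 0;
    for (size_t i = 0; i < PIN_CACHE_SLOTS; i++) {
        struct pin_entry *e = &c->slot[i];
        if (e->valid && e->start < start + len && start < e->start + e->len) {
            e->valid = 0;
            dropped++;
        }
    }
    return dropped;
}
```

A send path would call pin_cache_lookup() first and fall back to the expensive
pin only on a miss; whatever observes munmap() or sbrk(-n) would call
pin_cache_invalidate() so that a reused address is not mistaken for a pinned one.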
On Mon, 2003-05-05 at 10:23, Christoph Hellwig wrote:
> On Mon, May 05, 2003 at 10:19:45AM +0200, Terje Eggestad wrote:
> > Now that it seem that all are in agreement that the sys_call_table
> > symbol shall not be exported to modules, are there any work in progress
> > to allow modules to get an event/notification whenever a specific
> > syscall is being called?
>
> No.
>
> > We have a specific need to trace mmap() and sbrk() calls.
>
> Well, you get mmap events for your driver and I can't imagine a sane
> reason for intercepting sbrk(). Do you have a pointer to the driver
> source doing such strange things?
On Mon, May 05, 2003 at 11:33:36AM +0200, Terje Eggestad wrote:
> 1. performance is everything.
> 2. We're making a MPI library, and as such we don't have any control
> with the application.
> 3a. The various hardware for cluster interconnect all work with DMA.
> 3b. the performance loss from copying from a receive area to the
> userspace buffer is unacceptable.
> 3c. It's therefore necessary for HW to access user pages.
> 4. In order to to 3, the user pages must be pinned down.
see how AIO does this, and O_DIRECT, and rawio.
They all have the same requirement and manage to cope.
On Mon, 2003-05-05 at 11:38, Arjan van de Ven wrote:
> On Mon, May 05, 2003 at 11:33:36AM +0200, Terje Eggestad wrote:
> > 1. performance is everything.
> > 2. We're making a MPI library, and as such we don't have any control
> > with the application.
> > 3a. The various hardware for cluster interconnect all work with DMA.
> > 3b. the performance loss from copying from a receive area to the
> > userspace buffer is unacceptable.
> > 3c. It's therefore necessary for HW to access user pages.
> > 4. In order to to 3, the user pages must be pinned down.
>
> see how AIO does this, and O_DIRECT, and rawio.
>
> They all have the same requirement and manage to cope.
OK, I haven't actually checked the code, but no, they don't have the
same requirement. They pin and unpin the user space memory at the
beginning and end of each operation.
Take AIO pseudocode:

aio_write()
{
    pinmem();
    if (file)
        add_write_to_disk_queue();
    ...
}

kernel_aio_completion_handler()
{
    unpinmem();
    send_completion_event_to_task();
}
On Mon, May 05, 2003 at 11:33:36AM +0200, Terje Eggestad wrote:
> 1. performance is everything.
then Linux is the wrong OS for you :)
> 2. We're making a MPI library, and as such we don't have any control
> with the application.
I can't remember the MPI spec saying anything about intercepting
syscalls..
> 3b. the performance loss from copying from a receive area to the
> userspace buffer is unacceptable.
> 3c. It's therefore necessary for HW to access user pages.
> 4. In order to to 3, the user pages must be pinned down.
> 5. the way MPI is written, it's not using a special malloc() to allocate
> the send receive buffers. It can't since it would break language binding
> to fortran. Thus ANY writeable user page may be used.
so use get_user_pages.
> 6. point 4: pinning is VERY expensive (point 1), so I can't pin the
> buffers every time they're used.
Umm, pinning memory all the time means you get a bunch of nice DoS
attacks due to the huge amount of memory.
> 7. The only way to cache buffers (to see if they're used before and
> hence pinned) is the user space virtual address. A syscall, thus ioctl
> to a device file is prohibitive expensive under point 1.
That's a horribly b0rked approach..
Again, where's your driver source so we can help you to find a better
approach out of that mess?
On Mon, 2003-05-05 at 12:25, Christoph Hellwig wrote:
> On Mon, May 05, 2003 at 11:33:36AM +0200, Terje Eggestad wrote:
> > 1. performance is everything.
>
> then Linux is the wrong OS for you :)
>
Strangely enough not. You just have to try and stay out of the kernel as
much as possible ;-)
Of course some idiot sold the total-cost-of-ownership thingy of linux to
the customers. What they really need is an OS/360...
> > 2. We're making a MPI library, and as such we don't have any control
> > with the application.
>
> I can't remember that the MPI spec tells anything about intercepting
> syscalls..
>
It says quite a bit about what memory can be used for comm buffers.
> > 3b. the performance loss from copying from a receive area to the
> > userspace buffer is unacceptable.
> > 3c. It's therefore necessary for HW to access user pages.
> > 4. In order to to 3, the user pages must be pinned down.
> > 5. the way MPI is written, it's not using a special malloc() to allocate
> > the send receive buffers. It can't since it would break language binding
> > to fortran. Thus ANY writeable user page may be used.
>
> so use get_user_pages.
Let me clarify: pinning pages is not, repeat not, a problem.
The problem occurs when you
1. pin a buffer
2. sbrk(-n) or munmap() (usually through free()) the area holding the buffer
3. do a new malloc() resulting in a sbrk(+n) or mmap()
4. then my new buffer has exactly the same virtual address as the prev.
(believe it or not this happens, and relatively frequently).
>
> > 6. point 4: pinning is VERY expensive (point 1), so I can't pin the
> > buffers every time they're used.
>
> Umm, pinning memory all the time means you get a bunch of nice DoS
> attacks due to the huge amount of memory.
>
These are HPC clusters. DoS is a non-issue. These are not normal multi-user
systems. In fact you run one active process per CPU.
> > 7. The only way to cache buffers (to see if they're used before and
> > hence pinned) is the user space virtual address. A syscall, thus ioctl
> > to a device file is prohibitive expensive under point 1.
>
> That's a horribly b0rked approach..
>
It's *FAST*.
> Again, where's your driver source so we can help you to find a better
> approach out of that mess?
>
The trace module we made to trace munmap() and sbrk() could be opened,
but you'll be disappointed since all the pinning (get_user_pages() and
friends), send(), recv() etc. are in the drivers for the various hardware,
most of which are not our property.
The module works as follows. It catches sbrk(-arg) and munmap() and lays
out the trace info in a memory area mmap()'able through the device file.
So when processes need the trace info they have it in memory and
avoid doing an ioctl().
That's all we need to know whether a given virtual address needs to be
(re)pinned.
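The mmap()'able trace area could look roughly like the sketch below: a ring
of unmap events plus a counter, which a reader polls without entering the
kernel. This is an assumed layout, not the actual module. trace_area,
trace_publish and trace_poll are hypothetical names, the writer here is an
ordinary function standing in for the kernel side, and a real shared mapping
would also need memory barriers, which are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_SLOTS 8   /* assumed ring size, for illustration only */

/* One recorded unmap: the range [start, start+len) went away. */
struct unmap_event {
    uintptr_t start;
    size_t    len;
};

/* Layout of the shared area; only the writer ever advances head. */
struct trace_area {
    unsigned long      head;              /* total events ever published */
    struct unmap_event ev[TRACE_SLOTS];   /* ring buffer of recent events */
};

/* Writer side (stand-in for the tracing module). */
static void trace_publish(struct trace_area *t, uintptr_t start, size_t len)
{
    assert(len > 0);
    t->ev[t->head % TRACE_SLOTS].start = start;
    t->ev[t->head % TRACE_SLOTS].len = len;
    t->head++;
}

/*
 * Reader side: copies events published since *seen into out[] (at most max).
 * Returns the number copied, or -1 if the writer lapped the reader, in which
 * case events were lost and the caller should distrust its whole cache.
 */
static int trace_poll(const struct trace_area *t, unsigned long *seen,
                      struct unmap_event *out, int max)
{
    unsigned long head = t->head;
    int n = 0;

    if (head - *seen > TRACE_SLOTS) {
        *seen = head;
        return -1;
    }
    while (*seen != head && n < max) {
        out[n++] = t->ev[*seen % TRACE_SLOTS];
        (*seen)++;
    }
    return n;
}
```

When nothing has been unmapped, the reader's cost is one load and one compare
of the counter, which is the property the design is after.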
Let's deal: I'll GPL the trace module if you get me an
EXPORT_SYMBOL_GPL(sys_call_table);
TJ
On Mon, May 05, 2003 at 01:23:19PM +0200, Terje Eggestad wrote:
> Lets deal, I'll GPL the trace module if you get me a
> EXPORT_SYMBOL_GPL(sys_call_table);
The sys call table is not un-exported for license-political reasons.
It's unexported because there is no correct use for it and it can't
be used correctly either. Tell me, which lock does your module use to protect
modifications to it? Tell me how you handle other modules trying to
overload the same syscall, and those modules loading before your module but
then unloading while yours is still loaded?
It's the wrong mechanism to do ANYTHING. Really.
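The unload-order problem can be reproduced in plain user space. The sketch
below models one table slot as a function pointer and two hypothetical
"modules" that each save and later restore whatever they found there. All
names are invented for illustration and nothing here touches a real syscall
table.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*syscall_fn)(long);

/* Stand-in for one slot of a syscall table (user-space model only). */
static syscall_fn table_slot;

static long orig_call(long x) { return x; }
static long hook_a(long x)    { return x + 1; }
static long hook_b(long x)    { return x + 2; }

/* Each "module" remembers what it found in the slot when it loaded. */
struct hook_module {
    syscall_fn mine;
    syscall_fn saved;
};

static void module_load(struct hook_module *m, syscall_fn fn)
{
    assert(fn != NULL);
    m->mine = fn;
    m->saved = table_slot;   /* unlocked read-modify-write: racy in a kernel */
    table_slot = fn;
}

static void module_unload(struct hook_module *m)
{
    table_slot = m->saved;   /* blindly restores, whatever is installed now */
}

/* Loads A then B, unloads A then B; returns what is left in the slot. */
static syscall_fn run_out_of_order_unload(void)
{
    struct hook_module a, b;

    table_slot = orig_call;
    module_load(&a, hook_a);
    module_load(&b, hook_b);
    module_unload(&a);       /* slot = orig_call: B's hook is silently lost */
    module_unload(&b);       /* slot = hook_a: points into "unloaded" A */
    return table_slot;
}
```

After both modules are gone the slot still refers to hook_a. In a kernel that
would be a pointer into freed module text, and no lock inside either module
can prevent it.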
On Mon, May 05, 2003 at 01:31:19PM +0200, Terje Eggestad wrote:
> In all fairness this should be done in glibc,
... or a LD_PRELOAD library......
On Mon, 2003-05-05 at 13:23, Terje Eggestad wrote:
>
> > Again, where's your driver source so we can help you to find a better
> > approach out of that mess?
> >
>
> The trace module we made to trace munmap() and sbrk() could be opened,
> but you'll be disappointed since all the pinning ( get_user_pages() and
> friends), send() recv() etc are in the drivers for the various hardware,
> most of which are not our property.
>
> The module works as follows. It catches sbrk(-arg) and munmap() and lays
> out the trace info in a memory area mmap()'able thru the device file.
> So when processes need the trace info they have the info in memory to
> avoid doing a ioctl().
>
> Thats all we need to know if a given virtual address needs to be
> (re)pinned.
>
In all fairness this should be done in glibc, but the task of getting it
done there was several orders of magnitude larger than just adding the
syscall intercepts. Serves you right for writing clean code :-)
The thing is of course this *worked* until someone decided to remove the
export of sys_call_table.
Which is a decision that is most probably right, I just need another way
of getting a hook or notification of the sys calls.
>
>
> TJ
On Mon, 2003-05-05 at 10:33, Terje Eggestad wrote:
> 1. performance is everything.
Then you can live with building custom patched kernels
> 2. We're making a MPI library, and as such we don't have any control
> with the application.
LD_PRELOAD
> 3c. It's therefore necessary for HW to access user pages.
Like TV cards do. That isn't hard.
> 4. In order to to 3, the user pages must be pinned down.
> 5. the way MPI is written, it's not using a special malloc() to allocate
> the send receive buffers. It can't since it would break language binding
> to fortran. Thus ANY writeable user page may be used.
Well not all the pages are guaranteed DMAable, so I guess you already
lost.
> 10. kernel patches are impractical, I must be able to do this with std
> stock, redhat, AND suse kernels.
So you want every vendor to screw up their kernels and the base kernel
for an obscure (but fun) corner case. That's not a rational choice, is it?
If you want "performance is everything", you pay the price; don't make
everyone suffer.
On Mon, May 05, 2003 at 01:23:19PM +0200, Terje Eggestad wrote:
> The problem occurs when you
> 1. pin a buffer
> 2. sbrk(-n) or munmap() (usually through free()) the area holding the buffer
> 3. do a new malloc() resulting in a sbrk(+n) or mmap()
> 4. then my new buffer has exactly the same virtual address as the prev.
>
> (believe it or not this happens, and relatively frequently).
That only shows that you really don't want to use glibc's malloc and
sbrk implementations, but ones that are implemented as mmap in your
driver so you can keep track of it properly. LD_PRELOAD is your friend.
> Lets deal, I'll GPL the trace module if you get me a
> EXPORT_SYMBOL_GPL(sys_call_table);
Who cares about your trace module? That's the wrong approach to start
with. And the removal of the sys_call_table export is not a political
issue but a technical one. The interesting thing would be your memory
manager, but given the above hints you really should be able to fix it yourself
now..
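As a rough illustration of the LD_PRELOAD route, here is a sketch of a
munmap() wrapper that records the range before issuing the real system call.
note_unmapped(), the bookkeeping variables and munmap_hook_selfcheck() are
hypothetical; the sketch uses syscall(SYS_munmap, ...) rather than a
dlsym(RTLD_NEXT, ...) lookup purely to stay short. As far as I know, glibc's
malloc reaches brk and munmap through internal symbols, so a wrapper like this
would only see calls that go through the exported munmap(), which may be one
reason to replace the allocator rather than the syscall wrappers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical bookkeeping: number of intercepted calls and the last range. */
static unsigned long unmap_calls;
static void  *last_unmapped_addr;
static size_t last_unmapped_len;

/* Hypothetical hook: a real library would invalidate its pin cache here. */
static void note_unmapped(void *addr, size_t len)
{
    unmap_calls++;
    last_unmapped_addr = addr;
    last_unmapped_len = len;
}

/*
 * Same name and signature as the libc wrapper, so the dynamic linker resolves
 * the application's munmap() calls here when this object comes first in the
 * symbol lookup order (for example when loaded via LD_PRELOAD).
 */
int munmap(void *addr, size_t len)
{
    note_unmapped(addr, len);
    /* Issue the system call directly instead of calling libc's wrapper. */
    return (int)syscall(SYS_munmap, addr, len);
}

/* Maps one anonymous region, unmaps it, and checks that the hook saw it. */
static int munmap_hook_selfcheck(void)
{
    unsigned long before = unmap_calls;
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int rc;

    assert(p != MAP_FAILED);
    rc = munmap(p, 4096);
    assert(unmap_calls == before + 1);
    assert(last_unmapped_addr == p && last_unmapped_len == 4096);
    return rc;
}
```

Compiled as a shared object and named in LD_PRELOAD, the wrapper needs no
change to the application, but it does nothing for statically linked binaries.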
On Mon, 2003-05-05 at 13:16, Alan Cox wrote:
> On Mon, 2003-05-05 at 10:33, Terje Eggestad wrote:
> > 1. performance is everything.
>
> Then you can live with building custom patched kernels
>
If there were numerous issues, sure. But every time we get to the point
where it seems that is necessary, we find a workaround.
Right now, this is the ONLY issue we've got..
> > 2. We're making a MPI library, and as such we don't have any control
> > with the application.
>
> LD_PRELOAD
>
In general LD_PRELOAD is fun for testing and academic programs, but not
for production code.
Specifically, you run into a problem with how Fortran 90 compilers do
dynamic arrays. It's very compiler dependent.
> > 3c. It's therefore necessary for HW to access user pages.
>
> Like TV cards do. That isnt hard
>
nobody said it is.
> > 4. In order to to 3, the user pages must be pinned down.
> > 5. the way MPI is written, it's not using a special malloc() to allocate
> > the send receive buffers. It can't since it would break language binding
> > to fortran. Thus ANY writeable user page may be used.
>
> Well not all the pages are guaranteed DMAable, so I guess you already
> lost.
>
Nope. The drivers test to see if the page is DMAable, and do a copy if
necessary. Most of the high-performance interconnect NICs do 64-bit
PCI.
> > 10. kernel patches are impractical, I must be able to do this with std
> > stock, redhat, AND suse kernels.
>
> So you want every vendor to screw up their kernels and the base kernel
> for an obscure (but fun) corner case. Thats not a rational choice is it.
> You want "performance is everything" you pay the price, don't make
> everyone suffer.
No! I don't disagree with removing the export of sys_call_table!
I just want the "proper mechanism" indicated by Arjan in the changelog.
Please read this thread. There are legitimate uses for syscall
hooks/notifications, whether you think mine is one or not.
On Mon, May 05, 2003 at 04:01:25PM +0700, Dmitry A. Fedorov wrote:
>> But why module should not have ability to call any function which is
>> available from user space?
>> Almost all of my third-party drivers are broken by this.
>> What is worse, redhat "backported" this "feature" to their 2.4
>> patched kernels and now I should distinguish 2.4 and "redhat 2.4"
>> in my compatibility headers.
From: Arjan van de Ven <[email protected]>
> That's why you can just call sys_read() and co directly.
Yes, for redhat kernels - almost all of the sys_* functions are exported.
And then there is kernel.org's one with only a few sys_* exported.
And how will I distinguish redhat's kernel from the other ones? There is
nothing like #define REDHAT_PATCHED in the headers.
I don't want to have a separate driver source version
for each incompatible kernel variant; I prefer to have a single
driver source which is adapted to the user's environment at compilation
time.
From: Christoph Hellwig <[email protected]>
> What about just fixing your drivers instead of moaning? If you
> submit a pointer to your driver source and explain what you want to
> do someone might even help you..
Of course, I will fix my drivers (permanent kernel changes
provide us with a maintenance job forever :).
For example:
http://www.rtdusa.com/software/RTDFinland/ECAN_Linux.zip
http://www.rtdusa.com/software/RTDFinland/UPS25_Linux.ZIP
I use the following calls:
sys_mknod
sys_chown
sys_umask
sys_unlink
for creating/deleting /dev entries dynamically on driver
loading/unloading. It allows me to acquire a dynamic major
number without devfs or an external utility of any kind.
And there is no risk of a clash with statically assigned major
numbers, as there is for many other third-party sources.
It has worked for a long time with all kernels from 2.0 to 2.4 (except the
last redhat's 2.4) and it should work with 2.6, I hope.
I use sys_write to output loading/device detection/diagnostic
messages to the process's stderr when appropriate. Yes, it may look like the
"wrong thing" but it uses only legal kernel mechanisms and it saves
lots of time with e-mail support:
/sbin/insmod driver verbose=1 2>&1 | mail -s 'it does not works' me@
It would be nice if either sys_call_table were left exported and placed in
a read-only data section to prevent modification (do you want just that?)
or _all_ of the sys_* functions were exported in the original kernel.
temper, temper
Please read my reply to Alan carefully.
Doing our own malloc(), free(), m[un]map() is a possibility we've
considered. Since we've got our own lib linked with the app, we probably
wouldn't even need LD_PRELOAD. Our main issue is that not everything is
gcc/g77.
Of all the approaches, the syscall traps were the least intrusive and most
portable, believe it or not.
BTW: these are all technical issues.
On Mon, 2003-05-05 at 14:52, Christoph Hellwig wrote:
>
> That only shows that you really don't want to use glibc's malloc and
> sbrk implementations, but ones that are implemented as mmap in your
> driver so you can keep track of it properly. LD_PRELOAD is your friend.
> Who cares about your trace module? That's the wrong approach to start
> with. And the removal of the sys_call_table export is not a political
> issue but a technical one. The interesting thing would be your memory
> manager, but given the above hints you really should be able to fix it yourself
> now..
> I use the following calls:
>
> sys_mknod
> sys_chown
> sys_umask
> sys_unlink
>
> for creating/deleting /dev entries dynamically on driver
> loading/unloading. It allows me to acquire dynamic major
> number without devfs and external utility of any kind.
> And there is no risk of intersection with statically assigned major
> numbers, as it is for many others third-party sources.
You don't want to tell me you do that for real, do you?
That alone is a very good reason to unexport the syscall table without
exporting those symbols..
On Mon, May 05, 2003 at 03:41:23PM +0200, Terje Eggestad wrote:
> temper, temper
>
> pls read my reply to alan carefully .
>
> Doing own malloc(), free(), m[un]map(), is a possibility we've
> considered. Since we've got our own lib linked with the app, we probably
> wouldn't even need LD_PRELOAD. our main issue is that not everything is
> gcc/g77.
Well, if the compiler doesn't play nicely with that, that's your / the
compiler vendor's problem. Especially if it's not available in source
code..
On Mon, May 05, 2003 at 08:30:38PM +0700, Dmitry A. Fedorov wrote:
> I use the following calls:
>
> sys_mknod
> sys_chown
> sys_umask
> sys_unlink
>
> for creating/deleting /dev entries dynamically on driver
> loading/unloading. It allows me to acquire dynamic major
> number without devfs and external utility of any kind.
> And there is no risk of intersection with statically assigned major
> numbers, as it is for many others third-party sources.
*yuck*
Do that from modprobe. "No external utility" is not a virtue, especially
when said utility is a trivial shell script.
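For the external-utility route, the dynamically assigned major can be read
back from /proc/devices once the module is loaded. A small sketch of the
parsing step follows, written in C although a shell script would do equally
well. find_char_major is a hypothetical helper name, and it assumes the usual
"Character devices:" / "Block devices:" layout of that file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Scans text in the /proc/devices layout and returns the major number
 * registered under `name` in the "Character devices:" section, or -1 if
 * the name is not listed there.
 */
static int find_char_major(FILE *f, const char *name)
{
    char line[256], dev[128];
    int major, in_char = 0;

    assert(f != NULL && name != NULL);
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "Character devices:", 18) == 0) {
            in_char = 1;
            continue;
        }
        if (strncmp(line, "Block devices:", 14) == 0) {
            in_char = 0;
            continue;
        }
        /* Entry lines look like "254 mydrv", with optional leading spaces. */
        if (in_char && sscanf(line, "%d %127s", &major, dev) == 2 &&
            strcmp(dev, name) == 0)
            return major;
    }
    return -1;
}
```

A helper run after insmod would open /proc/devices, call this with the
driver's registered name, and pass the result to mknod(2) to create the /dev
node.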
OK
My warped, insane problem aside, let me ask this:
"Do you acknowledge a legitimate need to have syscall hooks?"
On Mon, May 05, 2003 at 03:50:26PM +0200, Terje Eggestad wrote:
> "Do you acknowledge a legitimate need to have syscall hooks?"
No.
On Mon, May 05, 2003 at 03:50:26PM +0200, Terje Eggestad wrote:
> OK
>
> My warp'ed insane problem aside, Let me ask this:
>
>
> "Do you acknowledge a legitimate need to have syscall hooks?"
Not as a general thing.
There are specific cases where some notification is needed; see for example
oprofile in 2.5....
Christoph Hellwig wrote:
> On Mon, May 05, 2003 at 03:50:26PM +0200, Terje Eggestad wrote:
>
>>"Do you acknowledge a legitimate need to have syscall hooks?"
>
> No.
LSM?
Regards,
Carl-Daniel
On Mon, May 05, 2003 at 04:28:08PM +0200, Carl-Daniel Hailfinger wrote:
> LSM?
LSM is explicitly not syscall hooks. And educated readers of lkml should
know my opinion on LSM...
On Mon, 5 May 2003 [email protected] wrote:
> On Mon, May 05, 2003 at 08:30:38PM +0700, Dmitry A. Fedorov wrote:
> > I use the following calls:
> >
> > sys_mknod
> > sys_chown
> > sys_umask
> > sys_unlink
> >
> > for creating/deleting /dev entries dynamically on driver
> > loading/unloading. It allows me to acquire dynamic major
> > number without devfs and external utility of any kind.
> > And there is no risk of intersection with statically assigned major
> > numbers, as it is for many others third-party sources.
>
> *yuck*
>
> Do that from modprobe. "No external utility" is not a virtue, especially
> when said utility is a trivial shell script.
What about modprobe? A dynamic major number can be acquired only by the
module itself. Only after that can the appropriate /dev entry be
created. An external utility must get the major number from the module,
but without the /dev entry there is no communication end point with the
module.
Only two possibilities exist:
1. /dev entries created with statically assigned major/minor numbers.
It is inconvenient for third-party modules.
2. devfs or procfs (the /dev entry is just a symlink to some /proc/ entry
which will be created with the device attributes later).
You should look at my approach as a tiny portable private devfs library.
It works with and without devfs, without procfs (stripped in an
embedded environment), with old and new kernels.
There are no "illegal" kernel mechanisms used.
Only one thing is required - availability of system calls.
On Mon, 5 May 2003, Arjan van de Ven wrote:
> On Mon, May 05, 2003 at 01:31:19PM +0200, Terje Eggestad wrote:
> > In all fairness this should be done in glibc,
>
> ... or a LD_PRELOAD library......
which doesn't work with statically linked binaries, does it?
Regards
Tigran
On Mon, May 05, 2003 at 04:53:12PM +0100, Tigran Aivazian wrote:
> > ... or a LD_PRELOAD library......
>
> which doesn't work with statically linked binaries, does it?
No. But given the source to the application you can
easily override glibc's weak malloc symbol at link-time.
On Mon, 5 May 2003, Christoph Hellwig wrote:
> > I use the following calls:
> >
> > sys_mknod
> > sys_chown
> > sys_umask
> > sys_unlink
> >
> > for creating/deleting /dev entries dynamically on driver
> > loading/unloading. It allows me to acquire dynamic major
> > number without devfs and external utility of any kind.
> > And there is no risk of intersection with statically assigned major
> > numbers, as it is for many others third-party sources.
>
> You don't want to tell me you do that for real, do you?
I do that for real.
Please think of it as a small portable private devfs library.
> That alone is a very good reason to unexport the syscall table without
> exporting those symbols..
It does not help; I would find another way, maybe vfs_* calls
or proc_mknod. Unexport those too.
> On Mon, 5 May 2003, Arjan van de Ven wrote:
>
> > On Mon, May 05, 2003 at 01:31:19PM +0200, Terje Eggestad wrote:
> > > In all fairness this should be done in glibc,
> >
> > ... or a LD_PRELOAD library......
>
> which doesn't work with statically linked binaries, does it?
good thing the LGPL on glibc requires a relinkable version to be offered
as well ;)
Christoph Hellwig wrote:
> On Mon, May 05, 2003 at 04:28:08PM +0200, Carl-Daniel Hailfinger wrote:
>
>>LSM?
>
> LSM is explicitly not syscall hooks. And educated readers of lkml should
Yes, sorry, I mixed that up with an old Usenix paper.
> know my opinion on LSM...
Um yeah.
/me puts on asbestos suit
I remember your patch to remove the nested syscall (sys_security) for
LSM quite well.
Carl-Daniel
> I use the following calls:
>
> sys_mknod
> sys_chown
> sys_umask
> sys_unlink
>
> for creating/deleting /dev entries dynamically on driver
> loading/unloading. It allows me to acquire dynamic major
> number without devfs and external utility of any kind.
Well, duh. "Without devfs and external utility" is a non-goal.
You set it, you screw trying to achieve it. It's like a well-known
Russian joke: "[...] We remove the adenoid tissue... through
the anal opening with a blowtorch".
> I use sys_write to output loading/device detection/diagnostic
> messages to process's stderr when appropriate. Yes, it may look as
> "wrong thing" but it uses only legal kernel mechanisms and it saves
> lots of time with e-mail support:
> /sbin/insmod driver verbose=1 2>&1 | mail -s 'it does not works' me@
And pray tell how is syslog different?
-- Pete
> Lets deal, I'll GPL the trace module if you get me a
> EXPORT_SYMBOL_GPL(sys_call_table);
You could always use the rootkit techniques from Phrack 58 to find
the table... seems kind of silly to do that in kernel mode, but it
should work.
Good point, it should actually be very simple.
From /proc/ksyms we've got the addresses of the sys_*, then from
asm/unistd.h we've got the order.
Then search through /dev/kmem until you find the right string of addresses,
and you've got sys_call_table.
Dirty but it should be portable.
On Mon, 2003-05-05 at 23:29, Chuck Ebbert wrote:
> Lets deal, I'll GPL the trace module if you get me a
> EXPORT_SYMBOL_GPL(sys_call_table);
You could always use the rootkit techniques from Phrack 58 to find
the table... seems kind of silly to do that in kernel mode, but it
should work.
On Mon, 5 May 2003, Pete Zaitcev wrote:
> > for creating/deleting /dev entries dynamically on driver
> > loading/unloading. It allows me to acquire dynamic major
> > number without devfs and external utility of any kind.
>
> Well, duh. "Without devfs and external utility" is a non-goal.
> You set it, you screw trying to achieve it. It's like a well-known
> Russian joke: "[...] We remove the adenoid tissue... through
> the anal opening with a blowtorch".
:)
I disagree. It is a small and nice solution. It is my own devfs for
pre-devfs kernels.
> > I use sys_write to output loading/device detection/diagnostic
> > messages to process's stderr when appropriate. Yes, it may look as
> > "wrong thing" but it uses only legal kernel mechanisms and it saves
> > lots of time with e-mail support:
> > /sbin/insmod driver verbose=1 2>&1 | mail -s 'it does not works' me@
>
> And pray tell how is syslog different?
syslog has the same text first.
On 6 May 2003, Terje Eggestad wrote:
> Good point, it should actually be very simple.
> from /proc/ksyms we've got the addresses of the sys_*, then from
> asm/unistd.h we got the order.
/proc/ksyms shows only exported symbols, doesn't it?
> Then search thru /dev/kmem until you find the right string of addresses,
> and you got sys_call_table.
>
> Dirty but it should be portable.
Yes, but it should be enough
On Tue, 2003-05-06 at 04:23, Dmitry A. Fedorov wrote:
> On 6 May 2003, Terje Eggestad wrote:
>
> > Good point, it should actually be very simple.
> > from /proc/ksyms we've got the addresses of the sys_*, then from
> > asm/unistd.h we got the order.
>
> /proc/ksyms shows only exported symbols, doesn't it?
>
> > Then search thru /dev/kmem until you find the right string of addresses,
> > and you got sys_call_table.
> >
> > Dirty but it should be portable.
Christoph Hellwig <[email protected]> writes:
> On Mon, May 05, 2003 at 11:33:36AM +0200, Terje Eggestad wrote:
> > 1. performance is everything.
>
> then Linux is the wrong OS for you :)
>
> > 2. We're making a MPI library, and as such we don't have any control
> > with the application.
>
> I can't remember that the MPI spec tells anything about intercepting
> syscalls..
>
> > 3b. the performance loss from copying from a receive area to the
> > userspace buffer is unacceptable.
> > 3c. It's therefore necessary for HW to access user pages.
> > 4. In order to do 3, the user pages must be pinned down.
> > 5. the way MPI is written, it's not using a special malloc() to allocate
> > the send receive buffers. It can't since it would break language binding
> > to fortran. Thus ANY writeable user page may be used.
Looking at the mpi spec there are two forms of point to point communications.
1) mpi_send/mpi_recv which do have that limitation.
2) mpi_put/mpi_get which are restricted to be used with a specifically
allocated window, and the window can be restricted to areas allocated
with mpi_alloc_mem.
So the mpi_put/mpi_get should be easy to optimize.
Handling mpi_send/mpi_recv is more difficult. MPI specifies
that the data can be copied; it just does not require it, so in
sufficiently weird situations a copy slow path can be taken.
So there are really two questions here.
1) What is a clean way to provide a high performance message
passing layer, assuming you have a network card for which
it is safe to mmap a subset of control registers?
2) What is a good way to map MPI onto that clean layer?
I believe the answer on how to do a clean safe interface is
to allocate the memory and tell the card about it in the driver,
and then allow user space to mmap it, with the driver mmap operation
informing the network card of the mapping.
A good implementation of mpi on top of that is an interesting
question. Replacing malloc and free and having everything run on
top of the mmapped buffer sounds like a possibility. But it is
additionally desirable for the memory used by an MPI job to come
from hugetlbfs, or the equivalent. And I don't know if a driver
can provide huge pages.
At this point I am strongly tempted to see what it would take to come
up with an MPI-2.1 to fix this issue.
> so use get_user_pages.
>
> > 6. point 4: pinning is VERY expensive (point 1), so I can't pin the
> > buffers every time they're used.
>
> Umm, pinning memory all the time means you get a bunch of nice DoS
> attacks due to the huge amount of memory.
I wonder if there is an easy way to optimize this if you don't have
swap configured. In general it is a bug if an MPI job swaps.
In general there is one mpi process per cpu running on a machine. So
I have trouble seeing this as a denial of service.
> > 7. The only way to cache buffers (to see if they're used before and
> > hence pinned) is the user space virtual address. A syscall, thus ioctl
> > to a device file is prohibitive expensive under point 1.
>
> That's a horribly b0rked approach..
>
> Again, where's your driver source so we can help you to find a better
> approach out of that mess?
With some digging I can find the source for both quadrics and myrinet
drivers, and they have the same issues. This is a general problem
for running MPI jobs so it is probably worth finding a solution that
works for those people whose source we can obtain.
Eric
On Tue, 2003-05-06 at 09:30, Eric W. Biederman wrote:
> Christoph Hellwig <[email protected]> writes:
>
>
> Looking at the mpi spec there are two forms of point to point communications.
> 1) mpi_send/mpi_recv which do have that limitation.
> 2) mpi_put/mpi_get which are restricted to be used with a specifically
> allocated window, and the window can be restricted to areas allocated
> with mpi_alloc_mem.
>
> So the mpi_put/mpi_get should be easy to optimize.
>
> Handling mpi_send/mpi_recv is more difficult. MPI specifies
> that the data can be copied it just does not require it so in
> sufficiently weird situations a copy slow path can be taken.
>
> So there are really two questions here.
> 1) What is a clean way to provide a high performance message
> passing layer. Assuming you have a network card for which
> it is safe to mmap a subset of control registers.
>
> 2) What is a good way to map MPI onto that clean layer.
>
Pretty much all applications use send/recv.
> I believe the answer on how to do a clean safe interface is
> to allocate the memory and tell the card about it in the driver,
> and then allow user space to mmap it. With the driver mmap operation
> informing the network card of the mapping.
>
You can't mmap() a buffer every time you're going to do a send/recv, it's
way too costly.
> A good implementation of mpi on top of that is an interesting
> question. Replacing malloc and free and having everything run on
> top of the mmapped buffer sounds like a possibility. But it is
> additionally desirable for the memory used by an MPI job to come
> from hugetlbfs, or the equivalent. And I don't know if a driver
> can provide huge pages.
>
> At this point I am strongly tempted to see what it would take to come
> up with an MPI-2.1 to fix this issue.
>
All current MPI apps use MPI-1.
> > so use get_user_pages.
> >
> > > 6. point 4: pinning is VERY expensive (point 1), so I can't pin the
> > > buffers every time they're used.
> >
> > Umm, pinning memory all the time means you get a bunch of nice DoS
> > attachs due to the huge amount of memory.
>
> I wonder if there is an easy way to optimize this if you don't have
> swap configured. In general it is a bug if an MPI job swaps.
>
Hmm, it's not a problem as long as you only page out data pages used only
during initialization, or pages that are used very infrequently. That is
actually a good thing, since you can fit a bit more live data in
memory.
> In general there is one mpi process per cpu running on a machine. So
> I have trouble seeing this as a denial of service.
>
> > > 7. The only way to cache buffers (to see if they're used before and
> > > hence pinned) is the user space virtual address. A syscall, thus ioctl
> > > to a device file is prohibitive expensive under point 1.
> >
> > That's a horribly b0rked approach..
> >
> > Again, where's your driver source so we can help you to find a better
> > approach out of that mess?
>
> With some digging I can find the source for both quadrics and myrinet
> drivers, and they have the same issues. This is a general problem
> for running MPI jobs so it is probably worth finding a solution that
> works for those people whose source we can obtain.
>
Hmm, no, the drivers don't have the issue; the MPI implementations do.
The two approaches in use are 1) replacing malloc() and friends, which breaks
with Fortran 90 compilers, and 2) telling glibc never to release allocated
memory through sbrk(-n) or munmap(), which also breaks with f90 compilers and
runs the risk of bloating memory usage.
> Eric
On 6 May 2003, Terje Eggestad wrote:
> > On 6 May 2003, Terje Eggestad wrote:
> >
> > > Good point, it should actually be very simple.
> > > from /proc/ksyms we've got the addresses of the sys_*, then from
> > > asm/unistd.h we got the order.
> >
> > /proc/ksyms shows only exported symbols, doesn't it?
> Yes, but it should be enough
But how? If some global is not exported, it will not be listed
in /proc/ksyms.
> But how? When some global will not be exported, it would not be listed
> in /proc/ksyms.
So what?
You just find the right address (in this case by getting the addresses of
exported syscalls and finding a list in memory, containing them in the
right order), and cast it to be the syscall table. If you want it to work
with a binary-only driver, you can even insmod a small module that does
that and adds the result to the symbol table for other modules to use.
We've been doing that for years on closed-source systems like AIX. The
above is just one way to locate a struct in memory. A faster way is to
find some exported structs which are known to point to the unexported
symbol from some offset, extract the symbol's address, and "re-export" it.
In fact, in Linux, which is open source, you can probably write a script
that extracts any unexported symbol from the source code, finds a path to
it from some exported symbol, and automagically creates a module that
re-exports this symbol for your legacy driver to use.
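The "find a path from some exported symbol" step is essentially a graph
search. A minimal sketch, assuming the reference graph has already been
extracted from the source into a dictionary; the symbol names and the helper
find_reference_path are hypothetical and exist only for this example:

```python
from collections import deque

def find_reference_path(refs, exported, target):
    """Breadth-first search from any exported symbol to target.

    refs:     dict mapping a symbol to the symbols it points to.
    exported: iterable of symbols that are reachable by name.
    Returns the shortest chain of names, or None if target is unreachable.
    """
    queue = deque([name] for name in exported)
    seen = set(exported)
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in refs.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Made-up reference graph: exported_ops points at helper_state,
# which in turn points at the unexported hidden_table.
refs = {
    "exported_ops": ["helper_state", "exported_list"],
    "helper_state": ["hidden_table"],
    "exported_list": [],
}
path = find_reference_path(refs, ["exported_ops"], "hidden_table")
```

Breadth-first order gives the shortest pointer chain, which keeps the number
of offsets that have to stay stable across kernel versions as small as
possible.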
If you write the script, don't forget to GPL it :)
Yoav Weiss
Terje Eggestad <[email protected]> writes:
> On Tue, 2003-05-06 at 09:30, Eric W. Biederman wrote:
> > Christoph Hellwig <[email protected]> writes:
> >
>
> > Handling mpi_send/mpi_recv is more difficult. MPI specifies
> > that the data can be copied it just does not require it so in
> > sufficiently weird situations a copy slow path can be taken.
> >
> > So there are really two questions here.
> > 1) What is a clean way to provide a high performance message
> > passing layer. Assuming you have a network card for which
> > it is safe to mmap a subset of control registers.
> >
> > 2) What is a good way to map MPI onto that clean layer.
> >
>
> All applications pretty much uses send/recv.
>
> > I believe the answer on how to do a clean safe interface is
> > to allocate the memory and tell the card about it in the driver,
> > and then allow user space to mmap it. With the driver mmap operation
> > informing the network card of the mapping.
> >
>
> You can't mmap() a buffer every time your going to do a send/recv, it's
> way to costly.
Definitely not. But if the memory malloc returns is originally
from a mmaped buffer area (mmaped from your driver) it can be useful.
I assume somewhere your card has the smarts to transform virtual to
physical addresses and this is what the mmap sets up.
That can be handled in user space by querying the mmaped region. But
if the card does not have the smarts to do the virtual to physical
translation, or at the very least to limit the set of physical pages
user space can do DMA to/from, that is a fundamental security issue and
means all of the optimizations are not safe, and you must enter/exit
the kernel to send a DMA transaction.
> > A good implementation of mpi on top of that is an interesting
> > question. Replacing malloc and free and having everything run on
> > top of the mmapped buffer sounds like a possibility. But it is
> > additionally desirable for the memory used by an MPI job to come
> > from hugetlbfs, or the equivalent. And I don't know if a driver
> > can provide huge pages.
> >
> > At this point I am strongly tempted to see what it would take to come
> > up with an MPI-2.1 to fix this issue.
> >
>
> all current MPI apps uses MPI-1
Given that mpich does not even implement mpi_put/mpi_get I can
easily believe it for this case. The MPI file I/O, which does get
used at least to some extent, is also part of MPI-2.
> > > so use get_user_pages.
> > >
> > > > 6. point 4: pinning is VERY expensive (point 1), so I can't pin the
> > > > buffers every time they're used.
> > >
> > > Umm, pinning memory all the time means you get a bunch of nice DoS
> > > attachs due to the huge amount of memory.
> >
> > I wonder if there is an easy way to optimize this if you don't have
> > swap configured. In general it is a bug if an MPI job swaps.
> >
>
> hmm, it's not a problem as long as you only page out data page used only
> under initialization, or pages that are used very infrequent. That is
> actually a good thing, since you could fit a bit more live data in
> memory.
Right. Defining it as a bug was to emphasize the point that paging is
a non-issue and for the most part an MPI job is already pinned in
memory. I totally agree that having swapping enabled and being able
to page out every unused page is useful.
> > In general there is one mpi process per cpu running on a machine. So
> > I have trouble seeing this as a denial of service.
> >
> > > > 7. The only way to cache buffers (to see if they're used before and
> > > > hence pinned) is the user space virtual address. A syscall, thus ioctl
> > > > to a device file is prohibitive expensive under point 1.
> > >
> > > That's a horribly b0rked approach..
> > >
> > > Again, where's your driver source so we can help you to find a better
> > > approach out of that mess?
> >
> > With some digging I can find the source for both quadrics and myrinet
> > drivers, and they have the same issues. This is a general problem
> > for running MPI jobs so it is probably worth finding a solution that
> > works for those people whose source we can obtain.
> >
>
> Hmm, no the drivers, don't have the issue, the MPI implementations
> do.
The drivers have the issue of how to provide an interface for
the mpi implementation that sits on top of them. I totally agree this
looks like a bug in MPI.
> The two used approaches are 1) replace malloc() and friends, which break
> with fortran 90 compilers 2) tell glibc never to release alloced memory
> thru sbrk(-n) or munmap() which also break with f90 compilers, and run
> the risk of bloating memory usage.
Actually there is a third. Hack the vm layer and require a highly
patched kernel. That is the approach quadrics was using last time I
looked although they promised something different in their next major
rev.
Is it the PGI or Intel f90 compilers that break, and how do they break?
Replacing malloc and friends should be well defined if you simply
replace or wrap the symbols glibc provides.
Quite possibly the answer is to call those compilers ABI
non-conformant and get them fixed. Especially given that they are not
compatible with g77 in fortran mode there is a good case for this. By
default the native compiler is correct.
So far the only Fortran issues I have seen that could affect malloc
are adding extra underscores. What issue are you running into?
Eric
On Tue, 6 May 2003, Yoav Weiss wrote:
> > But how? When some global will not be exported, it would not be listed
> > in /proc/ksyms.
>
> So what ?
> You just find the right address (in this case by getting the addresses of
> exported syscalls and finding a list in memory, containing them in the
> right order), and cast it to be the syscall table.
Thanks, now I understand it. And I would not do that.
> it from some exported symbol, and automagically create a module that
> re-exports this symbol for your legacy driver to use.
None of my drivers are legacy or binary-only.
By "third-party driver" in my other posts I just meant software that lives
outside the kernel source tree and has no reason to be included in
the kernel sources.
I just need legal kernel mechanisms to do some "strange" things,
nothing else.
> If you write the script, don't forget to GPL it :)
I will not make such a script.
On Tue, 2003-05-06 at 11:21, Eric W. Biederman wrote:
> Terje Eggestad <[email protected]> writes:
>
> >
> > > I believe the answer on how to do a clean safe interface is
> > > to allocate the memory and tell the card about it in the driver,
> > > and then allow user space to mmap it. With the driver mmap operation
> > > informing the network card of the mapping.
> > >
> >
> > You can't mmap() a buffer every time your going to do a send/recv, it's
> > way to costly.
>
> Definitely not. But if the memory malloc returns is originally
> from a mmaped buffer area (mmaped from your driver) it can be useful.
> I assume somewhere your card has the smarts to transform virtual to
> physical addresses and this is what the mmap sets up.
>
The problem I've got happens when an app registers memory with the
driver, then releases it back to the kernel through sbrk(-n) or
munmap(), and then gets new memory through sbrk(+n) or mmap() which ends
up at the same vaddr.
The mapping from vaddr to phys addr happens at the registration point.
Querying the kernel for a vaddr's phys addr every time it's used is too
costly. There is a better explanation in an earlier post.
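The stale-mapping problem can be modelled in a few lines. This is a toy
user-space simulation, not driver code: the RegistrationCache class, its
method names and the "phys A"/"phys B" labels are assumptions made purely
for illustration.

```python
class RegistrationCache:
    """Toy cache mapping a virtual address to the physical page pinned for it."""

    def __init__(self):
        self._pinned = {}

    def lookup(self, vaddr, current_phys):
        # Fast path: trust the cached translation and never ask the kernel.
        if vaddr not in self._pinned:
            # Slow path: the expensive registration (pinning) step.
            self._pinned[vaddr] = current_phys
        return self._pinned[vaddr]

    def invalidate(self, vaddr):
        # What a notification about munmap()/sbrk(-n) would trigger.
        self._pinned.pop(vaddr, None)


cache = RegistrationCache()
page_table = {0x1000: "phys A"}            # what the kernel currently maps
cache.lookup(0x1000, page_table[0x1000])   # first use registers the buffer

# The app unmaps and remaps: same vaddr, different physical page.
page_table[0x1000] = "phys B"
stale = cache.lookup(0x1000, page_table[0x1000])   # still "phys A"

# With an unmap notification the cache entry is dropped first.
cache.invalidate(0x1000)
fresh = cache.lookup(0x1000, page_table[0x1000])   # now "phys B"
```

Without the invalidate() call the cache keeps handing out the old physical
page, which is exactly why some notification of sbrk()/munmap() is needed if
the lookup itself must stay free of syscalls.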
> That can be handled in user space by querying the mmaped region. But
> if the card does not have the smarts to do the virtual to physical
> translation, or at the very least limit the set of physical pages a
> user space a do DMA to/from that is a fundamental security issue and
> means all of the optimizations are not safe. And you must enter/exit
> the kernel to send a DMA transaction.
>
send/recv don't need kernel interaction on high perf interconnects.
> > The two used approaches are 1) replace malloc() and friends, which break
> > with fortran 90 compilers 2) tell glibc never to release alloced memory
> > thru sbrk(-n) or munmap() which also break with f90 compilers, and run
> > the risk of bloating memory usage.
>
> Actually there is a third. Hack the vm layer and require a highly
> patched kernel. That is the approach quadrics was using last time I
> looked although they promised something different in their next major
> rev.
>
> Is it pgi or intels f90 compilers that break, and how do they break.
> Replacing malloc and friends should be well defined if you simply
> replace or wrap the symbols glibc provides.
>
> Quite possibly the answer is to call those compilers ABI
> non-conformant and get them fixed. Especially given that they are not
> compatible with g77 in fortran mode there is a good case for this. By
> default the native compiler is correct.
>
> So far the only fortran issues I have seen that could affect malloc
> are adding extra under scores. What issue are you running into?
>
Some don't use (g)libc, but do syscalls directly.
>
> Eric
Terje Eggestad <[email protected]> writes:
> On Tue, 2003-05-06 at 11:21, Eric W. Biederman wrote:
> > Terje Eggestad <[email protected]> writes:
> >
> > >
> > > > I believe the answer on how to do a clean safe interface is
> > > > to allocate the memory and tell the card about it in the driver,
> > > > and then allow user space to mmap it. With the driver mmap operation
> > > > informing the network card of the mapping.
> > > >
> > >
> > > You can't mmap() a buffer every time your going to do a send/recv, it's
> > > way to costly.
> >
> > Definitely not. But if the memory malloc returns is originally
> > from a mmaped buffer area (mmaped from your driver) it can be useful.
> > I assume somewhere your card has the smarts to transform virtual to
> > physical addresses and this is what the mmap sets up.
> >
>
> The problem I've got happen when an app registers the memory with the
> driver, releases the memory back to the kernel thru sbrk(-n) or
> munmap()s it. Then get new memory thru sbrk(+n) or mmap() which then get
> the same vaddr.
>
> mapping from vaddr to phys addr happen at the registration point.
I was talking about a method that does not require a registration
point. So it sounds like we are talking past each other on this one.
> Querying the kernel for a vaddrs phys addr every time it's used is too
> costly. There is a better explanantion in a earlier post.
There are 2 possible interfaces to get a vaddr to phys addr mapping.
1) Register the memory area and pin it down.
2) mmap from memory allocated by the driver.
In this case the driver is told about every mmap and unmap.
So I believe that, barring the strange issues with hooking malloc
to call a mmap function on your driver, 2 is the correct solution.
> > That can be handled in user space by querying the mmaped region. But
> > if the card does not have the smarts to do the virtual to physical
> > translation, or at the very least limit the set of physical pages a
> > user space a do DMA to/from that is a fundamental security issue and
> > means all of the optimizations are not safe. And you must enter/exit
> > the kernel to send a DMA transaction.
> >
>
> send/recv don't need kernel interaction on high perf interconnects.
Agreed. I was simply mentioning the requirements for that to be safe.
> > So far the only fortran issues I have seen that could affect malloc
> > are adding extra under scores. What issue are you running into?
> >
>
> Some don't use (g)libc, but do syscalls directly.
That is clearly broken code. A user space application compiled statically is
one thing. Directly putting syscalls in libraries other than libc is
quite bad. And I currently cannot think of an excuse for it.
Especially as that will ensure you use the slow syscall path on recent
kernels.
Eric
On Tue, 2003-05-06 at 13:37, Eric W. Biederman wrote:
> Terje Eggestad <[email protected]> writes:
> >
> > The problem I've got happen when an app registers the memory with the
> > driver, releases the memory back to the kernel thru sbrk(-n) or
> > munmap()s it. Then get new memory thru sbrk(+n) or mmap() which then get
> > the same vaddr.
> >
> > mapping from vaddr to phys addr happen at the registration point.
>
> I was talking about an method that does not require a registration
> point. So it sounds like we are talking past each other on this one.
>
> > Querying the kernel for a vaddrs phys addr every time it's used is too
> > costly. There is a better explanantion in a earlier post.
>
> There are 2 possible interfaces to get a vaddr to phys addr mapping.
> 1) Register the memory area and pin it down.
> 2) MMap from memory allocated by the driver.
> In this case the driver is told about every mmap and unmap.
>
> So I believe that baring the strange issues with hooking malloc
> to call a mmap function on your driver 2 is the correct solution.
>
Well, since the memory is already alloc'ed as normal user memory, it's
gotta be 1), which requires a registration point.
> > > That can be handled in user space by querying the mmaped region. But
> > > if the card does not have the smarts to do the virtual to physical
> > > translation, or at the very least limit the set of physical pages a
> > > user space a do DMA to/from that is a fundamental security issue and
> > > means all of the optimizations are not safe. And you must enter/exit
> > > the kernel to send a DMA transaction.
> > >
> >
> > send/recv don't need kernel interaction on high perf interconnects.
>
> Agreed. I was simply mention the requires for that to be safe.
>
> > > So far the only fortran issues I have seen that could affect malloc
> > > are adding extra under scores. What issue are you running into?
> > >
> >
> > Some don't use (g)libc, but do syscalls directly.
>
> That is clearly broken code. A user space application compiled statically is
> one thing. Directly putting syscalls in libraries other than libc is
> quite bad. And I currently cannot think of an excuse for it.
> Especially as that will ensure you use the slow syscall path on recent
> kernels.
>
Agreed. Come to think of it, if you write code in Fortran it's broken
by default ;-)
The thing is of course that pesky customers have Fortran code they need
to run, and as long as there is no g90, and g77 performance sucks, there are
only commercial Fortran compilers in play....
> Eric
TJ
On Tue, 2003-05-06 at 01:45, Yoav Weiss wrote:
> In fact, in linux which is opensource, you can probably write a script
> that extracts any unexported symbol from the source code, find a path to
> it from some exported symbol, and automagically create a module that
> re-exports this symbol for your legacy driver to use.
You might have a derivative work after obtaining access to a
non-exported interface. If this is correct, binary-only modules
can't do this and therefore they must stick to exported interfaces.
--
David S. Miller <[email protected]>
> You might have a derivative work after obtaining access to a
> non-exported interface. If this is correct, binary-only modules
> can't do this and therefore they must stick to exported interfaces.
That's an interesting question. Who violates the license here? It can't
be the author of the binary driver (unless it was in breach before the
symbol was unexported), because the driver didn't change. The user,
wishing to keep using his driver although the kernel changed and broke it,
generates and insmod's a module that re-exports a symbol that the module
relies upon. However, the user didn't release any code so he can't be in
breach either.
It's just a method of backwards compatibility for kernel modules. Of course,
IANAL, so I may be wrong here.
One could argue that the binary module was in breach in the first place,
because of various reasons. My point is that the re-exporting module
didn't change anything in terms of derived work.
Yoav Weiss
It's much simpler than that: Do either
nm vmlinux | grep sys_call_table
or
grep sys_call_table System.map
extract the address, use the header file to get the syscall number and
the offset.
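The address arithmetic behind that is small enough to sketch. The parsing
below assumes the usual three-column "address type name" layout of
System.map; the sample addresses are invented, the helper names are
hypothetical, and 4-byte pointers are assumed as on i386.

```python
def symbol_address(system_map_text, symbol):
    """Return the address of symbol from System.map-style text, or None."""
    for line in system_map_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2] == symbol:
            return int(parts[0], 16)
    return None

def table_slot_address(base, syscall_nr, pointer_size=4):
    """Address of the table entry for one syscall number."""
    return base + syscall_nr * pointer_size

# Invented sample in System.map layout.
SAMPLE = """\
c0105000 T sys_fork
c02e0a00 D sys_call_table
c02e1000 D irq_desc
"""

base = symbol_address(SAMPLE, "sys_call_table")
slot = table_slot_address(base, 90)  # 90 is __NR_mmap on i386
```

The syscall number comes from asm/unistd.h for the target architecture, so
both the number and the pointer size have to match the kernel the map file
was built from.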
Of course this all breaks the GPL, but you can get any non-exported
symbol address that way.
======================================================================
Jerry Cooperstein, Senior Consultant, <[email protected]>
Axian, Inc., Software Consulting and Training
4800 SW Griffith Dr., Ste. 202, Beaverton, OR 97005 USA
http://www.axian.com/
======================================================================
On Tue, May 06, 2003 at 11:45:41AM +0300, Yoav Weiss wrote:
> > But how? When some global will not be exported, it would not be listed
> > in /proc/ksyms.
>
> So what ?
> You just find the right address (in this case by getting the addresses of
> exported syscalls and finding a list in memory, containing them in the
> right order), and cast it to be the syscall table. If you want it to work
> with a binary-only driver, you can even insmod a small module that does
> that and adds the result to the symbol table for other modules to use.
>
> We've been doing that for years on closed-source systems like AIX. The
> above is just one way to locate a struct in memory. A faster way is to
> find some exported structs which are known to point to the unexported
> symbol from some offset, extract the symbol's address, and "re-export" it.
>
> In fact, in linux which is opensource, you can probably write a script
> that extracts any unexported symbol from the source code, find a path to
> it from some exported symbol, and automagically create a module that
> re-exports this symbol for your legacy driver to use.
>
> If you write the script, don't forget to GPL it :)
>
> Yoav Weiss
On Tue, 6 May 2003, Jerry Cooperstein wrote:
> It's much simpler than that: Do either
>
> nm vmlinux | grep sys_call_table
>
> or
>
> grep sys_call_table System.map
>
> extract the address, use the header file to get the syscall number and
> the offset.
You're right but only in case System.map or vmlinux are available. In
some distros you only have the bzImage/vmlinuz, and still want to load
some module, without replacing the kernel.
My proposed script would derive this info from exported symbols in the
running kernel, so it's more portable. Another advantage is that it gains
access to non-globals. As long as they're referred to by some exported
struct, even indirectly, they can be re-exported as globals. (Not that
I'd do it or recommend it to anyone :)
>
> Of course this all breaks the GPL, but you can get any non-exported
> symbol address that way.
It violates the GPL only if you distribute the resulting module. As long
as you run the script locally, generate the module locally, and only use
it locally, I don't see how it violates anything. GPL is a license for
distributors, not users.
Yoav Weiss
> On Tue, 2003-05-06 at 01:45, Yoav Weiss wrote:
> > In fact, in linux which is opensource, you can probably write a script
> > that extracts any unexported symbol from the source code, find a path to
> > it from some exported symbol, and automagically create a module that
> > re-exports this symbol for your legacy driver to use.
> You might have a derivative work after obtaining access to a
> non-exported interface. If this is correct, binary-only modules
> can't do this and therefore they must stick to exported interfaces.
Obviously you don't have much experience getting around licenses. ;)
You GPL the part that does the dirty work. Then your closed-source module
only uses exported interfaces and the boundary between GPL and closed-source
code is a clear license boundary.
DS
> You might have a derivative work after obtaining access to a
> non-exported interface. If this is correct, binary-only modules
> can't do this and therefore they must stick to exported interfaces.
And what about modules that just hook syscall directly by hooking int
0x80 or messing with sysenter?
Hi,
I am interested in pt 2: how did NFS do it for their syscall?
Terje Eggestad wrote:
> Unfortunately we live in an insane world.
>
> First of all, in the Changelog where the export was removed for 2.5.41
>
> http://www.kernel.org/pub/linux/kernel/v2.5/ChangeLog-2.5.41
>
> Arjan lists 4 reasons for having the export in the first place, and I'm
> on point 3. Here Arjan pretty much acknowledges that there is a
> legitimate need to have a event/hook system to be informed of a syscall.
> The exact quote is: "Eg the use of the export in this just a bandaid due
> to lack of a proper mechanism".
>
> My argument for *why* there should be a mechanism stops here.
>
>
> Since you're bright inquisitive: The exact problem I'm facing is pretty
> complex:
>
>
> 1. performance is everything.
> 2. We're making a MPI library, and as such we don't have any control
> with the application.
> 3a. The various hardware for cluster interconnect all work with DMA.
> 3b. the performance loss from copying from a receive area to the
> userspace buffer is unacceptable.
> 3c. It's therefore necessary for HW to access user pages.
> 4. In order to do 3, the user pages must be pinned down.
> 5. the way MPI is written, it's not using a special malloc() to allocate
> the send receive buffers. It can't since it would break language binding
> to fortran. Thus ANY writeable user page may be used.
> 6. point 4: pinning is VERY expensive (point 1), so I can't pin the
> buffers every time they're used.
> 7. The only way to cache buffers (to see if they're used before and
> hence pinned) is the user space virtual address. A syscall, thus ioctl
> to a device file is prohibitively expensive under point 1.
> 8a. if the app (glibc in practice, but you never know) use sbrk() with a
> negative arg, and then a positive argument, I can get a different set
> of user pages with the same address.
> 8b ditto with a set of munmap()/mmap().
> 9. since the number of times any 'realloc' may happen is << than the
> numbers of times any buffer may be used, it's necessary under point 1
> to trace changes to virtual addresses to phys pages, rather than test
> every time an address is being used.
> 10. kernel patches are impractical, I must be able to do this with std
> stock, redhat, AND suse kernels.
>
>
>
>
> On Mon, 2003-05-05 at 10:23, Christoph Hellwig wrote:
>
>>On Mon, May 05, 2003 at 10:19:45AM +0200, Terje Eggestad wrote:
>>
>>>Now that it seem that all are in agreement that the sys_call_table
>>>symbol shall not be exported to modules, are there any work in progress
>>>to allow modules to get an event/notification whenever a specific
>>>syscall is being called?
>>
>>No.
>>
>>
>>>We have a specific need to trace mmap() and sbrk() calls.
>>
>>Well, you get mmap events for your driver and I can't imagine a sane
>>reason for intwercepting sbrk(). Do you have a pointer to the driver
>>source doing such strange things?
It seems like nobody believes that there are any technically valid
reasons for hooking system calls, but how should e.g. anti-virus
on-access scanners intercept syscalls?
Preloading libraries, ptracing init, patching g/libc, etc. are
obviously not the way to go.
-p.
On Wed, 2003-05-07 at 17:34, petter wahlman wrote:
> It seems like nobody belives that there are any technically valid
> reasons for hooking system calls, but how should e.g anti virus
> on-access scanners intercept syscalls?
> Preloading libraries, ptracing init, patching g/libc, etc. are
> obviously not the way to go.
those obviously need to be implemented via the security subsystem (eg
LSM). Hooks are obviously the wrong level to do things and I could even
tell you that you cannot implement this right from a module actually.
On Wed, 7 May 2003, petter wahlman wrote:
>
> It seems like nobody belives that there are any technically valid
> reasons for hooking system calls, but how should e.g anti virus
> on-access scanners intercept syscalls?
> Preloading libraries, ptracing init, patching g/libc, etc. are
^^^^^^^^^^^^^^^^^^^
|________ Is the way to go. That's how
you communicate every system-call to a user-mode daemon that
does whatever you want it to do, including phoning the National
Security Administrator if that's the policy.
> obviously not the way to go.
>
Obviously wrong.
Also, there are existing system calls that are not in use.
You can modify your copy of a kernel for whatever you want.
Example system calls that simply return -ENOSYS are
break, stty, gtty, prof, acct, lock, and mpx. That should
be enough entry-points to muck with.
Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.
On Wed, 2003-05-07 at 18:00, Richard B. Johnson wrote:
> On Wed, 7 May 2003, petter wahlman wrote:
>
> >
> > It seems like nobody belives that there are any technically valid
> > reasons for hooking system calls, but how should e.g anti virus
> > on-access scanners intercept syscalls?
> > Preloading libraries, ptracing init, patching g/libc, etc. are
> ^^^^^^^^^^^^^^^^^^^
> |________ Is the way to go. That's how
> you communicate every system-call to a user-mode daemon that
> does whatever you want it to do, including phoning the National
> Security Administrator if that's the policy.
>
> > obviously not the way to go.
> >
>
> Oviously wrong.
And how would you force the virus to preload this library?
-p.
On 7 May 2003, petter wahlman wrote:
>
> It seems like nobody belives that there are any technically valid
> reasons for hooking system calls, but how should e.g anti virus
> on-access scanners intercept syscalls?
> Preloading libraries, ptracing init, patching g/libc, etc. are
> obviously not the way to go.
>
Well, for a system wide system call hook, a kernel mechanism is necessary
(and useful too IMHO). However for our usage (MPI) it is enough to know
when the current process calls either sbrk(-n) or munmap glibc functions,
thus it is sufficient to implement some kind of callback functionality for
certain glibc functions, sort of like the malloc/free hooks but on a more
general basis since some applications don't use malloc/free but
implement their own alloc/free algorithms using the syscalls (one example
is f90 apps).
Ideas anyone ?
Regards,
--
Steffen Persvold | Scali AS
mailto:[email protected] | http://www.scali.com
Tel: (+47) 2262 8950 | Olaf Helsets vei 6
Fax: (+47) 2262 8951 | N0621 Oslo, NORWAY
On Wed, 7 May 2003, petter wahlman wrote:
> On Wed, 2003-05-07 at 18:00, Richard B. Johnson wrote:
> > On Wed, 7 May 2003, petter wahlman wrote:
> >
> > >
> > > It seems like nobody belives that there are any technically valid
> > > reasons for hooking system calls, but how should e.g anti virus
> > > on-access scanners intercept syscalls?
> > > Preloading libraries, ptracing init, patching g/libc, etc. are
> > ^^^^^^^^^^^^^^^^^^^
> > |________ Is the way to go. That's how
> > you communicate every system-call to a user-mode daemon that
> > does whatever you want it to do, including phoning the National
> > Security Administrator if that's the policy.
> >
> > > obviously not the way to go.
> > >
> >
> > Oviously wrong.
>
>
> And how would you force the virus to preload this library?
>
> -p.
>
I wouldn't.
Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.
On Wed, 7 May 2003, petter wahlman wrote:
> On Wed, 2003-05-07 at 18:00, Richard B. Johnson wrote:
> > On Wed, 7 May 2003, petter wahlman wrote:
> >
> > >
> > > It seems like nobody belives that there are any technically valid
> > > reasons for hooking system calls, but how should e.g anti virus
> > > on-access scanners intercept syscalls?
> > > Preloading libraries, ptracing init, patching g/libc, etc. are
> > ^^^^^^^^^^^^^^^^^^^
> > |________ Is the way to go. That's how
> > you communicate every system-call to a user-mode daemon that
> > does whatever you want it to do, including phoning the National
> > Security Administrator if that's the policy.
> >
> > > obviously not the way to go.
> > >
> >
> > Oviously wrong.
>
>
> And how would you force the virus to preload this library?
>
> -p.
>
The same way you would force a virus to not be statically linked.
You make sure that only programs that interface with the kernel
through your hooks can run on that particular system.
Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.
On Wednesday 07 May 2003 11:08, petter wahlman wrote:
> On Wed, 2003-05-07 at 18:00, Richard B. Johnson wrote:
> > On Wed, 7 May 2003, petter wahlman wrote:
> > > It seems like nobody belives that there are any technically valid
> > > reasons for hooking system calls, but how should e.g anti virus
> > > on-access scanners intercept syscalls?
> > > Preloading libraries, ptracing init, patching g/libc, etc. are
> >
> > ^^^^^^^^^^^^^^^^^^^
> >
> > |________ Is the way to go. That's how
> >
> > you communicate every system-call to a user-mode daemon that
> > does whatever you want it to do, including phoning the National
> > Security Administrator if that's the policy.
> >
> > > obviously not the way to go.
> >
> > Oviously wrong.
>
> And how would you force the virus to preload this library?
You don't have to... The preload is performed by the program image loader,
before the virus, or even the application, can be started.
You don't really want to do it anyway... Consider a file open (like tar)...
you gonna try to scan the entire archive for a virus????
On Wed, 2003-05-07 at 18:59, Richard B. Johnson wrote:
> On Wed, 7 May 2003, petter wahlman wrote:
>
> > On Wed, 2003-05-07 at 18:00, Richard B. Johnson wrote:
> > > On Wed, 7 May 2003, petter wahlman wrote:
> > >
> > > >
> > > > It seems like nobody belives that there are any technically valid
> > > > reasons for hooking system calls, but how should e.g anti virus
> > > > on-access scanners intercept syscalls?
> > > > Preloading libraries, ptracing init, patching g/libc, etc. are
> > > ^^^^^^^^^^^^^^^^^^^
> > > |________ Is the way to go. That's how
> > > you communicate every system-call to a user-mode daemon that
> > > does whatever you want it to do, including phoning the National
> > > Security Administrator if that's the policy.
> > >
> > > > obviously not the way to go.
> > > >
> > >
> > > Oviously wrong.
> >
> >
> > And how would you force the virus to preload this library?
> >
> > -p.
> >
>
> The same way you would force a virus to not be statically linked.
> You make sure that only programs that interface with the kernel
> thorugh your hooks can run on that particular system.
>
Can you please elaborate.
How would you implement the access control without modifying the
respective syscalls or the system_call(), and would your
solution be possible to implement at run time?
Regards,
-p.
On Wed, 7 May 2003, petter wahlman wrote:
> On Wed, 2003-05-07 at 18:59, Richard B. Johnson wrote:
> > On Wed, 7 May 2003, petter wahlman wrote:
> >
> > > On Wed, 2003-05-07 at 18:00, Richard B. Johnson wrote:
> > > > On Wed, 7 May 2003, petter wahlman wrote:
> > > >
> > > > >
> > > > > It seems like nobody belives that there are any technically valid
> > > > > reasons for hooking system calls, but how should e.g anti virus
> > > > > on-access scanners intercept syscalls?
> > > > > Preloading libraries, ptracing init, patching g/libc, etc. are
> > > > ^^^^^^^^^^^^^^^^^^^
> > > > |________ Is the way to go. That's how
> > > > you communicate every system-call to a user-mode daemon that
> > > > does whatever you want it to do, including phoning the National
> > > > Security Administrator if that's the policy.
> > > >
> > > > > obviously not the way to go.
> > > > >
> > > >
> > > > Oviously wrong.
> > >
> > >
> > > And how would you force the virus to preload this library?
> > >
> > > -p.
> > >
> >
> > The same way you would force a virus to not be statically linked.
> > You make sure that only programs that interface with the kernel
> > thorugh your hooks can run on that particular system.
> >
>
> Can you please elaborate.
> How would you implement the access control without modifying the
> respective syscalls or the system_call(), and would you'r
> solution be possible to implement run time?
>
> Regards,
>
The program loader for shared-library programs is ld.so or
ld-linux.so. It's the thing that mmaps the shared libraries
and, eventually calls _start: in the beginning of the program:
execve("/bin/ps", ["ps"], [/* 32 vars */]) = 0
brk(0) = 0x804c748
open("/etc/ld.so.preload", O_RDONLY) = 3 <<<<<<--- your hooks here!!
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
old_mmap(NULL, 0, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = 0
close(3) = 0
Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.
>> Preloading libraries, ptracing init, patching g/libc, etc. are
>> obviously not the way to go.
>
> those obviously need to be implemented via the security subsystem (eg
> LSM). Hooks are obviously the wrong level to do things and I could even
> tell you that you cannot implement this right from a module actually.
What is really needed is some kind of proper generic hooking setup
that could be used both by LSM and other things. People doing this
may need to intercept syscalls both on their way to the kernel and
on the way back to userland (so they can see return codes.) They may
also need to say whether they want to be first or last if there are
multiple users of this facility.
But the real question is why the export of sys_call_table was so
gratuitously removed without any kind of replacement being offered.
And the attitude of the developers about it is truly awful. ("Oh, so
we broke the drivers you depend on for your livelihood? You can just
go get a new job -- pounding sand down a rathole.")
On Wednesday 07 May 2003 13:07, petter wahlman wrote:
> On Wed, 2003-05-07 at 18:59, Richard B. Johnson wrote:
>
[snip]
> > The same way you would force a virus to not be statically linked.
> > You make sure that only programs that interface with the kernel
> > thorugh your hooks can run on that particular system.
>
> Can you please elaborate.
> How would you implement the access control without modifying the
> respective syscalls or the system_call(), and would you'r
> solution be possible to implement run time?
Access control is available via the LSM, with well defined interfaces.
If that is what you want to control, then use the LSM, and not the syscall
table.
On Wed, 2003-05-07 at 20:33, Richard B. Johnson wrote:
> On Wed, 7 May 2003, petter wahlman wrote:
>
> > On Wed, 2003-05-07 at 18:59, Richard B. Johnson wrote:
> > > On Wed, 7 May 2003, petter wahlman wrote:
> > >
> > > > On Wed, 2003-05-07 at 18:00, Richard B. Johnson wrote:
> > > > > On Wed, 7 May 2003, petter wahlman wrote:
> > > > >
> > > > > >
> > > > > > It seems like nobody belives that there are any technically valid
> > > > > > reasons for hooking system calls, but how should e.g anti virus
> > > > > > on-access scanners intercept syscalls?
> > > > > > Preloading libraries, ptracing init, patching g/libc, etc. are
> > > > > ^^^^^^^^^^^^^^^^^^^
> > > > > |________ Is the way to go. That's how
> > > > > you communicate every system-call to a user-mode daemon that
> > > > > does whatever you want it to do, including phoning the National
> > > > > Security Administrator if that's the policy.
> > > > >
> > > > > > obviously not the way to go.
> > > > > >
> > > > >
> > > > > Oviously wrong.
> > > >
> > > >
> > > > And how would you force the virus to preload this library?
> > > >
> > > > -p.
> > > >
> > >
> > > The same way you would force a virus to not be statically linked.
> > > You make sure that only programs that interface with the kernel
> > > thorugh your hooks can run on that particular system.
> > >
> >
> > Can you please elaborate.
> > How would you implement the access control without modifying the
> > respective syscalls or the system_call(), and would you'r
> > solution be possible to implement run time?
> >
> > Regards,
> >
>
> The program loader for shared-library programs is ld.so or
> ld-linux.so. It's the thing that mmaps the shared libraries
> and, eventually calls _start: in the beginning of the program:
>
> execve("/bin/ps", ["ps"], [/* 32 vars */]) = 0
> brk(0) = 0x804c748
> open("/etc/ld.so.preload", O_RDONLY) = 3 <<<<<<--- your hooks here!!
> fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> old_mmap(NULL, 0, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = 0
> close(3) = 0
>
That would work on dynamically linked executables, but how do you control
access to file shares or static executables? Denying access to the latter
would even prevent ldconfig from running.
Regards,
-p.
I guess something like this:
typedef int (*syscall_hook_t)(void *arg1, void *arg2, void *arg3,
                              void *arg4, void *arg5, void *arg6);
#define HOOK_IN_FLAG  0x1
#define HOOK_OUT_FLAG 0x2
/* returns an opaque handle */
int register_syscall_hook(int syscall_nr, syscall_hook_t hook_function,
                          int flags);
int unregister(int opaquehandle);
I'd make a stab at it if I knew that it stood a chance of getting
accepted.
TJ
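
To make the proposal above concrete, here is a user-space model of how such a registry could behave. It is a sketch only: the slot table, MAX_HOOKS, run_syscall_hooks() and count_hook() are assumptions made for illustration, and a kernel version would additionally need locking and per-architecture entry code, none of which is shown. The names syscall_hook_t, HOOK_IN_FLAG, HOOK_OUT_FLAG, register_syscall_hook() and unregister() are taken from the message.

```c
/* User-space model of the proposed hook interface (illustration only). */
#include <assert.h>
#include <stddef.h>

typedef int (*syscall_hook_t)(void *arg1, void *arg2, void *arg3,
                              void *arg4, void *arg5, void *arg6);

#define HOOK_IN_FLAG  0x1   /* run before the syscall body */
#define HOOK_OUT_FLAG 0x2   /* run after the syscall body */
#define MAX_HOOKS     16    /* hypothetical fixed table size */

struct hook_slot {
    int in_use;
    int syscall_nr;
    int flags;
    syscall_hook_t fn;
};

static struct hook_slot hook_slots[MAX_HOOKS];

/* Returns an opaque handle (>= 0), or -1 on bad arguments or a full table. */
int register_syscall_hook(int syscall_nr, syscall_hook_t hook_function,
                          int flags)
{
    if (hook_function == NULL ||
        (flags & (HOOK_IN_FLAG | HOOK_OUT_FLAG)) == 0)
        return -1;
    for (int i = 0; i < MAX_HOOKS; i++) {
        if (!hook_slots[i].in_use) {
            hook_slots[i] = (struct hook_slot){ 1, syscall_nr, flags,
                                                hook_function };
            return i;
        }
    }
    return -1;
}

/* Removing a hook clears only its own slot, so removal order is irrelevant. */
int unregister(int opaquehandle)
{
    if (opaquehandle < 0 || opaquehandle >= MAX_HOOKS ||
        !hook_slots[opaquehandle].in_use)
        return -1;
    hook_slots[opaquehandle].in_use = 0;
    return 0;
}

/* Hypothetical dispatcher that the syscall entry/exit path would call.
 * Returns the number of hooks that ran. */
int run_syscall_hooks(int syscall_nr, int phase, void *args[6])
{
    int ran = 0;

    assert(phase == HOOK_IN_FLAG || phase == HOOK_OUT_FLAG);
    for (int i = 0; i < MAX_HOOKS; i++) {
        struct hook_slot *s = &hook_slots[i];
        if (s->in_use && s->syscall_nr == syscall_nr && (s->flags & phase)) {
            s->fn(args[0], args[1], args[2], args[3], args[4], args[5]);
            ran++;
        }
    }
    return ran;
}

/* Example hook: just counts how often it was called. */
static int hook_hits;
static int count_hook(void *a1, void *a2, void *a3,
                      void *a4, void *a5, void *a6)
{
    (void)a1; (void)a2; (void)a3; (void)a4; (void)a5; (void)a6;
    hook_hits++;
    return 0;
}
```

Because each caller owns a handle instead of a saved function pointer, removing one hook cannot disturb another. The model does not address ordering between hooks (first or last), which was also asked for.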
On Wed, 2003-05-07 at 21:04, Chuck Ebbert wrote:
> >> Preloading libraries, ptracing init, patching g/libc, etc. are
> >> obviously not the way to go.
> >
> > those obviously need to be implemented via the security subsystem (eg
> > LSM). Hooks are obviously the wrong level to do things and I could even
> > tell you that you cannot implement this right from a module actually.
>
> What is really needed is some kind of proper generic hooking setup
> that could be used both by LSM and other things. People doing this
> may need to intercept syscalls both on their way to the kernel and
> on the way back to userland (so they can see return codes.) They may
> also need to say whether they want to be first or last if there are
> multiple users of this facility.
>
> But the real question is why the export of sys_call_table was so
> gratuitously removed without any kind of replacement being offered.
> And the attitude of the developers about it is truly awful. ("Oh, so
> we broke the drivers you depend on for your livelihood? You can just
> go get a new job -- pounding sand down a rathole.")
--
_________________________________________________________________________
Terje Eggestad mailto:[email protected]
Scali Scalable Linux Systems http://www.scali.com
Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal +47 975 31 574 (MOBILE)
N-0619 Oslo fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________
On Thu, May 08, 2003 at 11:58:33AM +0200, Terje Eggestad wrote:
> I guess something like this:
>
> typedef int (*syscall_hook_t)(void * arg1, void * arg2, void * arg3,
> void * arg4, void * arg5, void * arg6);
>
> #define HOOK_IN_FLAG 0x1
> #define HOOK_OUT_FLAG 0x2
>
> opaquehandle = int register_syscall_hook(int syscall_nr, syscall_hook_t
> hook_function, int flags);
> int unregister(int opaquehandle);
>
> I'd make a stab at it if I knew that it stood a chance of getting
> accepted.
I don't think it has.
On Thu, May 08, 2003 at 09:59:43AM +0000, Arjan van de Ven wrote:
> On Thu, May 08, 2003 at 11:58:33AM +0200, Terje Eggestad wrote:
> > I guess something like this:
> >
> > typedef int (*syscall_hook_t)(void * arg1, void * arg2, void * arg3,
> > void * arg4, void * arg5, void * arg6);
> >
> > #define HOOK_IN_FLAG 0x1
> > #define HOOK_OUT_FLAG 0x2
> >
> > opaquehandle = int register_syscall_hook(int syscall_nr, syscall_hook_t
> > hook_function, int flags);
> > int unregister(int opaquehandle);
> >
> > I'd make a stab at it if I knew that it stood a chance of getting
> > accepted.
>
> I dont think it has.
I think it could, actually - who maintains fortunes these days? It's
a bit too long, though...
[Alan Cox]
>> 10. kernel patches are impractical, I must be able to do this with std
>> stock, redhat, AND suse kernels.
> So you want every vendor to screw up their kernels and the base kernel
> for an obscure (but fun) corner case. Thats not a rational choice is it.
> You want "performance is everything" you pay the price, don't make
> everyone suffer.
Hmm. sys_call_table is gone? That's sad.
How about a
EXPORT_SYMBOL_GPL_AND_DONT_EVEN_THINK_ABOUT_SENDING_A_BUG_REPORT(sys_call_table);
and displaying a nasty warning message on the console whenever a
module used it?
It is rare that I need to use it, but when I do I need it bad, for instance:
fsync on large files used to have severe performance problems, I was
able to just change sys_fsync to be a call to sys_sync without
rebooting or even restarting the database (Solid) before the problem
got out of hand.
A server for an online internet game had several months of uptime and
I needed to rotate the log-files so I made a module which trapped
sys_write and closed and reopened the file with a new name before
continuing[1].
Even if it is discouraged for normal use it is a very nice thing to
have to fix up various surprises.
I know I can still use the Phrack technique, but somehow I am not
convinced that I can rely on it being available.
--
- Terje
[email protected]
[1] When I do this kind of thing now I do:
(gdb) attach 9597
(gdb) call close(7)
(gdb) call open("out.txt",0100 | 01, 0666 )
(gdb) cont
This did not work back then however.
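
The gdb session in [1] relies on open() handing back the lowest free descriptor, which is 7 right after close(7). A process that can cooperate could do the same thing without that assumption by using dup2(). This is a minimal sketch; the function name reopen_fd_as() is hypothetical and the flags mirror the 0100 | 01 (O_CREAT | O_WRONLY on Linux/i386) used above.

```c
/* Sketch: point an already-open descriptor at a new file (illustration only). */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 0 on success, -1 on error. The descriptor number stays the same,
 * so code that keeps writing to fd continues to work unchanged. */
int reopen_fd_as(int fd, const char *path)
{
    int newfd;

    assert(fd >= 0 && path != NULL);
    newfd = open(path, O_WRONLY | O_CREAT, 0666);
    if (newfd < 0)
        return -1;
    if (newfd == fd)        /* fd was not open; open() already reused it */
        return 0;
    if (dup2(newfd, fd) < 0) {  /* atomically replaces whatever fd referred to */
        close(newfd);
        return -1;
    }
    close(newfd);
    return 0;
}
```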
Steffen Persvold <[email protected]> writes:
> On 7 May 2003, petter wahlman wrote:
>
> >
> > It seems like nobody belives that there are any technically valid
> > reasons for hooking system calls, but how should e.g anti virus
> > on-access scanners intercept syscalls?
> > Preloading libraries, ptracing init, patching g/libc, etc. are
> > obviously not the way to go.
> >
>
> Well, for a system wide system call hook, a kernel mechanism is necessary
> (and useful too IMHO). However for our usage (MPI) it is enough to know
> when the current process calls either sbrk(-n) or munmap glibc functions,
> thus it is sufficient to implement some kind of callback functionality for
> certain glibc functions, sort of like the malloc/free hooks but on a more
> general basis since some applications doesn't use malloc/free but
> implement their own alloc/free algorithms using the syscalls (one example
> is f90 apps).
>
> Ideas anyone ?
I think the complete list of functions to be hooked needs to be at least:
mmap(MAP_FIXED), munmap, sbrk(-n), shmat, shmdt. The mapping cases
are needed because a mmap(MAP_FIXED) can implicitly unmap an area under
them, before the new address is used.
This is not a kernel issue as this is purely a user space problem,
the kernel provides all of the necessary functionality.
I suspect what is needed is something like:
int on_unmap(void (*func)(void *start, size_t length, void *), void *arg);
With the function called before the unmap actually occurs, that way
the multi thread case is safe. It needs to be built so that multiple libraries
can cooperate cleanly.
Ulrich, what do you think? Is the above function reasonable?
Something like it is needed to manage caches of pinned memory for high
performance kernel bypass libraries.
Eric
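
As an illustration of the on_unmap() idea, the following sketch shows a callback list and one subscriber that invalidates a cached pinned buffer when an overlapping range is about to be unmapped. Only the on_unmap() signature comes from the message; notify_unmap(), MAX_UNMAP_CALLBACKS, struct pin_cache and pin_cache_invalidate() are hypothetical names, and nothing here is an existing glibc interface.

```c
/* User-space sketch of an on_unmap() callback list (illustration only). */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_UNMAP_CALLBACKS 8   /* hypothetical limit */

static struct {
    void (*func)(void *start, size_t length, void *arg);
    void *arg;
} unmap_callbacks[MAX_UNMAP_CALLBACKS];
static int unmap_callback_count;

/* Signature as proposed above. Returns 0 on success, -1 if the list is
 * full. Several libraries can register independently. */
int on_unmap(void (*func)(void *start, size_t length, void *), void *arg)
{
    if (func == NULL || unmap_callback_count == MAX_UNMAP_CALLBACKS)
        return -1;
    unmap_callbacks[unmap_callback_count].func = func;
    unmap_callbacks[unmap_callback_count].arg = arg;
    unmap_callback_count++;
    return 0;
}

/* Hypothetical: what libc would call *before* munmap(), sbrk(-n), shmdt()
 * or a mmap(MAP_FIXED) that replaces [start, start + length). */
void notify_unmap(void *start, size_t length)
{
    for (int i = 0; i < unmap_callback_count; i++)
        unmap_callbacks[i].func(start, length, unmap_callbacks[i].arg);
}

/* Example subscriber: a one-entry cache describing a pinned buffer. */
struct pin_cache {
    void *start;
    size_t length;
    int valid;
};

/* Drop the cached entry if the range being unmapped overlaps it. */
void pin_cache_invalidate(void *start, size_t length, void *arg)
{
    struct pin_cache *c = arg;
    uintptr_t a, b;

    assert(c != NULL);
    a = (uintptr_t)start;
    b = (uintptr_t)c->start;
    if (c->valid && a < b + c->length && b < a + length)
        c->valid = 0;   /* a real library would also unpin the pages here */
}
```

Calling the callbacks before the unmap happens is what keeps the multi-threaded case safe, as noted above; the sketch itself does no locking.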
On Thu, May 08, 2003 at 02:25:51PM +0200, Terje Malmedal wrote:
>
> EXPORT_SYMBOL_GPL_AND_DONT_EVEN_THINK_ABOUT_SENDING_A_BUG_REPORT(sys_call_table);
>
> and displaying a nasty warning message on the console whenever a
> module used it?
What about just adding the EXPORT_SYMBOL() yourself to your kernels
if you think you need it so badly because you can't screw yourself
enough without it?
What really gets to me is that *you* wrote in
(http://www.kernel.org/pub/linux/kernel/v2.5/ChangeLog-2.5.41):
3. Intercept system calls
OProfile (and intel's vtune which is similar in function) used to do this;
however what they really need is a notification on certain
events (exec() mostly). The way modules do this is store the original
function pointer, install a new one that calls the old one after storing
whatever info they need. This mechanism breaks badly in the light of
multiple such modules doing this versus modules
unloading/uninstalling their handlers (by restoring their saved pointer
that may or may not point to a valid handler anymore).
Eg the use of the export in this just a bandaid due to lack of a
proper mechanism, and also incorrect and crash prone.
So what you're saying here is not that you object to having people doing
syscall hooks, just that operating on the sys_call_table symbol directly
is error prone (to which I wholeheartedly agree).
Then you reject a "proper mechanism".....
TJ
On Thu, 2003-05-08 at 11:59, Arjan van de Ven wrote:
> On Thu, May 08, 2003 at 11:58:33AM +0200, Terje Eggestad wrote:
> > I guess something like this:
> >
> > typedef int (*syscall_hook_t)(void * arg1, void * arg2, void * arg3,
> > void * arg4, void * arg5, void * arg6);
> >
> > #define HOOK_IN_FLAG 0x1
> > #define HOOK_OUT_FLAG 0x2
> >
> > opaquehandle = int register_syscall_hook(int syscall_nr, syscall_hook_t
> > hook_function, int flags);
> > int unregister(int opaquehandle);
> >
> > I'd make a stab at it if I knew that it stood a chance of getting
> > accepted.
>
> I dont think it has.
--
_________________________________________________________________________
Terje Eggestad mailto:[email protected]
Scali Scalable Linux Systems http://www.scali.com
Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal +47 975 31 574 (MOBILE)
N-0619 Oslo fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________
On Thu, May 08, 2003 at 02:54:35PM +0200, Terje Eggestad wrote:
> So what you're saying here is not that you object to having people doing
> syscall hooks, just that operating on the syscall_table symbol directly
> is error prone (to which I wholeheartedly agree).
>
> Then you reject a "proper mechanism".....
Maybe you have a different notion of proper mechanism than everyone
else. BTW, you could easily have fixed your driver in the time you
spent trolling on lkml..
[Christoph Hellwig]
> On Thu, May 08, 2003 at 02:25:51PM +0200, Terje Malmedal wrote:
>>
>> EXPORT_SYMBOL_GPL_AND_DONT_EVEN_THINK_ABOUT_SENDING_A_BUG_REPORT(sys_call_table);
>>
>> and displaying a nasty warning message on the console whenever a
>> module used it?
> What about just adding the EXPORT_SYMBOL() yourself yo your kernels
> if you think you need it so badly because you can't screw yourself
> enough without it?
And if I wish to help somebody running a kernel I didn't compile?
Do you have anything constructive to say about the situation I referred
to:
A database is starting to run slower and slower, turns out that this
is because fsync() is inefficient on large files. Rebooting the server
or restarting the database is undesirable even at night.
?
I was able to fix this without rebooting or restarting the database.
How do you propose to fix something similar today without having
sys_call_table exported?
Also what exactly is the badness people are complaining about, if I do:
int init_module(void)
{
        orig_fsync = sys_call_table[SYS_fsync];
        sys_call_table[SYS_fsync] = hacked_fsync;
        return 0;
}

void cleanup_module(void)
{
        sys_call_table[SYS_fsync] = orig_fsync;
}
The only problem I can see is that different modules overloading the
same function need to be unloaded in the correct order. Is this the
only reason for removing it, or am I missing something?
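
The unload-order problem can be reproduced entirely in user space with a one-entry stand-in for the table. This is a sketch only: fake_table, struct fake_module and the handler names are hypothetical, and it models just the save/restore pattern of init_module()/cleanup_module() above, not the races or architecture differences.

```c
/* User-space model of two modules hooking the same table entry. */
#include <assert.h>
#include <stddef.h>

typedef long (*syscall_fn)(void);

static long real_fsync(void)   { return 0; }  /* stands in for sys_fsync */
static long hook_a_fsync(void) { return 1; }  /* handler owned by module A */
static long hook_b_fsync(void) { return 2; }  /* handler owned by module B */

/* One-entry stand-in for sys_call_table. */
static syscall_fn fake_table[1] = { real_fsync };

struct fake_module {
    syscall_fn handler;
    syscall_fn saved;   /* whatever was in the table when we loaded */
};

static struct fake_module mod_a = { hook_a_fsync, NULL };
static struct fake_module mod_b = { hook_b_fsync, NULL };

/* Same pattern as init_module(): save the old pointer, install ours. */
void fake_load(struct fake_module *m)
{
    m->saved = fake_table[0];
    fake_table[0] = m->handler;
}

/* Same pattern as cleanup_module(): blindly restore what we saved. */
void fake_unload(struct fake_module *m)
{
    assert(m->saved != NULL);
    fake_table[0] = m->saved;
}
```

Loading A then B and unloading A first puts real_fsync back, silently dropping B's hook. Unloading B afterwards restores B's saved pointer, which is A's handler, so the table ends up pointing into a module that is no longer loaded.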
--
- Terje
[email protected]
Al Viro wrote:
>> > I'd make a stab at it if I knew that it stood a chance of getting
>> > accepted.
>>
>> I dont think it has.
>
> I think it could, actually - who maintains fortunes these days? It's
> a bit too long, though...
Wow, Advanced Sarcasm. Must be part of the Graduate program...
Meanwhile on Win2k I can intercept any IO request by
writing a filter driver, and that driver can get control on the way
back to userspace by registering a completion routine. Such filters
can be arbitrarily chained together and can be placed either above or
below an FSD, making such things as virus detection, HSM and disk
mirroring much easier to write...
How would I do this on Linux? How would virus detection and HSM
coexist? (HSM would have to be 'above' the virus detector, since it
makes no sense to try and scan a file that's been migrated until it
gets recalled back to disk.)
On Thu, May 08, 2003 at 03:18:29PM +0200, Terje Malmedal wrote:
> And if I wish to help somebody running a kernel I didn't compile?
recompile it. binary patch it. I don't care. Linux is free software
so you're allowed to change whatever you want. Just don't annoy us
about fixing problems in mainline.
> Do you have anything constructive to say about situation i referred
> to:
>
> A database is starting to run slower and slower, turns out that this
> is because fsync() is inefficient on large files. Rebooting the server
> or restarting the database is undesirable even at night.
fix the database. hey, if you think it's so important fork the kernel.
if there's enough people that agree with you your fork will be mainline
some day. It's really _that_ easy.
> The only problem I can see is that different modules overloading the
> same function needs to be unloaded in the correct order. Is this the
> only reason for removing it, or am I missing something?
it's racy - and it doesn't work on half of the arches added over the
last years.
On Thu, May 08, 2003 at 10:08:37AM -0400, Chuck Ebbert wrote:
> Meanwhile on Win2k I can intercept any IO request by
> wrting a filter driver,
you can write a stackable filesystem on linux, too and intercept any
I/O request. You just have to do it through a sane interface, mount
and not by patching the syscall table - which you can't do under
windows either. (at least not as part of the public API).
On Thursday 08 May 2003 09:08, Chuck Ebbert wrote:
> Al Viro wrote:
> >> > I'd make a stab at it if I knew that it stood a chance of getting
> >> > accepted.
> >>
> >> I dont think it has.
> >
> > I think it could, actually - who maintains fortunes these days? It's
> > a bit too long, though...
>
> Wow, Advanced Sarcasm. Must be part of the Graduate program...
>
> Meanwhile on Win2k I can intercept any IO request by
> wrting a filter driver, and that driver can get control on the way
> back to userspace by registering a completion routine. Such filters
> can be arbitrarily chained together and can be placed either above or
> below an FSD, making such things as virus detection, HSM and disk
> mirroring much easier to write...
note the key word in the phrase "filter DRIVER". Linux modules can intercept
any I/O directed toward them. and the filesystem layer can intercept any
filesystem call. And there are filesystem modules.
M$ seems to treat everything as a disk file (even "pipes" are implemented
as temporary files).
Have you tried catching the display IO ???
HSM has existed on UNIX based machines for a long time.
> How would I do this on Linux? How would virus detection and HSM
> coexist? (HSM would have to be 'above' the virus detector, since it
> makes no sense to try and scan a file that's been migrated until it
> gets recalled back to disk.)
I would expect the same way the NFS module intercepts file system calls.
There is NO reason a custom filesystem cannot be layered over other
filesystems. It might not be done today (though the references to "userfs"
keep showing up in such discussions).
I do question the validity of virus detection though. Once examined, fix the
vulnerability. No more virus.
Virus detection can never be completely done. And it imposes a constantly
increasing overhead since you must be able to identify all pre-existing
viruses. This list of "pre-existing" will be constantly growing.
Fix the vulnerability. Then there won't be a virus.
On Thu, 8 May 2003, petter wahlman wrote:
> On Wed, 2003-05-07 at 20:33, Richard B. Johnson wrote:
> > On Wed, 7 May 2003, petter wahlman wrote:
> >
> > > On Wed, 2003-05-07 at 18:59, Richard B. Johnson wrote:
> > > > On Wed, 7 May 2003, petter wahlman wrote:
> > > >
> > > > > On Wed, 2003-05-07 at 18:00, Richard B. Johnson wrote:
> > > > > > On Wed, 7 May 2003, petter wahlman wrote:
> > > > > >
> > > > > > >
> > > > > > > It seems like nobody belives that there are any technically valid
> > > > > > > reasons for hooking system calls, but how should e.g anti virus
> > > > > > > on-access scanners intercept syscalls?
> > > > > > > Preloading libraries, ptracing init, patching g/libc, etc. are
> > > > > > ^^^^^^^^^^^^^^^^^^^
> > > > > > |________ Is the way to go. That's how
> > > > > > you communicate every system-call to a user-mode daemon that
> > > > > > does whatever you want it to do, including phoning the National
> > > > > > Security Administrator if that's the policy.
> > > > > >
> > > > > > > obviously not the way to go.
> > > > > > >
> > > > > >
> > > > > > Oviously wrong.
> > > > >
> > > > >
> > > > > And how would you force the virus to preload this library?
> > > > >
> > > > > -p.
> > > > >
> > > >
> > > > The same way you would force a virus to not be statically linked.
> > > > You make sure that only programs that interface with the kernel
> > > > thorugh your hooks can run on that particular system.
> > > >
> > >
> > > Can you please elaborate.
> > > How would you implement the access control without modifying the
> > > respective syscalls or the system_call(), and would you'r
> > > solution be possible to implement run time?
> > >
> > > Regards,
> > >
> >
> > The program loader for shared-library programs is ld.so or
> > ld-linux.so. It's the thing that mmaps the shared libraries
> > and, eventually calls _start: in the beginning of the program:
> >
> > execve("/bin/ps", ["ps"], [/* 32 vars */]) = 0
> > brk(0) = 0x804c748
> > open("/etc/ld.so.preload", O_RDONLY) = 3 <<<<<<--- your hooks here!!
> > fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> > old_mmap(NULL, 0, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = 0
> > close(3) = 0
> >
>
> That would work on dynamically linked executables, but how do you control
> access to file shares or static executables.? Denying access to the latter
> would even prevent ldconfig from running.
>
>
You can execute existing static-linked files by having ld.so execute
them. Ld.so "knows" how to execute static-linked files. You just
need to change kernel code to include the static executable magic
number with the dynamic linked magic number as requiring the
preprocessing of the dynamic linker.
The only problem is that 'init' won't start if that loader isn't
available. This is not a problem for working systems. It's just a
problem for broken ones. You use an unpatched kernel for maintenance.
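For reference, the preload approach usually comes down to a small shared object that defines its own mmap() and forwards to the real one. What follows is a minimal sketch only, assuming glibc on Linux; the counter traced_mmap_calls is a made-up name, and a real tracer would report to a daemon rather than count. As noted earlier in the thread, none of this reaches static binaries or programs that issue the system call directly.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* asks glibc to expose RTLD_NEXT */
#endif
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

#ifndef RTLD_NEXT
/* glibc's value, used only if the feature macros were fixed too early. */
#define RTLD_NEXT ((void *) -1l)
#endif

typedef void *(*mmap_fn)(void *, size_t, int, int, int, off_t);

/* Hypothetical counter standing in for "tell the user-mode daemon". */
volatile long traced_mmap_calls;

/* Same signature as the libc mmap(). When this object is preloaded, the
 * dynamic linker resolves callers to this definition first. */
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset)
{
    static mmap_fn real_mmap;

    if (real_mmap == NULL) {
        /* Find the next definition of mmap after this object (libc's). */
        real_mmap = (mmap_fn)dlsym(RTLD_NEXT, "mmap");
        assert(real_mmap != NULL);
    }
    traced_mmap_calls++;       /* the hook itself: runs before the real call */
    return real_mmap(addr, length, prot, flags, fd, offset);
}
```

Built with something like gcc -shared -fPIC trace.c -o libtrace.so -ldl (the file names are placeholders), the object can be listed in /etc/ld.so.preload or named in LD_PRELOAD. The lazy lookup of real_mmap is not synchronized; two threads may both perform it, which is harmless here because they store the same value.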
Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.
[Christoph Hellwig]
>> The only problem I can see is that different modules overloading the
>> same function needs to be unloaded in the correct order. Is this the
>> only reason for removing it, or am I missing something?
> it's racy - and it doesn't work on half of the arches added over the
> last years.
Would you be so kind as to explain exactly what is racy? Just
asserting that it is does not help me understand anything.
--
- Terje
[email protected]
On Iau, 2003-05-08 at 15:08, Chuck Ebbert wrote:
> How would I do this on Linux? How would virus detection and HSM
> coexist? (HSM would have to be 'above' the virus detector, since it
> makes no sense to try and scan a file that's been migrated until it
> gets recalled back to disk.)
Userspace
--- ptrace
VFS
Loadable file system module (which can be made to stack stuff)
Block Layer
Loadable disk driver (Which can be made to stack)
Disk
On Iau, 2003-05-08 at 13:25, Terje Malmedal wrote:
> How about a
>
> EXPORT_SYMBOL_GPL_AND_DONT_EVEN_THINK_ABOUT_SENDING_A_BUG_REPORT(sys_call_table);
It's in read-only space nowadays anyway
> A server for an online internet game had several months of uptime and
> I needed to rotate the log-files so I made a module which trapped
> sys_write and closed and reopened the file with a new name before
> continuing[1].
man ptrace
> M$ seems to treat everything as a disk file (even "pipes" are implemented
> as temporary files).
So did original Unix, it was a disk file that was anonymous and just
used the direct pointers to blocks for a ring buffer. Storing pipe data
in RAM in the old days was a hideous waste of resources.
> There is NO reason a custom filesystem cannot be layered over other
> filesystems. It might not be done today (though the references to "userfs"
> keep showing up in such discussions).
Erez Zadoz (not sure of the spelling) did some stacking fs modules on
Linux
> Fix the vulnerability. Then there won't be a virus.
But you don't know if it's fixed and if there are any more holes without
being able to detect attackers be they electronic or human.
Good day, all,
On 8 May 2003, Alan Cox wrote:
> > There is NO reason a custom filesystem cannot be layered over other
> > filesystems. It might not be done today (though the references to "userfs"
> > keep showing up in such discussions).
>
> Erez Zadoz (not sure of the spelling) did some stacking fs modules on
> Linux
Erez Zadok maintains the FiST (File System Translator) project at
http://www1.cs.columbia.edu/~ezk/research/fist/ . For those not familiar
with the project, one writes an upper level filesystem that can modify VFS
requests or results, providing a VFS proxy.
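The stacking idea can be illustrated without any kernel code. The following is a user-space sketch, not FiST output and not the VFS API: struct layer, struct layer_ops and the upper-casing filter are all invented for the example. The upper layer forwards each read to the layer below and then modifies the result, which is the proxy arrangement described above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

struct layer;

struct layer_ops {
    long (*read)(struct layer *self, char *buf, size_t len);
};

struct layer {
    const struct layer_ops *ops;
    struct layer *lower;       /* NULL for the bottom layer */
    const char *data;          /* backing "file", bottom layer only */
};

/* Bottom layer: copy up to len bytes of the backing data. */
long base_read(struct layer *self, char *buf, size_t len)
{
    size_t n = strlen(self->data);

    if (n > len)
        n = len;
    memcpy(buf, self->data, n);
    return (long)n;
}

/* Upper layer: ask the lower layer first, then rewrite what came back. */
long upper_read(struct layer *self, char *buf, size_t len)
{
    long i, n;

    assert(self->lower != NULL);
    n = self->lower->ops->read(self->lower, buf, len);
    for (i = 0; i < n; i++)
        buf[i] = (char)toupper((unsigned char)buf[i]);
    return n;
}

const struct layer_ops base_ops = { base_read };
const struct layer_ops upper_ops = { upper_read };
```

A caller only ever talks to the topmost struct layer. Adding another filter means pointing a new layer's lower field at the current top, so the order of the layers decides which filter sees the data first.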
Cheers,
- Bill
---------------------------------------------------------------------------
"All programs evolve until they can send email."
-- Richard Letts
"Except Microsoft Exchange."
-- Art
(found on the Snort web site)
--------------------------------------------------------------------------
William Stearns ([email protected]). Mason, Buildkernel, freedups, p0f,
rsync-backup, ssh-keyinstall, dns-check, more at: http://www.stearns.org
Linux articles at: http://www.opensourcedigest.com
--------------------------------------------------------------------------
On Thursday 08 May 2003 10:29, Terje Malmedal wrote:
> [Christoph Hellwig]
>
> >> The only problem I can see is that different modules overloading the
> >> same function needs to be unloaded in the correct order. Is this the
> >> only reason for removing it, or am I missing something?
> >
> > it's racy - and it doesn't work on half of the arches added over the
> > last years.
>
> Would you be so kind as to explain exactly what is racy? Just
> asserting that it is does not help me understand anything.
Look at this:
[1]int init_module(void)
[2]{
[3] orig_fsync=sys_call_table[SYS_fsync];
[4] sys_call_table[SYS_fsync]=hacked_fsync;
[5] return 0;
[6]}
Unless there is a LOCK on sys_call_table[SYS_fsync] another CPU could
replace the pointer between lines 3 and 4. At that point line 4 would
destroy the existing entry, or destroy it when the original is restored,
and would NOT be restoring the one inserted.
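One way to close the window between lines 3 and 4 is to read the old pointer and store the new one in a single atomic step. The sketch below is a user-space model using C11 atomics, not kernel code; call_table, hook_install() and hook_remove() are names invented here. It addresses only the replacement race, and removal still refuses to proceed unless the hook being removed is the topmost one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef long (*call_fn)(long arg);

#define NR_SLOTS 4
/* Stand-in for sys_call_table; every slot is an atomic function pointer. */
_Atomic(call_fn) call_table[NR_SLOTS];

/* Install a hook and return whatever was in the slot, in one atomic step,
 * so no other CPU can slip a pointer in between the read and the write. */
call_fn hook_install(size_t nr, call_fn hook)
{
    assert(nr < NR_SLOTS);
    return atomic_exchange(&call_table[nr], hook);
}

/* Put `saved` back only if `hook` is still the current entry.
 * Returns 1 on success, 0 if someone else has hooked on top of us. */
int hook_remove(size_t nr, call_fn hook, call_fn saved)
{
    call_fn expected = hook;

    assert(nr < NR_SLOTS);
    return atomic_compare_exchange_strong(&call_table[nr], &expected, saved);
}

/* Demo handlers standing in for the original call and two modules. */
long real_fsync(long fd) { return fd; }
long hook_a(long fd) { return fd + 100; }
long hook_b(long fd) { return fd + 200; }
```

If module A installs hook_a and module B then installs hook_b, hook_remove() for A returns 0 until B has been removed. The unload-order problem is reported to the caller but not solved.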
On Thursday 08 May 2003 10:22, Alan Cox wrote:
[snip]
> > Fix the vulnerability. Then there won't be a virus.
>
> But you don't know if its fixed and if there are any more holes without
> being able to detect attackers be they electronic or human.
Detecting attackers is a different situation. An attack that is already fixed
is not a serious problem other than bandwidth. Virus scanners can't do that
anyway - they can only detect what has already been detected... and which
should have been fixed by the time the signature could have been put out,
anyway. Detection should be part of an intrusion facility (isn't LIDS supposed
to do that?)
Second, I want to setup SELinux to sandbox various facilities anyway (delayed
due to job change). That should isolate any unknown attack to just one
service, and protect the overall system.
Christoph Hellwig wrote:
>Maybe you have a different notion of proper mechanism then everyone
>else.
>
Out of personal interest - would a mechanism that promised the following
be considered a "proper mechanism"?
1. Work on all platforms.
2. Allow load and unload in arbitrary order and timings (which also
means "be race free").
3. Have low/zero overhead if not used
Would you also require:
4. Have reasonable overhead when used
a "must have" demand? Would, on the other hand, a:
4b. Have zero overhead when used for functions not hooked
be an alternative demand?
I'm currently trying to work with some other subscribers of this list on
a design. Getting 1, 2 and 3 is a complicated enough task, of course. I
would like to hear estimates about inclusion chances should we manage to
come up with an implementation that lives up to all the above.
Thanks,
Shachar
--
Shachar Shemesh
Open Source integration consultant
Home page & resume - http://www.shemesh.biz/
On Thu, May 08, 2003 at 10:10:21PM +0300, Shachar Shemesh wrote:
> Christoph Hellwig wrote:
>
> >Maybe you have a different notion of proper mechanism then everyone
> >else.
> >
> Out of personal interest - would a mechanism that promised the following
> be considered a "proper mechanism"?
> 1. Work on all platforms.
> 2. Allow load and unload in arbitrary order and timings (which also
> means "be race free").
> 3. Have low/zero overhead if not used
No, the most important point is that a proper mechanism wouldn't
replace syscall slots but rather operate on kernel objects (file, inode,
vma, task_struct, etc.). Linus has expressed a few times that
he has no interest in loadable syscalls and any core developer I've
talked to agrees with that.
On Thu, May 08, 2003 at 01:13:49PM -0500, Jesse Pollard wrote:
> Unless there is a LOCK on sys_call_table[SYS_fsync] another CPU could
> replace the pointer between lines 3 and 4. At that point line 4 would
> destroy the existing entry.. or destroy it when the original is restored,
> and would NOT be restoring the one insterted.
That's the race in the replacement. The second race is in actually
using these hooks. As soon as you examine a user pointer/address
in there you're fundamentally racy vs. another thread manipulating
the user address space.
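The second race can be modelled in user space. In the sketch below (hypothetical names throughout), fetch_path() stands in for a single copy out of user memory and path_is_allowed() for whatever policy the hook applies. A decision made on the private copy stays valid for that copy, but a hook sitting in a syscall slot still has to pass the original user pointer on to the real syscall, which fetches the data again and may see something else.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PATH_LIMIT 64

/* Copy a NUL-terminated path out of "user memory" exactly once.
 * Returns 0 on success, -1 if no terminator is found within PATH_LIMIT. */
int fetch_path(char dst[PATH_LIMIT], const char *user_path)
{
    size_t i;

    assert(dst != NULL && user_path != NULL);
    for (i = 0; i < PATH_LIMIT; i++) {
        dst[i] = user_path[i];
        if (dst[i] == '\0')
            return 0;
    }
    dst[PATH_LIMIT - 1] = '\0';   /* keep the copy terminated on failure */
    return -1;
}

/* Hypothetical policy: only paths under /tmp/ are allowed. */
int path_is_allowed(const char *path)
{
    return strncmp(path, "/tmp/", 5) == 0;
}
```

If another thread rewrites the user buffer from "/tmp/log" to "/etc/shadow" after the hook has checked it, a second read of that buffer yields the forbidden path while the private copy still reads "/tmp/log". Only code that both checks and acts on the same copy is immune, which a syscall-slot hook cannot guarantee.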
> Userspace
> --- ptrace
Ptrace appears to be effectively broken on 2.4.21-rc -- I can't strace
child processes that fork even as root, anyway.
> Block Layer
> Loadable disk driver (Which can be made to stack)
I'm sorry but I've been looking at the md code for about six months
and the 'big picture' of how it's doing what it does escapes me. The
code in md.c:lock_rdev(), for example -- looks like an incredibly deep
understanding of how all the block code works is needed to write a
driver like this.
> you can write a stackable filesystem on linux, too and intercept any
> I/O request. You just have to do it through a sane interface, mount
> and not by patching the syscall table - which you can do under
> windows either. (at least not as part of the public API).
So when I register my filesystem, can I indicate that I want to be
layered over top of the ext3 driver and get control anytime someone
mounts an ext3 filesystem, so I can decide whether the volume being
mounted is one that I want to intercept open/read/write requests for?
On Thu, May 08, 2003 at 03:43:37PM -0400, Chuck Ebbert wrote:
> > you can write a stackable filesystem on linux, too and intercept any
> > I/O request. You just have to do it through a sane interface, mount
> > and not by patching the syscall table - which you can do under
> > windows either. (at least not as part of the public API).
>
> So when I register my filesystem, can I indicate that I want to be
> layered over top of the ext3 driver
Yes.
> and get control anytime someone
> mounts an ext3 fileystem,
no.
> Have you tried catching the display IO ???
Not in a million years -- display drivers work by pure magic AFAIC.
> HSM has existed on UNIX based machines for a long time.
Show me three HSM implementations for Linux and I'll show you three
different mechanisms. :)
On Thu, May 08, 2003 at 03:43:38PM -0400, Chuck Ebbert wrote:
> > HSM has existed on UNIX based machines for a long time.
>
> Show me three HSM implementations for Linux and I'll show you three
> different mechanisms. :)
http://oss.sgi.com/cgi-bin/cvsweb.cgi/linux-2.4-xfs/linux/fs/xfs/dmapi/
for the XFS dmapi implementation. Both SGI and IBM will sell you
full-fledged HSM implementations built on top of that.
On 05.08, Christoph Hellwig wrote:
> On Thu, May 08, 2003 at 10:10:21PM +0300, Shachar Shemesh wrote:
> > Christoph Hellwig wrote:
> >
> > >Maybe you have a different notion of proper mechanism then everyone
> > >else.
> > >
> > Out of personal interest - would a mechanism that promised the following
> > be considered a "proper mechanism"?
> > 1. Work on all platforms.
> > 2. Allow load and unload in arbitrary order and timings (which also
> > means "be race free").
> > 3. Have low/zero overhead if not used
>
> No, the most important point is that a proper meachanism wouldn't
> replace syscall slots but rather operate on kernel objects (file, inode
> vma, task_struct, etc..). Linus has expressed a few times that
> he has no interest in loadable syscalls and any core developer I've
> talked to agrees with that.
>
I haven't followed the whole thread, so I don't know if somebody has already
said this, but all this thing about hooks looks perfect for projects like
bproc or mosix, have you talked to them ?
(perhaps Erik Hendriks <[email protected]> -bproc- is following the thread...;) )
--
J.A. Magallon <[email protected]> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.21-rc1-jam2 (gcc 3.2.2 (Mandrake Linux 9.2 3.2.2-5mdk))
On Iau, 2003-05-08 at 20:43, Chuck Ebbert wrote:
> > you can write a stackable filesystem on linux, too and intercept any
> > I/O request. You just have to do it through a sane interface, mount
> > and not by patching the syscall table - which you can do under
> > windows either. (at least not as part of the public API).
>
> So when I register my filesystem, can I indicate that I want to be
> layered over top of the ext3 driver and get control anytime someone
> mounts an ext3 fileystem, so I can decide whether the volume being
> mounted is one that I want to intercept open/read/write requests for?
That would assume you had a right to dictate that the administrator
couldn't mount other file systems without your stacking.
On Thu, May 08, 2003 at 08:15:09PM +0100, Christoph Hellwig wrote:
> No, the most important point is that a proper meachanism wouldn't
> replace syscall slots but rather operate on kernel objects (file, inode
> vma, task_struct, etc..). Linus has expressed a few times that
> he has no interest in loadable syscalls and any core developer I've
> talked to agrees with that.
For some usages, hijacking syscalls, and not kernel objects, is the
desired outcome. For example, ptrace is great for telling you what a
given process (or its children) did, but it's entirely inadequate for
telling you *which* process did something. Something, in this case,
which doesn't have an associated kernel object.
For example, a rogue process is calling settimeofday() on your router
once a month(!). How are you going to find it? There's no LSM hook for
settimeofday() or any other way to say "don't do that", if it's
running as root. With syscalltrack, or anything else which hijacks
system calls, not just kernel objects, finding the culprit is trivial.
I've been staying out of this discussion, even though I have an
interest in its outcome. Talking about it is completely pointless
until someone writes a proper, *technically correct*, system call
hijacking interface. Then we can argue about whether or not it should
go in.
--
Muli Ben-Yehuda
http://www.mulix.org
On Thu, May 08, 2003 at 11:48:11PM +0200, J.A. Magallon wrote:
> Don't have followed the whole thread, so I don't know if somebody has already
> said this, but all this thing about hooks looks perfect for projects like
> bproc or mosix, have you talked to them ?
> (perhaps Erik Hendriks <[email protected]> -bproc- is following the
> thread...;) )
I don't know about bproc, but at least MOSIX is, as it currently
stands, a kernel patch. Therefore, they can (and do) hijack the
syscall table safely. If I remember the code correctly, they do it in
entry.S, not in the sys_call_table itself.
--
Muli Ben-Yehuda
http://www.mulix.org
Christoph Hellwig wrote:
>> So when I register my filesystem, can I indicate that I want to be
>> layered over top of the ext3 driver
>
> Yes.
>
>> and get control anytime someone
>> mounts an ext3 fileystem,
>
> no.
Does a layered filesystem get all requests for ext3 IO if it's above
it then, or does someone have to manually mount it for each volume?
On Fri, May 09, 2003 at 03:50:31AM -0400, Chuck Ebbert wrote:
> Does a layered filesystem get all requests for ext3 IO if it's above
> it then, or does someone have to manually mount it for each volume?
after you mounted it you get all I/O requests below the mountpoint.
Alan Cox wrote:
>> So when I register my filesystem, can I indicate that I want to be
>> layered over top of the ext3 driver and get control anytime someone
>> mounts an ext3 fileystem, so I can decide whether the volume being
>> mounted is one that I want to intercept open/read/write requests for?
>
> That would assume you had a right to dictate that the administrator
> couldnt mount other file systems without your stacking.
Security-sensitive upper layers like virus scanners and loggers
would want to do it that way. The upper layer might even just log
the fact that mount happened and then stay out of the way after that.
On Fri, May 09, 2003 at 03:50:31AM -0400, Chuck Ebbert wrote:
> Security-sensitive upper layers like virus scanners and loggers
> would want to do it that way. The upper layer might even just log
> the fact that mount happened and then stay out of the way after that.
Maybe _they_ want it. We don't want it, though.
[Alan Cox]
> On Iau, 2003-05-08 at 13:25, Terje Malmedal wrote:
>> How about a
>>
>> EXPORT_SYMBOL_GPL_AND_DONT_EVEN_THINK_ABOUT_SENDING_A_BUG_REPORT(sys_call_table);
> Its in read only space nowdays anyway
>> A server for an online internet game had several months of uptime and
>> I needed to rotate the log-files so I made a module which trapped
>> sys_write and closed and reopened the file with a new name before
>> continuing[1].
> man ptrace
Did not work on multi-threaded programs at the time (at least strace
didn't); nowadays I'd use gdb as I explained in the original mail.
The point is that I used to get a convenient mechanism for working
around or fixing bugs in the kernel and applications without any
downtime. This is a very useful capability to have even if it is
only needed rarely.
Is there any reasonable way to modify a running kernel
like this? I assume I can't count on the method from Phrack working
forever...
--
- Terje
[email protected]
>> Does a layered filesystem get all requests for ext3 IO if it's above
>> it then, or does someone have to manually mount it for each volume?
>
> after you mounted it you get all I/O requests below the mountpoint.
So it's not 'layer a filesystem over another one' it's 'mount an
instance of a filesystem over another instance' then. And this means
it gets mounted twice with two different mountpoint names, right?
[Jesse Pollard]
> On Thursday 08 May 2003 10:29, Terje Malmedal wrote:
>> [Christoph Hellwig]
>>
>> >> The only problem I can see is that different modules overloading the
>> >> same function needs to be unloaded in the correct order. Is this the
>> >> only reason for removing it, or am I missing something?
>> >
>> > it's racy - and it doesn't work on half of the arches added over the
>> > last years.
>>
>> Would you be so kind as to explain exactly what is racy? Just
>> asserting that it is does not help me understand anything.
> Look at this:
> [1]int init_module(void)
> [2]{
> [3] orig_fsync=sys_call_table[SYS_fsync];
> [4] sys_call_table[SYS_fsync]=hacked_fsync;
> [5] return 0;
> [6]}
> Unless there is a LOCK on sys_call_table[SYS_fsync] another CPU could
> replace the pointer between lines 3 and 4. At that point line 4 would
> destroy the existing entry.. or destroy it when the original is restored,
> and would NOT be restoring the one insterted.
Yes, that's actually part of what I meant with my first quoted
paragraph above. Did not come out that way, sorry.
I can see one problem on unload, if the memory used by the module is
freed while another CPU is executing the module code, but it should be
easy enough to protect against with a lock.
Assuming that loads and unloads are manually serialized in the
correct order are there any other problems?
--
- Terje
[email protected]
Terje wrote:
> Is there any reasonable way to be able to do modify a running kernel
> like this? I assume I can't count on the method from Phrack working
> forever...
The Phrack method involves following int 0x80 and then looking for
an instruction in the syscall code that points to the table. (Check the
archives for pt_fix.c that I posted about a month ago.) Note that it's
trivial to break this too; I planned to post a patch to do just that
but never got around to it...
On Fri, May 09, 2003 at 05:11:57AM -0400, Chuck Ebbert wrote:
> So it's not 'layer a filesystem over another one' it's 'mount an
> instance of a filesystem over another instance' then. And this means
> it gets mounted twice with two different mountpoint names, right?
It gets mounted twice with either the same or different mountpoint
names. You can have multiple mountpoints with the same path; only
the topmost one will be seen by userland.
Shachar Shemesh wrote:
> I'm currently trying to work with some other subscribers of this list on
> a design. Getting 1, 2 and 3 is a complicated enough task, of course. I
> would like to hear estimates about inclusion chances should we manage to
> come up with an implmentation that lives up to all the above.
How many users would want to actually modify the syscall parameters
or change visible system behavior when a syscall happens?
Maybe something like this would work?
1. You can register to be notified when a syscall occurs,
either before or after or both.
2. The only action you can take must be 'private' (within
your driver or subsystem.)
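The two rules above could look roughly like the following. This is a user-space sketch of a possible interface, not an existing kernel API: sc_register(), sc_dispatch() and struct sc_event are invented names. Watchers are called before and after the handler and receive a const event, so they can record what happened but cannot alter arguments or the return value.

```c
#include <assert.h>
#include <stddef.h>

enum sc_phase { SC_BEFORE, SC_AFTER };

struct sc_event {
    int nr;      /* syscall number */
    long arg;    /* first argument, read-only for watchers */
    long ret;    /* return value, meaningful only in SC_AFTER */
};

typedef void (*sc_notify_fn)(enum sc_phase phase,
                             const struct sc_event *ev, void *priv);

#define MAX_WATCHERS 8

static struct {
    int nr;
    sc_notify_fn fn;
    void *priv;
} watchers[MAX_WATCHERS];
static size_t nr_watchers;

/* Register fn for syscall nr. Returns 0, or -1 if the table is full. */
int sc_register(int nr, sc_notify_fn fn, void *priv)
{
    assert(fn != NULL);
    if (nr_watchers == MAX_WATCHERS)
        return -1;
    watchers[nr_watchers].nr = nr;
    watchers[nr_watchers].fn = fn;
    watchers[nr_watchers].priv = priv;
    nr_watchers++;
    return 0;
}

static void sc_notify(enum sc_phase phase, const struct sc_event *ev)
{
    size_t i;

    for (i = 0; i < nr_watchers; i++)
        if (watchers[i].nr == ev->nr)
            watchers[i].fn(phase, ev, watchers[i].priv);
}

/* Run handler(arg) with notifications on both sides. The handler always
 * gets the original argument and its result is returned unchanged. */
long sc_dispatch(int nr, long arg, long (*handler)(long))
{
    struct sc_event ev = { nr, arg, 0 };

    sc_notify(SC_BEFORE, &ev);
    ev.ret = handler(arg);
    sc_notify(SC_AFTER, &ev);
    return ev.ret;
}

/* Demo watcher: counts notifications in its private data. */
void count_events(enum sc_phase phase, const struct sc_event *ev, void *priv)
{
    (void)phase;
    (void)ev;
    ++*(int *)priv;
}

/* Demo handler standing in for the real syscall body. */
long demo_handler(long arg) { return arg * 2; }
```

The model leaves out unregistration and locking, which is where the load/unload ordering questions raised earlier in the thread would come back.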
Christoph Hellwig wrote:
> You can have multiple mountspoints with the same path, only
> the topmost one will be seen by userland.
What keeps users from opening files before the upper layer
filesystems get mounted? And how do you handle user-mountable
media like CD-ROMS?
On Fri, May 09, 2003 at 08:41:13AM -0400, Chuck Ebbert wrote:
> Christoph Hellwig wrote:
>
> > You can have multiple mountspoints with the same path, only
> > the topmost one will be seen by userland.
>
> What keeps users from opening files before the upper layer
> filesystems get mounted?
Nothing. Why would we want to do such silly things?
> And how do you handle user-mountable
> media like CD-ROMS?
look at supermount for a stackable filesystem that does nothing but
dealing with such media :) It also shows how the underlying fs
can be mounted without ever exposing it to userspace..
On Gwe, 2003-05-09 at 08:50, Chuck Ebbert wrote:
> Security-sensitive upper layers like virus scanners and loggers
> would want to do it that way. The upper layer might even just log
> the fact that mount happened and then stay out of the way after that.
What makes you say that? If the administrator has full privileges then
it's kind of irrelevant trying to force anything "for security reasons"
On Thursday 08 May 2003 14:43, Chuck Ebbert wrote:
> > Have you tried catching the display IO ???
>
> Not in a million years -- display drivers work by pure magic AFAIC.
>
> > HSM has existed on UNIX based machines for a long time.
>
> Show me three HSM implementations for Linux and I'll show you three
> different mechanisms. :)
Actually... I think they all use the same one (Even the Solaris/IRIX/Cray ones
do that). All of them provide a filesystem interface via VFS. The Linux ones
were implemented via the "userfs" core or NFS.
There is also OpenAFS which gives access to a remote HSM... but that can
be considered just an NFS equivalent.
On Fri, May 09, 2003 at 08:53:04AM -0500, Jesse Pollard wrote:
> On Thursday 08 May 2003 14:43, Chuck Ebbert wrote:
> > > Have you tried catching the display IO ???
> >
> > Not in a million years -- display drivers work by pure magic AFAIC.
> >
> > > HSM has existed on UNIX based machines for a long time.
> >
> > Show me three HSM implementations for Linux and I'll show you three
> > different mechanisms. :)
>
> Actually... I think they all use the same one (Even the Solaris/IRIX/Cray ones
> do that). All of them provide a filesystem interface via VFS. The Linux ones
> were implemented via the "userfs" core or NFS.
I'm not sure what you mean by "via" VFS, but most HSM implementations on Linux
require extra interfaces and special support in the filesystem (XDSM or proprietary).
The only exceptions I know are openXDSM, which was intended to be a
generic interface in the VFS layer (but we never got time to implement
it), and an implementation based on a stackable filesystem.
I don't know if the latter is actually in use anywhere, or if it is
abandoned.
--
Ragnar Kjørstad
Zet.no
On Fri, May 09, 2003 at 10:42:08AM +0300, Muli Ben-Yehuda wrote:
>
> For example, a rogue process is calling settimeofday() on your router
> once a month(!). How are you going to find it? There's no LSM hook for
> settimeofday()
Yes there is. Check the capable hook for CAP_SYS_TIME. LSM modules can
get that info quite easily.
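As a rough illustration of that suggestion, here is a user-space model of a capable-style hook that records who asked for CAP_SYS_TIME. It is a sketch only: struct demo_task and demo_capable() are invented stand-ins, and the real LSM hook signature depends on the kernel version. The value 25 is CAP_SYS_TIME as defined in linux/capability.h.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_CAP_SYS_TIME 25   /* same value as CAP_SYS_TIME */
#define LOG_SLOTS 16

struct demo_task {
    int pid;
    const char *comm;          /* command name */
};

/* Tasks that requested CAP_SYS_TIME, oldest first. */
static struct demo_task time_log[LOG_SLOTS];
static size_t time_log_len;

/* Modelled on a capable() check: called whenever a task needs `cap`.
 * Records the caller for CAP_SYS_TIME and always allows (returns 0),
 * so it audits without changing behaviour. */
int demo_capable(const struct demo_task *task, int cap)
{
    assert(task != NULL);
    if (cap == DEMO_CAP_SYS_TIME && time_log_len < LOG_SLOTS)
        time_log[time_log_len++] = *task;
    return 0;
}
```

In the settimeofday() scenario, the log would end up holding the pid and command name of whichever process requested the capability, without any syscall slot being replaced.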
thanks,
greg k-h
Alan Cox wrote:
>> Security-sensitive upper layers like virus scanners and loggers
>> would want to do it that way. The upper layer might even just log
>> the fact that mount happened and then stay out of the way after that.
>
> What makes you say that. If the administrator has full priviledges then
> its kind of irrelevant trying to force anything "for security reasons"
Check out the NSA's guide for securing Win2k machines sometime. They
go through all kinds of steps to separate auditing and administration
even though an administrator can get around them and play with the audit
trail anyway. It raises the bar and removes the defense of plausible
deniability if an admin gets caught (he can hardly claim it was an
'accident' that he granted himself audit privileges and then used them
to tamper with the audit log.)
1. Create a new group: Auditors
2. Grant these rights to Auditors:
Take ownership of files; Manage auditing
3. Create a new user: Auditor, and put it in these groups:
Users; Auditors
4. Log on as Auditor and take ownership of
%systemroot%\system32\config\SecEvent.Evt
5. Set permissions on that security logfile:
a. System: full control
b. Administrators: no access
c. Auditors: full control
6. Now log on as an administrator and take away these rights:
a. from Administrators: Manage auditing
b. from Auditors: Take ownership of files
7. Enable these extra security options:
a. crash on audit failure
b. clear page file on shutdown
c. full privilege auditing
d. lots more...
After setting up auditing and ACLs (many pages of directions for that)
the audit duties are done by unprivileged users and administrators
cannot see or alter the audit trail.
Seems like a lot of useless work given that the admins can grant
themselves any rights they want, doesn't it?
On Fri, 09 May 2003 13:18:38 BST, Alan Cox said:
> What makes you say that. If the administrator has full priviledges then
> its kind of irrelevant trying to force anything "for security reasons"
Many security models require that there *not* be one person who has "full"
privileges (for obvious reasons).
Christoph Hellwig wrote:
> > What keeps users from opening files before the upper layer
> > filesystems get mounted?
>
> Nothing. Why would we want to do such silly things?
If I installed a virus scanner I would hope it could do that, otherwise
I might screw up and forget to set up all the layering just once and
get infected. If I can tell it "protect all local filesystems" then
I can forget about it from then on.
On Fri, 9 May 2003, Chuck Ebbert wrote:
> Alan Cox wrote:
>
> >> Security-sensitive upper layers like virus scanners and loggers
> >> would want to do it that way. The upper layer might even just log
> >> the fact that mount happened and then stay out of the way after that.
> >
> > What makes you say that. If the administrator has full priviledges then
> > its kind of irrelevant trying to force anything "for security reasons"
>
> Check out the NSA's guide for securing Win2k machines sometime. They
> go through all kinds of steps to separate auditing and administration
> even though an administrator can get around them and play with the audit
> trail anyway. It raises the bar and removes the defense of plausible
> deniability if an admin gets caught (he can hardly claim it was an
> 'accident' that he granted himself audit privileges and then used them
> to tamper with the audit log.)
>
> 1. Create a new group: Auditors
> 2. Grant these rights to Auditors:
> Take ownership of files; Manage auditing
> 3. Create a new user: Auditor, and put it in these groups:
> Users; Auditors
> 4. Log on as Auditor and take ownership of
> %systemroot%\system32\config\SecEvent.Evt
> 5. Set permissions on that security logfile:
> a. System: full control
> b. Administrators: no access
> c. Auditors: full control
> 6. Now log on as an administrator and take away these rights:
> a. from Administrators: Manage auditing
> b. from Auditors: Take ownership of files
> 7. Enable these extra security options:
> a. crash on audit failure
> b. clear page file on shutdown
> c. full privilege auditing
> d. lots more...
>
> After setting up auditing and ACLs (many pages of directions for that)
> the audit duties are done by unprivileged users and administrators
> cannot see or alter the audit trail.
>
> Seems like a lot of useless work given that the admins can grant
> themselves any rights they want, doesn't it?
It's to make it look like it's secure. If you have a system
with no executable software, other than what is built in,
that can't execute other software because there is no built-
in code to do it, the system is still only secure when it's
powered OFF. Microsoft installed Magic Lantern software within
the kernel and within all software updates (Service Pack 2 of
Win/2000/Prof) as part of their deal with the Justice Department
when our attention was diverted after 9/11. This allows any
"Duly Authorized...." person to extract the contents of anything,
any time it's on the network. Magic Lantern is the hook for
"Carnivore" (and others) that uses bits in the same packet area
as the ECN bits to tell the M$ kernel not to forward the attached
packet on to mail, but to use it as a command for an internal spy
engine that sends information back using the same methods. Since
almost all firewalls are open to the mail port, this is the way the
United States Government can look into every machine on the network.
This was originally conceived as a keystroke logger and hook for
Magic Eye, Network Eye, PC Anywhere, and other such "tools".
Originally the FBI actually had to enter your home and install
it. Now it happens automatically to everybody, just by installing
Windows.
So, there is a device in 30 percent of American Homes that
violates personal privacy, thanks to John Ashcroft and the
do-nothing-but-spend Congress that allows him to rape, rob,
pillage and plunder without respect for the Constitution.
Just think what's happening right now in countries where you
are not allowed to speak of such things!
Since there are so many machines, interconnected, there is
some security-by-obscurity, but once the Justice Department
has an idea of what machine to look into, the rest is
trivial.
It is absolutely absurd that somebody thinks that they can
"secure" a Win2k machine. It is absolutely impossible. Any
such document from the NSA has got to be a ruse, used to
make the "enemy" think that some machine they mucked with
is secure. They certainly know better.
Incidentally, I bought a cable modem for somebody in
California (I live near New Hampshire, no sales tax). Just
for kicks, I connected it to my cable and hooked up my
lap-top which uses M$ and dynamic IPs. In a few seconds
I had a free connection to the Internet. Even the networks
don't have any security! They don't even know who their
"customers" are! I could be running a terrorist cell off
that connection and nobody would know. But, in the meantime
the government arrests some (ex) Intel Software Engineer
and charges him with supporting terriorism because he had
some, possibly-connected, person mow his lawn (See this
weeks' EETimes). It should scare the hell out of everybody.
Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.
On Fri, May 09, 2003 at 01:08:08AM -0700, Greg KH wrote:
> On Fri, May 09, 2003 at 10:42:08AM +0300, Muli Ben-Yehuda wrote:
> >
> > For example, a rogue process is calling settimeofday() on your router
> > once a month(!). How are you going to find it? There's no LSM hook for
> > settimeofday()
>
> Yes there is. Check the capable hook for CAP_SYS_TIME. LSM modules can
> get that info quite easily.
Indeed, I missed the fact that LSM modules have a capable
hook. Nonetheless, my original point stands: LSM and hooking kernel
objects are great for security and auditing, but hijacking system calls
can be quite useful for debugging, both kernel and userspace.
Thanks,
Muli.
--
Muli Ben-Yehuda
http://www.mulix.org
On Fri, 09 May 2003 14:27:22 EDT, "Richard B. Johnson" said:
> powered OFF. Microsoft installed Magic Lantern software within
> the kernel and within all software updates, (service Pack 2 of
> Win/2000/Prof as part of their deal with the Justice Department
> when our attention was diverted after 9/11. This allows any
> "Duly Authorized...." person to extract the contents of anything,
> any time it's on the network. Magic Lantern is the hook for
> "Carnivore" (and others) that uses bits in the same packet area
> as the ECN bits to tell the M$ kernel not to forward the attached
> packet on to mail, but to use it as a command for an internal spy
> engine that sends information back using the same methods. Since
Umm.. I'm well known as being both a Microsoft and US Govt basher myself, but...
Is this tinfoil-helmet time, or do you have any evidence to back this up?
On Fri, 9 May 2003 [email protected] wrote:
> On Fri, 09 May 2003 14:27:22 EDT, "Richard B. Johnson" said:
>
> > powered OFF. Microsoft installed Magic Lantern software within
> > the kernel and within all software updates, (service Pack 2 of
> > Win/2000/Prof as part of their deal with the Justice Department
> > when our attention was diverted after 9/11. This allows any
> > "Duly Authorized...." person to extract the contents of anything,
> > any time it's on the network. Magic Lantern is the hook for
> > "Carnivore" (and others) that uses bits in the same packet area
> > as the ECN bits to tell the M$ kernel not to forward the attached
> > packet on to mail, but to use it as a command for an internal spy
> > engine that sends information back using the same methods. Since
>
> Umm.. I'm well known as being both a Microsoft and US Govt basher myself, but...
>
> Is this tinfoil-helmet time, or do you have any evidence to back this up?
>
Google search isn't enough? Try Magic Lantern, get past the stuff
about old-time movies and onto ZDNET-news, Electronic Privacy Information
Center, etc., etc. Even the FBI admits it, check out the court-orders,
etc. http://usgovinfo.about.com/library/news
Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.
On Fri, 09 May 2003 15:18:59 EDT, "Richard B. Johnson" said:
> Google search isn't enough? Try Magic Lantern, get past the stuff
> about old-time movies and onto ZDNET-news, Electronic Privacy information
> center, etc., etc. Even the FBI admits it, check out the court-orders,
> etc. http://usgovinfo.about.com/library/news
The FBI admits Magic Lantern exists.
What I wanted was verification that Microsoft actually *SHIPPED* it as
part of SP2.
Tinfoil Hat wrote:
> Microsoft installed Magic Lantern software within
> the kernel and within all software updates, (service Pack 2 of
> Win/2000/Prof as part of their deal with the Justice Department
> when our attention was diverted after 9/11.
SP2 was released on 4 May 2001 and installed on this computer on
15 May 2001. :)
Hello every one:
This isn't really a reply to Alan Cox; I am just adding my own two bits
worth to the debate.
I would like to say that sys_call_table export should not disappear, as
overriding system calls on-the-fly through loadable modules does have some
very practical applications. This is despite the fact that one has to
write a slightly "ugly cpu-specific glue" to make it happen :-)
Case in point: I wrote a security module for Linux that overrides _all_
237 system calls to audit and control their use on a per-uid basis. It
checks whether the user is actually allowed to make the system call, then
either returns -EPERM or jumps to the system call proper.
I'm not sure whether anyone would be interested in the details of my
implementation, so I won't get into it, but there are a few things
which may be of interest from the perspective of the aesthetic issues:
1. All 237 system calls are successfully overridden, audited, and
controlled through precisely one overriding function (there aren't 237
calls replacing originals in my implementation).
2. The parameters passed to the calls are analysable.
3. The return values from system call proper as well as any
values returned on the stack are analysable.
4. The module can be safely loaded and unloaded, as well as functionality
restored because it is (believe it or not :-) possible to track usage.
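To make the "one overriding function" idea concrete, here is a minimal user-space sketch. It is not the module described above, whose details were not posted; the table size, the uid range, the toy calls, and every name in it (dispatch, deny_call, deny_mask, call_count) are assumptions made for illustration only.

```c
#include <assert.h>
#include <errno.h>

#define NR_CALLS 4   /* toy table; the post talks about 237 real entries */
#define MAX_UID  8   /* toy uid range, purely illustrative */

typedef long (*syscall_fn)(long arg);

/* Stand-ins for real system calls. */
static long toy_getpid(long arg) { (void)arg; return 1234; }
static long toy_ptrace(long arg) { return arg; }
static long toy_unlink(long arg) { (void)arg; return 0; }
static long toy_mmap(long arg)   { return arg * 2; }

/* Saved "original" entry points, indexed by call number. */
static syscall_fn original_table[NR_CALLS] = {
    toy_getpid, toy_ptrace, toy_unlink, toy_mmap
};

static unsigned long deny_mask[MAX_UID];      /* bit n set: uid may not make call n */
static unsigned long call_count[NR_CALLS];    /* simple usage tracking */

/* Forbid call number nr for the given uid. */
static void deny_call(unsigned uid, unsigned nr)
{
    assert(uid < MAX_UID && nr < NR_CALLS);
    deny_mask[uid] |= 1UL << nr;
}

/* The single wrapper: count, check the per-uid ACL, then forward. */
static long dispatch(unsigned uid, unsigned nr, long arg)
{
    if (nr >= NR_CALLS)
        return -ENOSYS;
    if (uid >= MAX_UID)
        return -EINVAL;
    call_count[nr]++;
    if (deny_mask[uid] & (1UL << nr))
        return -EPERM;              /* denied before the real call runs */
    return original_table[nr](arg); /* jump to the system call proper */
}
```

With deny_call(5, 1) in place, dispatch(5, 1, 7) returns -EPERM while dispatch(0, 1, 7) still reaches the toy ptrace. Note that the decision uses only the uid and the call number, never data fetched from user memory.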
Now as to why do it as a module, rather than patching the kernel? Well
there are several good reasons, but the most obvious is the reason why
modules were created in the first place: Namely, that new functionality
can be added to a system without having to shut it down, reboot it, or for
that matter interrupt services.
Specifically this module saved a number of systems from ptrace denial of
service attack by simply disallowing ptrace to "untrusted" users on the
systems without any fuss or muss. (There are other more interesting and
exotic uses).
I would also like to say that Linux modules should not be limited to just
device drivers, even though the module infrastructure may have been
originally conceived that way. The beauty of Linux modules, combined with
the monolithic kernel approach is their ability to expand the kernel on
the fly. Heck, who knows, there may be a day when we can simply swap an
entirely brand new kernel into place and simply continue from where the
previous kernel left off.
IMHO, the question really boils down to this: "What is a good reason to
eliminate this ability, when it obviously provides useful functionality
with some care?"
Saying that kernel programmers are not careful enough to "use it
correctly" is a bit condescending. Removing it for esthetics or for an
obscure notion of "good programming practices" is not a reasonable enough
argument either since we've definitely used a monolithic kernel design at
the largest scale of the spectrum and gotos at the smallest end, because
we favour practicality vs. beauty, hands down, so what gives here?
Now, if the maintainers (esp. Linus or Alan) simply wanna do it "just
'cause", who can really argue with them? (well, I might a bit :-).
Cheers,
Ahmed Masud.
On Gwe, 2003-05-09 at 18:07, [email protected] wrote:
> On Fri, 09 May 2003 13:18:38 BST, Alan Cox said:
>
> > What makes you say that. If the administrator has full priviledges then
> > its kind of irrelevant trying to force anything "for security reasons"
>
> Many security models require that there *not* be one person who has "full"
> privileges (for obvious reasons).
And SELinux already lets you enforce such a policy.
On Sat, 2003-05-10 at 16:38, Ahmed Masud wrote:
> Case in point, I wrote a security module for Linux that overrides _all_
> 237 systemcalls to audit and control the use of the system calls on a per
> uid basis. (i.e. if the user was actually allowed to make the system call
> or not) and return -EPERM or jump to system call proper.
I'm pretty sure that auditing by your module can easily be avoided.
example: pseudocode for the unlink syscall
long your_wrapped_syscall(char *userfilename)
{
        char kernelpointer[something];
        copy_from_user(kernelpointer, userfilename, ...);
        audit_log(kernelpointer);
        return original_syscall(userfilename);
}
Now... the original syscall does ANOTHER copy_from_user().
E.g. I can easily fool your logging by having a second thread change the
filename between the time your code copies it and the time the original
syscall copies it again. The chances of getting the timing right are 50%
at least (been there, done that ;)
The only solution for this is to check/audit/log things after the ONE
copy, i.e. not by overriding the syscall but inside the syscall.
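The double fetch can be shown deterministically in user space. The sketch below is a simulation under stated assumptions, not kernel code: user_memory stands in for the caller's buffer, fetch_from_user for copy_from_user(), and the racer hook for the second thread that happens to run between the two copies. All names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NAME_LEN 64

static char user_memory[NAME_LEN];  /* simulated user-space buffer */
static char audited[NAME_LEN];      /* name the wrapper logged */
static char acted_on[NAME_LEN];     /* name the "syscall" really used */

/* Simulated copy_from_user(): bounded, always NUL-terminated. */
static void fetch_from_user(char *dst, const char *src)
{
    assert(dst != src);
    snprintf(dst, NAME_LEN, "%s", src);
}

/* The original syscall fetches the name again from user memory. */
static long original_unlink(const char *userfilename)
{
    fetch_from_user(acted_on, userfilename);
    return 0;
}

/* Stands in for a second thread scheduled between the two fetches. */
typedef void (*racer_fn)(void);

static void flip_name(void)
{
    snprintf(user_memory, NAME_LEN, "%s", "/etc/passwd");
}

static long wrapped_unlink(const char *userfilename, racer_fn racer)
{
    fetch_from_user(audited, userfilename);  /* fetch #1: audit copy */
    if (racer)
        racer();                             /* the attacker wins the race here */
    return original_unlink(userfilename);    /* fetch #2: the copy that counts */
}

/* Returns 1 when the log and the action disagree. */
static int demo_race(void)
{
    snprintf(user_memory, NAME_LEN, "%s", "/tmp/harmless");
    wrapped_unlink(user_memory, flip_name);
    return strcmp(audited, acted_on) != 0;
}
```

After demo_race() the audit record names /tmp/harmless while the simulated syscall acted on /etc/passwd. A real attack has to win the timing; the hook just makes the winning interleaving explicit.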
Hi Arjan,
On 10 May 2003, Arjan van de Ven wrote:
> On Sat, 2003-05-10 at 16:38, Ahmed Masud wrote:
>
> > Case in point, I wrote a security module for Linux that overrides _all_
> > 237 systemcalls to audit and control the use of the system calls on a per
> > uid basis. (i.e. if the user was actually allowed to make the system call
> > or not) and return -EPERM or jump to system call proper.
>
> I'm pretty sure that auditing by your module can easily be avoided.
>
> examle: pseudocode for the unlink syscall
>
> long your_wrapped_syscall(char *userfilename)
> {
> char kernelpointer[something];
> copy_from_user(kernelpointer, usefilename, ...);
> audit_log(kernelpointer);
> return original_syscall(userfilename);
> }
>
> now.... the original syscall does ANOTHER copy_from_user().
> Eg I can easily fool your logging by having a second thread change the
> filename between the time your code copies it and the time the original
> syscall copies it again. The chances of getting the timing right are 50%
> at least (been there done that ;)
You are right, if operations occur in the sequence above.
What are your thoughts on the following two?
A)
long wrapper_call(args) {
        yield(random(threshold)); /* yield is a sleep */
        rv = orig_syscall(args...);
        copy_from_user(audit_data, args...);
        audit_log(audit_data);
        return rv;
}
That becomes a bit more difficult to time, because the attacker doesn't
know when the system call will actually perform its own copy_from_user vs.
return vs. the audit's copy_from_user. If the correct upper threshold for
each system call is picked, timing attacks can be made a lot harder.
B)
long wrapper_call(args) {
        yield(random(threshold));
        copy_from_user(audit_data1, args...);
        rv = orig_syscall(args...);
        yield(random(threshold));
        copy_from_user(audit_data2, args...);
        audit_log(audit_data1);
        audit_log(audit_data2);
        return rv;
}
Suspicious data analysis is then performed by a user-land tool to
further ensure integrity.
I would just like to say that the above does not pretend to be speed
happy; still, for practical purposes you can assume that the yield is a
lot shorter (a couple of orders of magnitude) than the duration of the
system call.
Cheers,
Ahmed.
On Sat, May 10, 2003 at 01:51:07PM -0400, Ahmed Masud wrote:
> What are your thoughts on the following two
> A)
> long wrapper_call(args) {
> yield(random(threshold)); /* yeild is a sleep */
> rv = orig_syscall(args...);
> copy_from_user(audit_data,args...);
> audit_log(audit_data);
> return rv;
> }
>
> That becomes a bit more difficult to time, because the attacker doesn't
> know when the system call will actually perform its own copy_from_user vs.
> return vs. the audit's copy_from_user, If the correct upper threshold for
> each system call is picked timing attacks can be made alot harder.
no it's not. just make sure the page with the filename is paged
out, and use mincore to poll for the pagefault ;)
And with unlink you can observe the results as well (think dnotify) so you
can intervene before the second audit copy
>
> B)
> long wrapper_call(args) {
> yield(random(threshold));
> copy_from_user(audit_data1,args...);
> rv = orig_syscall(args...);
> yield(random(threshold));
> copy_from_user(audit_data2,args...);
> audit_log(audit_data1);
> audit_log(audit_data2);
> return rv;
> }
>
> Suspicious data analysis is then performed by a user-land tool to
> further ensure integrity.
still not secure, now you copy 3 times so all I need to do is flip
data twice ;)
On Sat, 10 May 2003, Arjan van de Ven wrote:
> On Sat, May 10, 2003 at 01:51:07PM -0400, Ahmed Masud wrote:
> > That becomes a bit more difficult to time, because the attacker doesn't
> > know when the system call will actually perform its own copy_from_user vs.
> > return vs. the audit's copy_from_user, If the correct upper threshold for
> > each system call is picked timing attacks can be made alot harder.
>
> no it's not. just make sure the page with the filename is paged
> out, and use mincore to poll for the pagefault ;)
> And with unlink you can observe the results as well (think dnotify) so you
> can intervene before the second audit copy
>
> still not secure, now you copy 3 times so all I need to do is flip
> data twice ;)
>
Very interesting indeed, good thing I am not auditing parameters ;)
hehe. The only thing I was tracking was whether the particular system call
was allowed or denied to the user, due to an ACL, and because that doesn't
rely on the user-land data I am not particularly affected.
I will set up some parametric auditing on pointer data and attack the
environment using your technique above to see if something can be done
about it.
(heheh there goes the afternoon!)
Cheers,
Ahmed.
Hello again Arjan,
Further to the previous reply, a user can simply be denied mincore :) so
no valid information would come to the attacker. I know that there is no
such problem within the allow/deny code so the attack threat can be
minimized further.
Ahmed.
Arjan van de Ven wrote:
> The only solution for this is to check/audit/log things after the ONE
> copy. Eg not by overriding the syscall but inside the syscall.
Well, not exactly the _only_ one. Inferior alternatives include
- tracking all user memory accesses, and protecting those pages
against modifications by other processes (but this adds a
deadlock risk)
- like above, but copy all such data to/from a user space area
that is then protected
- like above, but use kernel space instead, then use
set_fs(KERNEL_DS)
Which probably just proves that there are always more painful ways
to do things :-)
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
Ahmed Masud wrote:
> yield(random(threshold)); /* yeild is a sleep */
[...]
> That becomes a bit more difficult to time, because the attacker doesn't
> know when the system call will actually perform its own copy_from_user vs.
So the probability of getting through in one try is about tH/(tR+tH),
where tR is the average random delay, and tH is the time between the
check and the actual access.
If you keep on trying until you get through, you'll succeed on average
after tR^2/tH+tR.
If you make tR = 1 s (that's pretty long, e.g. if you do this to
unlink(2), a rm -rf of the kernel source tree would take about four
hours) and assume that tH is only one microsecond, the race condition
can still be exploited within typically less than one fortnight.
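The arithmetic can be checked with a few lines of C. This is a sketch that takes the per-try chance to be the window tH over the total tR+tH, which is the reading that reproduces the tR^2/tH+tR figure; the function names are made up for illustration.

```c
#include <assert.h>

/* Chance that a single attempt lands inside the window of width tH. */
static double hit_probability(double tR, double tH)
{
    assert(tR > 0.0 && tH > 0.0);
    return tH / (tR + tH);
}

/* Mean time to success: about 1/p attempts, each costing roughly tR.
 * Algebraically this is tR*tR/tH + tR. */
static double expected_seconds(double tR, double tH)
{
    return tR / hit_probability(tR, tH);
}

/* Same figure expressed in days. */
static double expected_days(double tR, double tH)
{
    return expected_seconds(tR, tH) / 86400.0;
}
```

For tR = 1 s and tH = 1 microsecond this gives about 1,000,001 seconds, or roughly 11.6 days, which is indeed under a fortnight.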
Since the system would be idle most of the time, such a brute-force
attack could easily go unnoticed, even if somebody cares to monitor
the system often enough.
Sounds like voodoo security to me.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
On 10 May 2003, Arjan van de Ven wrote:
> I'm pretty sure that auditing by your module can easily be avoided.
>
> examle: pseudocode for the unlink syscall
>
> long your_wrapped_syscall(char *userfilename)
> {
> char kernelpointer[something];
> copy_from_user(kernelpointer, usefilename, ...);
> audit_log(kernelpointer);
> return original_syscall(userfilename);
> }
>
> now.... the original syscall does ANOTHER copy_from_user().
> Eg I can easily fool your logging by having a second thread change the
> filename between the time your code copies it and the time the original
> syscall copies it again. The chances of getting the timing right are 50%
> at least (been there done that ;)
>
> The only solution for this is to check/audit/log things after the ONE
> copy. Eg not by overriding the syscall but inside the syscall.
been there done that, too :)
However, there is a solution.
Masud, your delay-based solutions won't work because attack code can
just keep running in a loop until it gets the timing right. Once is
enough. Even if it could work, it would have impact on the whole system.
Afaik, you can't really yield the CPU for very short time slices so you'll
have to busy-loop instead, and it's not acceptable. I believe the below
solution is the right one. Arjan, please correct me if I'm wrong.
The solution is to have only ONE REAL copy, done by the wrapper. The
original syscall will copy from a kernel ptr, unknowingly. Consider
the following modified pseudo-code:
long your_wrapped_syscall(char *userfilename)
{
        char kernelpointer[something];
        mm_segment_t old_fs;
        long ret;
        copy_from_user(kernelpointer, userfilename, ...);
        audit_log(kernelpointer);
        old_fs = get_fs();
        set_fs(KERNEL_DS);
        ret = original_syscall(kernelpointer);
        set_fs(old_fs);
        return ret;
}
userfilename is only copied once. original_syscall just copies
kernelpointer again, to another kernel pointer. No race.
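A user-space simulation of the single-copy wrapper, for comparison with the racy version. It is only a sketch: there is no address limit in user space, so the set_fs(KERNEL_DS) step has no counterpart here and appears only as a comment. The buffer and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NAME_LEN 64

static char user_memory[NAME_LEN];  /* simulated user-space buffer */
static char audited[NAME_LEN];      /* name the wrapper logged */
static char acted_on[NAME_LEN];     /* name the "syscall" really used */

/* Simulated copy_from_user(): bounded, always NUL-terminated. */
static void fetch_from_user(char *dst, const char *src)
{
    assert(dst != src);
    snprintf(dst, NAME_LEN, "%s", src);
}

/* The original syscall copies from whatever pointer it is handed. */
static long original_unlink(const char *filename)
{
    fetch_from_user(acted_on, filename);
    return 0;
}

/* Stands in for a second thread that rewrites the user buffer. */
typedef void (*racer_fn)(void);

static void flip_name(void)
{
    snprintf(user_memory, NAME_LEN, "%s", "/etc/passwd");
}

static long wrapped_unlink_single_copy(const char *userfilename, racer_fn racer)
{
    char kernel_copy[NAME_LEN];

    fetch_from_user(kernel_copy, userfilename);  /* the ONE real user fetch */
    snprintf(audited, NAME_LEN, "%s", kernel_copy);
    if (racer)
        racer();                                 /* too late: user memory is not read again */
    /* In the kernel this call would be bracketed by set_fs(KERNEL_DS)
     * and set_fs(old_fs) so the inner copy accepts a kernel pointer. */
    return original_unlink(kernel_copy);
}

/* Returns 1 when the log and the action agree despite the flip. */
static int demo_single_copy(void)
{
    snprintf(user_memory, NAME_LEN, "%s", "/tmp/harmless");
    wrapped_unlink_single_copy(user_memory, flip_name);
    return strcmp(audited, acted_on) == 0;
}
```

Here the flip still happens, but both the audit record and the simulated syscall see /tmp/harmless, because the second copy reads the wrapper's private buffer.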
Now, don't get me wrong - I still think intercepting the syscall is not
the right thing to do in this case, since LSM provides hooks in better
locations. However, Masud has a working module that works this way, and
rewriting it for LSM is probably a headache. No reason for him to rewrite
his module if it can be fixed as I suggested above.
Still, in my opinion, this symbol should remain exported, if only for test
modules like the one suggested by Muli. I use them all the time.
Another thing to be considered is kernel newbies. Most people, when
getting started in the kernel, do it by writing a module. One of the best
ways to understand certain parts of the kernel is to intercept some
syscalls and play with them. I gave some courses about it in the past,
and syscall intercepting was always a good exercise for newbies. Face it,
most people can learn better by playing than by reading the code.
Removing this symbol will not really get in the way for the bad guys
because it'll always be possible to find and intercept it anyway (see my
previous post in this thread), but it'll increase the learning curve for
kernel newbies. Do we really want that ?
Yoav Weiss
Arjan van de Ven wrote:
> I'm pretty sure that auditing by your module can easily be avoided.
>
> examle: pseudocode for the unlink syscall
>
> long your_wrapped_syscall(char *userfilename)
> {
> char kernelpointer[something];
> copy_from_user(kernelpointer, usefilename, ...);
> audit_log(kernelpointer);
> return original_syscall(userfilename);
> }
Great, now how do you plan to get that code loaded into memory on
my configuration? (no modules, /dev/kmem unwriteable) (or ipd driver
loaded on NT/2K)
> The only solution for this is to check/audit/log things after the ONE
> copy. Eg not by overriding the syscall but inside the syscall.
If I can alter kernel memory I can patch out your auditing code.
It's just more difficult if you try to hide it inside the syscall. :)
On Sat, May 10, 2003 at 10:18:34PM +0300, Yoav Weiss wrote:
> Masud, your delay-based solutions won't work because an attack code can
> just keep running in a loop until it gets the timing right. Once is
> enough. Even if it could work, it would have impact on the whole system.
> Afaik, you can't really yield the CPU for very short time slices so you'll
> have to busy-loop instead, and its not acceptable. I believe the below
> solution is the right one. Arjan, please correct me if I'm wrong.
>
> The solution is to have only ONE REAL copy, done by the wrapper. The
> original syscall will copy from a kernel ptr, unknowingly. Consider
> the following modified pseudo-code:
>
> long your_wrapped_syscall(char *userfilename)
> {
> char kernelpointer[something];
> copy_from_user(kernelpointer, usefilename, ...);
> audit_log(kernelpointer);
> old_fs = get_fs();
> set_fs(KERNEL_DS);
> ret = original_syscall(kernelpointer);
> set_fs(old_fs);
> return ret;
> }
>
> userfilename is only copied once. original_syscall just copies
> kernelpointer again, to another kernel pointer. No race.
This approach, while it would solve this particular problem, has a
grave flaw. Consider the case where the first copy in the
original_syscall is to copy a user space structure, which has embedded
user space pointers... The set_fs() will cause future
copy_from_user/copy_to_user calls in original_syscall() to succeed
even if the user-supplied pointer is in kernel space.
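The flaw can be modelled with a toy address limit. The sketch below is an assumption-laden simulation: the limits, the request layout, and the names (access_ok_sim, call_under_kernel_ds) are invented for illustration and only mimic how an access check depends on the current segment limit.

```c
#include <assert.h>
#include <errno.h>

#define USER_DS_LIMIT   0xC0000000UL  /* toy user/kernel split */
#define KERNEL_DS_LIMIT 0xFFFFFFFFUL  /* toy "everything is allowed" limit */

/* Models the per-task address limit that set_fs() changes. */
static unsigned long addr_limit = USER_DS_LIMIT;

/* Simulated range check: the whole range must sit below addr_limit. */
static int access_ok_sim(unsigned long addr, unsigned long len)
{
    return len <= addr_limit && addr <= addr_limit - len;
}

/* A syscall argument that carries an embedded pointer. */
struct request {
    unsigned long data_ptr;
    unsigned long data_len;
};

/* Toy syscall body: validates the embedded pointer before using it. */
static long original_syscall_sim(const struct request *req)
{
    if (!access_ok_sim(req->data_ptr, req->data_len))
        return -EFAULT;
    return 0;  /* a real syscall would now read from data_ptr */
}

/* Direct call from user context: the normal limit applies. */
static long call_plain(const struct request *req)
{
    return original_syscall_sim(req);
}

/* Wrapper in the set_fs(KERNEL_DS) style: the limit is lifted for the
 * whole inner call, including checks on embedded pointers. */
static long call_under_kernel_ds(const struct request *req)
{
    unsigned long old = addr_limit;
    long ret;

    assert(addr_limit == USER_DS_LIMIT);  /* entered from user context */
    addr_limit = KERNEL_DS_LIMIT;
    ret = original_syscall_sim(req);
    addr_limit = old;
    return ret;
}
```

A request whose data_ptr is 0xC0100000 is rejected with -EFAULT by call_plain() but accepted by call_under_kernel_ds(), which is exactly the hole: the inner check no longer distinguishes user addresses from kernel ones.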
> Removing this symbol will not really get in the way for the bad guys
> because it'll always be possible to find and intercept it anyway (see my
> previous post in this thread), but it'll increase the learning curve for
> kernel newbies. Do we really want that ?
Hear hear.
--
Muli Ben-Yehuda
http://www.mulix.org
> This approach, while it would solve this particular problem, has a
> grave flow. Consider the case where the first copy in the
> original_syscall is to copy a user space structure, which has embedded
> user space pointers... The set_fs() will cause future
> copy_from_user/copy_to_user in original_syscall() calls to succeed
> even if the user supplied pointer is in kernel space.
You're right, which is why I wouldn't offer it as a general mechanism. I
was merely offering a method to solve the current issue and fix Masud's
problem. This solution is good in many cases but dangerous in others. It
can be used as long as you inspect the original syscall to make sure its
param is just a simple string/int. True in most cases though.
Note that this method is used, often by common LSM modules, for opening
and handling files from kernel space. (think persistent labeling on a
generic filesystem).
>
> > Removing this symbol will not really get in the way for the bad guys
> > because it'll always be possible to find and intercept it anyway (see my
> > previous post in this thread), but it'll increase the learning curve for
> > kernel newbies. Do we really want that ?
>
> Hear hear.
;-)
Yoav Weiss
Yoav Weiss wrote:
>The solution is to have only ONE REAL copy, done by the wrapper. The
>original syscall will copy from a kernel ptr, unknowingly. Consider
>the following modified pseudo-code:
Let's see you do sys_execve()... sys_socketcall() and sys_ioctl() are
fun, too. (And, I worry about doubly-indirected pointers, for instance.)
It's probably do-able, but you'd better stock up on the Advil in advance:
we're in major headache country, folks.
>Now, don't get me wrong - I still think intercepting the syscall is not
>the right thing to do in this case, since LSM provides hooks in better
>locations.
Right. LSM seems like a better answer for security applications.
> Let's see you do sys_execve()... sys_socketcall() and sys_ioctl() are
> fun, too. (And, I worry about doubly-indirected pointers, for instance.)
> It's probably do-able, but you'd better stock up on the Advil in advance:
> we're in major headache country, folks.
I agree. I could post my 2.0.x code for doing this, but it would be
counter-productive since security apps should use LSM for this very
reason. I was merely suggesting a way for Masud to solve his specific
problem without rewriting his module.
sys_execve() and sys_socketcall() are actually not that hard. sys_ioctl()
is next to impossible because you never know what the structs look like.
Luckily, most security apps don't require ioctl-screening.
Most security applications should use LSM, but that's not a good reason to
remove sys_call_table, since it's still useful for many non-security
purposes.
Yoav Weiss
On Sat, 10 May 2003, Yoav Weiss wrote:
> You're right, which is why I wouldn't offer it as a general mechanism. I
> was merely offering a method to solve the current issue and fix Masud's
> problem. This solution is good in many cases but dangerous in others. It
> can be used as long as you inspect the original syscall to make sure its
> param is just a simple string/int. True in most cases though.
Like any security solution, if the vulnerabilities are identifiable and
outlinable then the risk can be approximated. This is what is really
needed in real-life situations. Realistically, we simply want an
assessment of risk to clearly define what is and isn't an acceptable
level of protection.
I will attempt to actually build attack scenarios based on these comments
and post results to anyone who is interested.
Just as an aside, my solution isn't actually performing its control
through user-space params requiring copy_from_user.
Cheers,
Ahmed.
[Arjan van de Ven]
> On Sat, 2003-05-10 at 16:38, Ahmed Masud wrote:
>> Case in point, I wrote a security module for Linux that overrides _all_
>> 237 systemcalls to audit and control the use of the system calls on a per
>> uid basis. (i.e. if the user was actually allowed to make the system call
>> or not) and return -EPERM or jump to system call proper.
> I'm pretty sure that auditing by your module can easily be avoided.
> examle: pseudocode for the unlink syscall
> long your_wrapped_syscall(char *userfilename)
> {
> char kernelpointer[something];
> copy_from_user(kernelpointer, usefilename, ...);
> audit_log(kernelpointer);
> return original_syscall(userfilename);
> }
> now.... the original syscall does ANOTHER copy_from_user().
> Eg I can easily fool your logging by having a second thread change the
> filename between the time your code copies it and the time the original
> syscall copies it again. The chances of getting the timing right are 50%
> at least (been there done that ;)
> The only solution for this is to check/audit/log things after the ONE
> copy. Eg not by overriding the syscall but inside the syscall.
just replace
return original_syscall(userfilename);
with
return original_syscall(kernelpointer);
--
- Terje
[email protected]
On Sun, 11 May 2003, Terje Malmedal wrote:
>
>
> just replace
> return original_syscall(userfilename);
> with
> return original_syscall(kernelpointer);
That's about right :) need to switch to kernel space with set_fs though
(see other messages in the thread). - A
arjanv wrote:
> examle: pseudocode for the unlink syscall
>
> long your_wrapped_syscall(char *userfilename)
> {
> char kernelpointer[something];
> copy_from_user(kernelpointer, usefilename, ...);
> audit_log(kernelpointer);
> return original_syscall(userfilename);
> }
That code has another hole that nobody else has mentioned
yet: I can fill the audit log by trying to delete nonexistent files,
and if accused of trying to mount a DOS attack on the audit trail
I can reasonably claim that it was all an accident...
How about:
long wrapped_unlink(char *userfilename)
{
        char name1[len], name2[len];
        long ret;
        copy_from_user(name1, userfilename, ...);
        ret = original_unlink(userfilename);
        copy_from_user(name2, userfilename, ...);
        if (strncmp(name1, name2, len))
                audit_log(name1, name2, UNLINK_NAME_CHANGED);
        if (ret == 0 && AUDIT_SUCCESS)
                audit_log(name1, name2, UNLINK_SUCCEEDED);
        if (ret == -EPERM && AUDIT_FAILURE)
                audit_log(name1, name2, UNLINK_FAILED);
        return ret;
}
Chuck Ebbert wrote:
> How about:
Here's a good rule of thumb:
In security, if you need to ask, it's probably broken.
The way to do security design is NOT by throwing a few kludgy hacks
against the wall and seeing if any of them stick. That's a recipe for
security holes. The way you get security right is by building a clean,
simple design that you can convincingly argue to be correct. In security,
you don't write code until you've got your assurance argument down pat.
If you can't convince yourself it's correct, it's probably not.
> copy_from_user(name1, userfilename, ...);
> ret = original_unlink(userfilename);
> copy_from_user(name2, userfilename, ...);
Insecure. Simply exploit the race condition twice: once just before
the original_unlink() to change the string to a dangerous filename,
then a second time just after the original_unlink() to change it back.
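The double flip is easy to demonstrate in a user-space simulation. This is a sketch with hypothetical names: user_memory plays the caller's buffer, and the two hooks stand in for an attacker thread that wins the race once before and once after the simulated unlink.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NAME_LEN 64

static char user_memory[NAME_LEN];  /* simulated user-space buffer */
static char acted_on[NAME_LEN];     /* name the "syscall" really used */

/* Simulated copy_from_user(): bounded, always NUL-terminated. */
static void fetch_from_user(char *dst, const char *src)
{
    assert(dst != src);
    snprintf(dst, NAME_LEN, "%s", src);
}

/* The original syscall fetches the name from user memory itself. */
static long original_unlink(const char *userfilename)
{
    fetch_from_user(acted_on, userfilename);
    return 0;
}

/* Hooks standing in for the attacker's second thread. */
typedef void (*racer_fn)(void);

static void flip_to_target(void)
{
    snprintf(user_memory, NAME_LEN, "%s", "/etc/passwd");
}

static void flip_back(void)
{
    snprintf(user_memory, NAME_LEN, "%s", "/tmp/harmless");
}

/* Before/after comparison wrapper.
 * Returns 1 if it noticed that the name changed, 0 otherwise. */
static int compare_wrapper_detects(racer_fn before, racer_fn after)
{
    char name1[NAME_LEN], name2[NAME_LEN];

    snprintf(user_memory, NAME_LEN, "%s", "/tmp/harmless");
    fetch_from_user(name1, user_memory);  /* copy taken before the call */
    if (before)
        before();                         /* first flip */
    original_unlink(user_memory);
    if (after)
        after();                          /* second flip restores the name */
    fetch_from_user(name2, user_memory);  /* copy taken after the call */
    return strcmp(name1, name2) != 0;
}
```

A single flip is caught (the function returns 1), but with both flips it returns 0 even though the simulated unlink acted on /etc/passwd.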
On Sun, 11 May 2003, Chuck Ebbert wrote:
> That code has another hole that nobody else has mentioned
> yet: I can fill the audit log by trying to delete nonexistent files,
> and if accused of trying to mount a DOS attack on the audit trail
> I can reasonably claim that it was all an accident...
No one specified what audit_log does in this case. Usually, in such
modules, the audit function doesn't just log everything. It can, for
example, rate-limit the logging and just spit a message about the user
DoSing the log system. If the system is paranoid enough to be fail-closed
(i.e. fears that the user is hitting the rate-limit of the logger, hoping
to cover his real acts), it can always kill his task, kill all his
processes, lock his account, call the authorities, etc. It's up to the
system to decide what audit_log does, just like in any other auditing
system.
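One possible shape for such a rate limit is a fixed window with a burst allowance. The sketch below is illustrative only: the struct, its fields, and audit_allow() are hypothetical names, and the caller supplies the clock so the behaviour is deterministic.

```c
#include <assert.h>

/* Fixed-window limiter: at most `burst` records per `interval` seconds. */
struct audit_limiter {
    long window_start;     /* start time of the current window */
    long interval;         /* window length in seconds */
    unsigned burst;        /* records allowed per window */
    unsigned used;         /* records written in this window */
    unsigned suppressed;   /* records dropped so far, for a summary line */
};

/* Returns 1 if the record should be written, 0 if it is suppressed. */
static int audit_allow(struct audit_limiter *rl, long now)
{
    assert(rl->interval > 0);
    if (now - rl->window_start >= rl->interval) {
        rl->window_start = now;   /* open a fresh window */
        rl->used = 0;
    }
    if (rl->used < rl->burst) {
        rl->used++;
        return 1;
    }
    rl->suppressed++;             /* a policy could fail closed at this point */
    return 0;
}
```

With a burst of 3 per 5 seconds, the fourth record inside one window is dropped and counted. What to do once the counter grows (summarise, alert, or fail closed) is the policy decision described above.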
>
> How about:
>
> long wrapped_unlink(char *userfilename)
> {
> char name1[len], name2[len];
> long ret;
>
> copy_from_user(name1, userfilename, ...);
> ret = original_unlink(userfilename);
> copy_from_user(name2, userfilename, ...);
>
> if (strncmp(name1, name2, len))
> audit_log(name1, name2, UNLINK_NAME_CHANGED);
Still subject to a timing attack. The usermode code can change it and
change it back as soon as the file has been unlinked. If the system is
under heavy load (generated by the attacker), the attacking process is
reniced to 20, and the monitoring part of it has higher priority and keeps
stat(2)ing the file, a thread in the attack code may actually be able to
change the filename back in time for the second check.
The only way to avoid these races is to have just one copy, by either
using set_fs (see my previous post in this thread) or by hooking inside
the syscall (as LSM does).
Yoav Weiss
Yoav Weiss wrote:
> No one specified what audit_log does in this case. Usually, in such
> modules, the audit function doesn't just log everything. It can, for
> example, rate-limit the logging and just spit a message about the user
> DoSing the log system.
Not on the systems I've seen. Max log file size is 4GB and the
logs are on their own partition. If the file fills up the system
crashes immediately and only administrators can log in after reboot
until the logs are archived.
> The usermode code can change it and
> change it back as soon as the file has been unlinked. If the system is
> under heavy load (generated by the attacker), the attacking process is
> reniced to 20, and the monitoring part of it has higher priority and keeps
> stat(2)ing the file, a thread in the attack code may actually be able to
> change the filename back in time for the second check.
Yes, but now any unsuccessful attempts to change the name will be
logged, where before there was basically no risk for the attacker
trying over and over until success. Even a single failure could
raise an alert on the target machine, something a cracker definitely
does not want to happen.
> Not on the systems I've seen. Max log file size is 4GB and the
> logs are on their own partition. If the file fills up the system
> crashes immediately and only administrators can log in after reboot
> until the logs are archived.
Why would anyone design a system like that ?!
The logging of every security system is prone to flooding. You may have
noticed that your syslog sometimes spits "Last message repeated N times"
so it won't repeat itself. A system that doesn't deal with this issue in
any way can't be secure. There are a lot of methods to deal with it but I
think we're going seriously off-topic here so if anyone wishes to continue
discussing this specific logging problem, I suggest we switch to non-lkml
mode.
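As a rough illustration of the "repeated N times" behaviour mentioned above, the following sketch collapses runs of identical consecutive log lines. It is not syslogd's implementation; the function name collapse_repeats and the exact wording of the summary line are assumptions made for the example.

```shell
# Collapse runs of identical consecutive lines into one line plus a
# summary. Reads log lines on stdin, writes the collapsed stream on stdout.
collapse_repeats() {
    awk '
        # Emit the summary for the run that just ended, if it repeated.
        function report() {
            if (count > 1)
                printf "last message repeated %d times\n", count - 1
        }
        NR > 1 && $0 == prev { count++; next }      # same as previous line
        { report(); print; prev = $0; count = 1 }   # a new message starts
        END { report() }
    '
}
```

This only bounds the damage from a flood of identical messages. An attacker who varies each message defeats it, which is where rate limits of the kind mentioned earlier in the thread come in.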
> Yes, but now any unsuccessful attempts to change the name will be
> logged, where before there was basically no risk for the attacker
> trying over and over until success. Even a single failure could
> raise an alert on the target machine, something a cracker definitely
> does not want to happen.
>
Not necessarily - it depends on the case. If the file being unlinked is
the logfile itself, and it's checked by a cron job every once in a while
(a common situation), an attacker won't mind making a lot of noise into
the soon-to-be-a-free-inode logfile. After-the-fact security systems are
usually not suitable for server protection, and the system you suggest,
being statistical, is after-the-fact by definition.
Yoav Weiss
On Sun, 11 May 2003, Chuck Ebbert wrote:
> Yoav Weiss wrote:
>
> > No one specified what audit_log does in this case. Usually, in such
> > modules, the audit function doesn't just log everything. It can, for
> > example, rate-limit the logging and just spit a message about the user
> > DoSing the log system.
>
> Not on the systems I've seen. Max log file size is 4GB and the
> logs are on their own partition. If the file fills up the system
> crashes immediately and only administrators can log in after reboot
> until the logs are archived.
In a production system various things happen (no particular order):
-- The audit log functionality allows synchronous or asynchronous logging,
as driven by security policy. This would mostly be asynch (think how klog
works)
-- The audit log probably does not go to a local disk but to a log server.
-- The system allows for more finely controlled auditing and gives the
security admin the ability to observe a particular system call for a
particular user, under particular circumstances. (Audit a failed system
call or a successful system call). In case of unlink, it's probably more
useful to observe a system call that does a ret=0 than ret=-Exxx.
-- Really what we are interested in auditing are untrusted users, who
should be given a limited access to begin with.
In addition, because it provides allow/deny control ability on a per-uid
per-syscall basis, we can simply deny unlink to anything which is rogue
and not audit its success or failure. :^) . It WILL fail with an -EPERM,
so why audit it permanently? (One can audit it for a brief period to
create an evidence trail and then turn the audit tap off).
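The per-uid, per-syscall allow/deny/audit idea can be sketched as a small lookup table. This is a user-space toy under stated assumptions, not the module being discussed: the POLICY format, the uids, and the function names policy_action and handle_syscall are all invented for illustration.

```shell
# Hypothetical policy table: one "uid syscall action" rule per line.
# Actions: deny (refuse, do not log) or audit (allow, but record the result).
POLICY='
1001 unlink deny
1002 unlink audit
'

# Print the action for a uid/syscall pair; anything unlisted is allowed.
policy_action() {
    printf '%s\n' "$POLICY" | awk -v uid="$1" -v sc="$2" '
        $1 == uid && $2 == sc { print $3; found = 1; exit }
        END { if (!found) print "allow" }
    '
}

# Decide one call. $3 is the result the real syscall would have returned.
# The value handed back to the caller goes to stdout, audit records to stderr.
handle_syscall() {
    case $(policy_action "$1" "$2") in
        deny)  printf '%s\n' "-EPERM" ;;            # refused, nothing logged
        audit) printf '%s\n' "audit: uid=$1 $2 ret=$3" >&2
               printf '%s\n' "$3" ;;
        *)     printf '%s\n' "$3" ;;
    esac
}
```

Turning the "audit tap" on for a short period and off again then amounts to switching a rule between audit and deny.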
The system I am talking about already allows for all of the above. The
only problem was the issue of defeating accurate auditing due to potential multiple
copy_from_user calls ... Which has been addressed rather nicely thanks to
you all (my thanks to Arjan for pointing out the issue and to Yoav and
Terje for giving solution hints).
Cheers,
Ahmed.
On Sun, 2003-05-11 at 23:32, Yoav Weiss wrote:
> > Not on the systems I've seen. Max log file size is 4GB and the
> > logs are on their own partition. If the file fills up the system
> > crashes immediately and only administrators can log in after reboot
> > until the logs are archived.
>
> Why would anyone design a system like that ?!
> The logging of every security system is prone to flooding. You may have
> noticed that your syslog sometimes spits "Last message repeated N times"
> so it won't repeat itself. A system that doesn't deal with this issue in
> any way can't be secure. There are a lot of methods to deal with it but I
Security to some people means that nothing happens unrecorded. Most high
security environments treat DoS attacks as the least interesting. You
knocked down my military server - who cares. You stole my list of secret
command centres - I'm deeply upset.
Security requirements are heavily dependent on role and people sometimes
forget that. Being down is bad news for an ecommerce site but in many
other situations it's infinitely preferable to most other outcomes.
> Security requirements are heavily dependant on role and people sometimes
> forget that. Being down is bad news for an ecommerce site but in many
> other situations its infinite preferably to most other situations
This reminds me of a funny story. I was at a meeting to confirm that a
program met some requirements for an agency in Maryland with a three-letter
name. After they finally agreed that I had to *see* the requirements before
I could assert that the program met them, we came to a requirement that
said, roughly, that it must be possible to immediately stop the system from
processing any information if they lost, or suspected that they had lost,
control over it.
I pointed out to them that any software mechanism I devised for shutting
the system down would require that they had control over the system in order
to invoke the mechanism.
They thought about that for a moment and were about to find that the system
did not meet the requirements. I pointed out that anyone could pull the plug
or network cable if needed or shut the system down at the switch and that
this could be accomplished even if they lost control over the system and
would certainly stop it from sending any information. They then agreed that
the system met that requirement.
DS
> Security to some people means that nothing happens unrecorded. Most high
> security environments treat DoS attacks as the least interesting. You
> knocked down my military server - who cares. You stole my list of secret
> command centres - Im deeply upset.
You're right, but panic() is seldom the right solution, even in such a case.
In most cases, killing the process generating the noise (or all the
processes of its uid) is enough and keeps the system more robust. Even if
the system runs out of log-space and fails to stop the noise-maker, it can
usually take down network interfaces and alert the admin. Still cleaner
than crashing. That's what Cray Unicos does when low on certain resources.
It stops the attack while minimizing recovery time. (Rebooting these
beasts can be painful). The system stops network traffic while it still
has some space left, and panics only if it still runs out of space
(unlikely). No audit trail lost and it's more graceful.
I'm not saying there is no possible situation where crashing
is required, but it's rare.
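The escalation order argued for above (kill the noise-maker, then cut the network, and panic only as a last resort) might look like this as a decision function. The thresholds and the name audit_space_action are made up for the sketch; it is not a description of what Unicos actually does.

```shell
# Choose a response from the percentage of audit-log space still free.
# The thresholds are invented for illustration; real ones are site policy.
audit_space_action() {
    free_pct=$1
    if [ "$free_pct" -gt 20 ]; then
        echo "log"                 # plenty of room, keep recording
    elif [ "$free_pct" -gt 10 ]; then
        echo "kill-noisy-uid"      # stop the process (or uid) flooding the log
    elif [ "$free_pct" -gt 0 ]; then
        echo "network-down"        # cut the attack path and alert the admin
    else
        echo "panic"               # last resort: nothing may go unrecorded
    fi
}
```

The point of the ordering is that each step preserves the audit trail while costing less recovery time than the next one.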
>
> Security requirements are heavily dependant on role and people sometimes
> forget that. Being down is bad news for an ecommerce site but in many
> other situations its infinite preferably to most other situations
>
Right. That's why such systems usually come clustered. The panic()ed
system stays down until checked, but another system takes its role.
Still, military systems have to "think twice" before resorting to crash.
Remember that they're usually attacked just as war commences and all hell
breaks loose. If your defense systems are too easy to DoS, you can't rely
on them and their secrecy won't help you. In such cases an unreliable
system can be worse than no system at all. If the Allies could have DoSed
the Enigma and forced the Nazis to use their fallback system (which was
kinda lame), they wouldn't have had to spend resources on cracking the
Enigma.
You might want to read this paper on application sandboxing:
http://www.stanford.edu/~talg/papers/traps/abstract.html
It covers a lot of the wrong ways to do things.
--
"MONO - Monochrome Emulation
This field is used to store your favorite bit."
--FreeVGA Attribute Controller Reference
On Friday 09 May 2003 09:37, Ragnar Kjørstad wrote:
> On Fri, May 09, 2003 at 08:53:04AM -0500, Jesse Pollard wrote:
> > On Thursday 08 May 2003 14:43, Chuck Ebbert wrote:
> > > > Have you tried catching the display IO ???
> > >
> > > Not in a million years -- display drivers work by pure magic AFAIC.
> > >
> > > > HSM has existed on UNIX based machines for a long time.
> > >
> > > Show me three HSM implementations for Linux and I'll show you three
> > > different mechanisms. :)
> >
> > Actually... I think they all use the same one (Even the Solaris/IRIX/Cray
> > ones do that). All of them provide a filesystem interface via VFS. The
> > Linux ones were implemented via the "userfs" core or NFS.
>
> I'm not sure what you mean by "via" VFS, but most HSM implementations on
> linux require extra interfaces and special support in the filesystem (XDSM
> or proprietary).
All of the HSM systems I've used had the XDSM handled outside the kernel.
The only thing actually IN the kernel was the VFS module to intercept the
VFS calls (like userfs did) and then pass them to an external daemon to
retrieve the data of migrated files. In two cases, the rest of the
filesystem was implemented within the VFS module. One case I've read about,
(datatree??? something like that) used an NFS interface to allow it to pass
the requests on to a non-HSM filesystem when the data was put on disk.
> The only exceptions I know are openXDSM which was intended to be a
> generic interface in the VFS-layer (but we never got time to implement
> it) and an implementation based on a stackable filesystem.
>
> I don't know if the latter is actually in use anywhere, or if it is
> abandoned.
I don't know either, but all XDSM was to do was support the user mode
infrastructure behind the VFS module.
On Mon, May 12, 2003 at 09:19:25AM -0500, Jesse Pollard wrote:
> All of the HSM systems I've used had the XDSM handled outside the kernel.
You obviously didn't look at the XFS dmapi implementation :)
Alan Cox wrote:
> You knocked down my military server - who cares. You stole my
> list of secret command centres - Im deeply upset.
Or even worse: you stole my list of command centers and I don't
even know it happened. At least with an audit trail you have a
chance of knowing your secrets have been compromised.
Banks also seem to like the idea of a server that will crash
rather than transfer funds without leaving a trail.
So how long till Linux gets decent auditing? Is the SNARE code going
to get into the kernel?
...and on a related topic, if someone wrote a patch to optionally clear
the swap area at swapoff would it ever be accepted?
On Mon, 12 May 2003 18:40:17 +0200, you wrote:
> ...and on a related topic, if someone wrote a patch to optionally clear
> the swap area at swapoff would it ever be accepted?
What about doing it from userspace? :)
--
Ciao,
Pascal
On Mon, 2003-05-12 at 17:32, Chuck Ebbert wrote:
> So how long till Linux gets decent auditing? Is the SNARE code going
> to get into the kernel?
It's on the todo list. I had some discussion with the snare guys and Al
Viro educated them on some of the name logging issues.
> ...and on a related topic, if someone wrote a patch to optionally clear
> the swap area at swapoff would it ever be accepted?
man dd ?
Although I'm not sure what good it would do you; you want crypted swap.
Alan Cox wrote:
>> ...and on a related topic, if someone wrote a patch to optionally clear
>> the swap area at swapoff would it ever be accepted?
>
> man dd ?
"That can be done manually" does not get you the check mark in
the list of features. Management wants idiot-resistant security.
On Mon, 2003-05-12 at 22:51, Chuck Ebbert wrote:
> Alan Cox wrote:
>
> >> ...and on a related topic, if someone wrote a patch to optionally clear
> >> the swap area at swapoff would it ever be accepted?
> >
> > man dd ?
>
> "That can be done manually" does not get you the check mark in
> the list of features. Management wants idiot-resistant security.
man dd
man bash
idiot proof != kernel side
And it's still a waste of time because you don't have enough guarantees about
things like drive layout.
On Mon, 12 May 2003 17:51:25 EDT, Chuck Ebbert said:
> Alan Cox wrote:
>
> >> ...and on a related topic, if someone wrote a patch to optionally clear
> >> the swap area at swapoff would it ever be accepted?
> >
> > man dd ?
>
> "That can be done manually" does not get you the check mark in
> the list of features. Management wants idiot-resistant security.
In particular, the code that handles the zeroing out of resource objects
before re-use needs to be "inside" the trusted-base perimeter. This has
been well-understood for years - even my August 83 copy of the Orange Book
says (for class C2):
2.2.1.2 Object Reuse
All authorizations to the information contained within a storage object
shall be revoked prior to initial assignment, allocation, or reallocation
to a subject from the TCB's pool of unused storage objects. No information,
including encrypted representations of information, produced by a prior
subject's actions is to be available to any subject that obtains access
to an object that has been released back to the system.
(OK.. it doesn't have to be in-kernel, but the function *does* have to
be inside the TCB, not out in random userland)...
On Mon, 2003-05-12 at 23:12, [email protected] wrote:
> > "That can be done manually" does not get you the check mark in
> > the list of features. Management wants idiot-resistant security.
>
> In particular, the code that handles the zeroing out of resource objects
> before re-use needs to be "inside" the trusted-base perimeter. This has
> been well-understood for years - even my August 83 copy of the Orange Book
> says (for class C2):
1. Base Linux is not C2 certified
2. C2 is obsolete
3. NSA SELinux can do the needed stuff from scanning the code
4. Even then data erasure is not guaranteed because of the drive logic
So you are back to crypting swap in the first place
On Mon, 12 May 2003 22:19:51 BST, Alan Cox said:
> 1. Base Linux is not C2 certified
> 2. C2 is obsolete
Right.. but the point was that the object-reuse stuff was known 20 years ago
to have to be inside the TCB.... And in the Linux world, having /etc/rc?.d/
and all the dependent code inside the TCB is just... ugly.. ;)
On Mon, 12 May 2003 17:51:25 EDT, Chuck Ebbert said:
> > man dd ?
>
> "That can be done manually" does not get you the check mark in
> the list of features. Management wants idiot-resistant security.
It has nothing to do with idiot-resistance. Why should this multi-write
operation be done in kernel ? mkswap is a usermode program. mkfs is a
usermode program. If you want to have a wipeswap script that copies a
chunk of your /dev/zero to the swap, it should also be in usermode. Just
run it in whichever rc file you use to swapoff.
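A wipeswap helper of the kind suggested here could be as small as the sketch below. It is an assumption-laden example, not an existing tool: the function name and its single argument are invented, it relies on head -c (widely available but not in POSIX) and on blockdev from util-linux for block devices, and it does nothing about drive-level remapping or caching.

```shell
# Overwrite a swap file or partition with zeros, in place.
# Only call this after swapoff has succeeded for the same area. The swap
# signature is overwritten too, so run mkswap before enabling it again.
wipeswap() {
    target=$1
    if [ -b "$target" ]; then
        size=$(blockdev --getsize64 "$target")      # block device: ask the kernel
    else
        size=$(wc -c < "$target" | tr -d ' ')       # regular file: byte count
    fi
    [ -n "$size" ] || return 1
    # conv=notrunc stops dd from truncating the file first, so on a
    # conventional filesystem the existing blocks are overwritten rather
    # than freed and left behind on disk.
    head -c "$size" /dev/zero |
        dd of="$target" bs=1048576 conv=notrunc 2>/dev/null || return 1
    sync
}
```

A typical call from an rc script would be something like `swapoff /dev/sdXN && wipeswap /dev/sdXN`, where the device name is a placeholder.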
However, it'll just give you a false sense of security. First of all, it's
hardware dependent. Second, it won't get wiped in case of a crash (which
is likely to happen when They come to take your disk).
Until linux gets a real encrypted swap (the kind OpenBSD implements), you
can settle for encrypting your whole swap with one random key that gets
lost on reboot. Encrypted loop dev with a key from /dev/random easily
gives you that.
Download the latest loop-AES from http://loop-aes.sourceforge.net/ and
follow the "Encrypting swap on 2.4 kernels" section in README.
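For comparison, on a current system the same "one random key that is lost on reboot" idea is usually built with dm-crypt rather than loop-AES. The sketch below swaps in that technique: it assumes cryptsetup is installed, the function name random_key_swap_cmds and the mapping name cryptswap are invented, and the device is a placeholder. It only prints the commands, so nothing is touched until they are reviewed and run as root.

```shell
# Print the commands that set up swap encrypted with a throwaway key.
# Assumes dm-crypt and cryptsetup (not the loop-AES setup recommended in
# the thread). $1 is the swap partition, $2 an optional mapping name.
random_key_swap_cmds() {
    dev=$1
    name=${2:-cryptswap}
    cat <<EOF
cryptsetup open --type plain --cipher aes-xts-plain64 --key-size 256 --key-file /dev/urandom $dev $name
mkswap /dev/mapper/$name
swapon /dev/mapper/$name
EOF
}
```

The key is read from /dev/urandom each time the mapping is opened and is never stored, so whatever was swapped out before a reboot becomes unreadable afterwards.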
You want it secure, never write it to disk. If that is not an option,
then all that is written to a disk must be encrypted. Anything less is
a placebo. Anyways as Alan mentioned:
> 4. Even then data erasure is not guaranteed because of the drive logic
From the write speed differences I've seen on my own system between
writing zero filled buffers and random data filled buffers it looks like
a good number of drives do zero filled block write optimizations. From
the effective write rates on a couple of my drives it looks like they are
just marking the blocks as zero in a master table rather than really
writing zeros out to them.
- Bryan
Yoav Weiss wrote:
> Until linux gets a real encrypted swap (the kind OpenBSD implements), you
> can settle for encrypting your whole swap with one random key that gets
> lost on reboot. Encrypted loop dev with a key from /dev/random easily
> gives you that.
>
> Download the latest loop-AES from http://loop-aes.sourceforge.net/ and
> follow the "Encrypting swap on 2.4 kernels" section in README.
Yoav Weiss wrote:
>> "That can be done manually" does not get you the check mark in
>> the list of features. Management wants idiot-resistant security.
>
> It has nothing to do with idiot-resistance. Why should this multi-write
> operation be done in kernel ? mkswap is a usermode program. mkfs is a
> usermode program. If you want to have a wipeswap script that copies a
> chunk of your /dev/zero to the swap, it should also be in usermode. Just
> run it in wherever rc file you use to swapoff.
And when I type 'swapoff' at the command line the whole scheme fails
unless I am a perfect robot sysadmin and always remember to wipe the
file. This needs to 'fail safe' and it needs to be done within the kernel
to be considered a working feature.
Alan Cox wrote:
> 1. Base Linux is not C2 certified
That could be fixed... (right?) Filesystems returning data past the
end of what the user wrote might be a big problem though -- this must
be guaranteed even in obscure corner cases.
> 2. C2 is obsolete
Obsolete or not, it is mandatory for some people. No check box,
no purchase order (or no certificate of operation.)
> 3. NSA SELinux can do the needed stuff from scanning the code
But will it get merged?
> 4. Even then data erasure is not guaranteed because of the drive logic
People who really care require the drive be reduced to pieces small
enough to fit through a sieve with ~2mm holes in it before it leaves
their sight. For the rest, overwrite of the swap data is a useful if
not 100% reliable step to take. Legitimate users with servers locked
up in secure areas don't really worry about someone unplugging the box
and walking away with it either.
> And when I type 'swapoff' at the command line the whole scheme fails
> unless I am a perfect robot sysadmin and always remember to wipe the
> file. This needs to 'fail safe' and it needs to be done within the kernel
> to be considered a working feature.
>
mv /sbin/swapoff /sbin/swapoff.real
cat >/sbin/swapoff <<'EOF'
#!/bin/sh
# Pass the caller's arguments through; wipe only if swapoff succeeded.
/sbin/swapoff.real "$@" && /sbin/wipeswap
EOF
chmod +x /sbin/swapoff
Any system that doesn't consider this fail-safe enough, shouldn't rely on
this zeroing operation even if performed by the kernel. See the URL I
posted in this thread about encrypted swap.
> cat >/sbin/swapoff
> #!/bin/sh
> /sbin/swapoff.real
> /sbin/wipeswap
> ^D
> chmod +x /sbin/swapoff
OK...
# rpm --freshen mount-2.11n-12.rpm
swapoff gets silently replaced AFAICT.
On Monday 12 May 2003 20:57, Chuck Ebbert wrote:
> Alan Cox wrote:
> > 1. Base Linux is not C2 certified
>
> That could be fixed... (right?) Filesystems returning data past the
> end of what the user wrote might be a big problem though -- this must
> be guaranteed even in obscure corner cases.
No - C2 evaluation has not been done for almost 3 years. That makes it
impossible to get a C2 evaluation.
> > 2. C2 is obsolete
>
> Obsolete or not, it is mandatory for some people. No check box,
> no purchase order (or no certificate of operation.)
Bullshit - NO OS is C2 anymore. The last certification was given to MS for
NT 4 - about 3 years ago. NONE of the current systems are C2. The best you
can get is "C2 like capability", and that is not a verified operation. And
at "C2 like capability" Linux does just as well as M$. Are the log files as
pretty as would be desired? No. But they are acceptable for all US usage
where a UNIX system is acceptable. (And don't even try to claim M$ produces
a secured box... I haven't even been able to find the "trusted facility
manual" for the released systems... which is a requirement for operation.)
> > 3. NSA SELinux can do the needed stuff from scanning the code
>
> But will it get merged?
I don't know, but I hope so. (2.7 maybe?)
> > 4. Even then data erasure is not guaranteed because of the drive logic
>
> People who really care require the drive be reduced to pieces small
> enough to fit through a sieve with ~2mm holes in it before it leaves
> their sight. For the rest, overwrite of the swap data is a useful if
> not 100% reliable step to take. Legitimate users with servers locked
> up in secure areas don't really worry about someone unplugging the box
> and walking away with it either.
These are also the same people that will not (or should not) accept laptops in
their environment.
On Monday 12 May 2003 17:57, Yoav Weiss wrote:
> On Mon, 12 May 2003 17:51:25 EDT, Chuck Ebbert said:
> > > man dd ?
> >
> > "That can be done manually" does not get you the check mark in
> > the list of features. Management wants idiot-resistant security.
>
> It has nothing to do with idiot-resistance. Why should this multi-write
> operation be done in kernel ? mkswap is a usermode program. mkfs is a
> usermode program. If you want to have a wipeswap script that copies a
> chunk of your /dev/zero to the swap, it should also be in usermode. Just
> run it in wherever rc file you use to swapoff.
>
> However, it'll just give you false sense of security. First of all, its
> hardware dependent. Second, it won't get wipe in case of a crash (which
> is likely to happen when They come to take your disk).
It is also not a valid wipe either.
This particular object reuse assumes the hardware is in a secured area. If it
is in a secured area then you don't need to wipe it. It remains completely
under the system's control (even during a crash and reboot). The interval
between crash and reboot is covered by the requirement to be in a secured
area. As long as the system doesn't reallocate the data to another process
everything is acceptable. The only condition is that the area is erased
BEFORE allocation. And if it is erased by the NEW data, then the old data
is gone. In the case of swap, the page is first zeroed (given to process as
a demand zero, or preloaded with the user's data), then written to swap and
allocated. This last two-step is the only question - is it written first, then
allocated, or allocated and written?
> Until linux gets a real encrypted swap (the kind OpenBSD implements), you
> can settle for encrypting your whole swap with one random key that gets
> lost on reboot. Encrypted loop dev with a key from /dev/random easily
> gives you that.
Ahhh not a good idea if you want job restart or suspend/resume. And large
systems DO want a job restart... as do laptops. During suspension you can
do anything to the disk (as in remove it, insert in another system, read
it, then put it back ...)
Such low level object reuse is not viable. This is from the same book that
states it must be overwritten 3-5 times (something in that range). The full
overhead reduces throughput to about 25-50% of total capacity. The object
reuse erasure applies to memory as well.
You also cannot guarantee erasure on disk errors either. The bad sectors are
ignored, and remapped to alternate sectors - the data is not deleted. Full
object reuse was written long before disks were performing this action. It
assumes the remapping is done by the OS, which can therefore also overwrite
the bad sectors.
On 12 May 2003, Alan Cox wrote:
> On Mon, 2003-05-12 at 23:12, [email protected] wrote:
> > In particular, the code that handles the zeroing out of resource objects
> > before re-use needs to be "inside" the trusted-base perimeter. This has
> > been well-understood for years - even my August 83 copy of the Orange Book
> > says (for class C2):
>
> 1. Base Linux is not C2 certified
No, but it can become C2 certified without recoding the kernel, provided
appropriate steps are taken on the system. C2 is not mandatory
access control, it is still in the realm of discretionary access control.
According to the orange book standard, it's the highest DAC standard.
> 2. C2 is obsolete
Or in Common Criteria Equivalence, achieve EAL3 in various system security
classes. No labelled security classes would be required.
> 3. NSA SELinux can do the needed stuff from scanning the code
Yes but SELinux is not an attempt to get C2 ... it is more to try and get
B1 equivalence (Labelled security) however B1 by itself is no good because
it doesn't secure against the security officer itself, which is a
desirable feature in order to achieve EAL4 for labeled security under
common-criteria.
The Linux kernel, as a whole, will not meet the EAL4 requirements as it
stands, because it is not documented as per the EAL4 code documentation
standard. This is from my direct discussions with two independent CC
evaluation laboratory top technical advisors. The sheer size of the Linux
kernel code, and the speed at which it develops and enhances, makes it a
bad candidate.
> 4. Even then data erasure is not guaranteed because of the drive logic
Data-erasure is not a requirement to ensure a trusted-operating system, if
it can be demonstrated that on a running system no one except the
kernel-user can get access to the swap information. If along with this it
can be proven (certified really) that the physical security of the system
meets a certain level of assurance then you can consider the environment
trusted.
> So you are back to crypting swap in the first place
>
But isn't swap crypting fun ? :-) Running encrypted swap is okay so long
as we throw away the key after each session. This can be easily (famous
last words) achieved under crypto kernels. I am not certain if such
functionality is being contemplated for the Linux kernel along with the
new cryptoloop stuff. If there isn't, I can volunteer to put something like
that in - if we are interested. Are we?
Cheers,
Ahmed.
> # rpm --freshen mount-2.11n-12.rpm
>
>
> swapoff get silently replaced AFAICT.
>
This would also happen with the encrypted swap solution since it
replaces swapon.
Wouldn't it happen on upgrades even if you put it all in the kernel, since
the stock redhat kernel would have it off by default ? (It will be
off by default because most people won't need this feature and it'll have
a performance penalty) ?
I guess when you maintain a non-default system you must have some
post-upgrade procedure. I know I do.
> > Until linux gets a real encrypted swap (the kind OpenBSD implements), you
> > can settle for encrypting your whole swap with one random key that gets
> > lost on reboot. Encrypted loop dev with a key from /dev/random easily
> > gives you that.
>
> Ahhh not a good idea if you want job restart or suspend/resume. And large
> systems DO want a job restart... as do laptops. During suspension you can
> do anything to the disk (as in remove it, insert in another system, read
> it, then put it back ...)
>
While I agree with most of what you said in your post, I fail to see the
problem with this one. My laptop has encrypted swap and it poses no
problem when suspending. The disk can be taken out and read, but it's
encrypted with a random key that exists only in memory so it's harder to
extract. (and if someone can extract my memory, the swap is the least of
my concerns).
Maybe you're talking about hibernation rather than suspension. (when
everything is written to disk and the memory is wiped). In this case,
again, the encrypted swap's key is the least of your concern since all
your memory is written to disk plaintext anyway. If hibernation is
implemented in software, you can have it encrypted too, and require a
user-supplied key upon restarting. If it's implemented by the hardware, I
guess there isn't much you can do. Just have the kernel do the
hibernation into an encrypted loopdev and halt the machine.
Masud wrote:
> But isn't swap crypting fun ? :-) Running encrypted swap is okay so long
> as we throw away the key after each session. This can be easily (famous
> last words) achieved under crypto kernels. I am not certain if such
> functionaility is being contemplated for the Linux kernel along with the
> new cryptoloop stuff, if there isn't i can volunteer to put something
> like that in - if we are interested. Are we?
See http://loop-aes.sourceforge.net/
The README already explains how to use it as encrypted swap. I've been
using it for quite a while without problems.
If you feel like volunteering for an encrypted swap, I suggest the model
used by OpenBSD. Instead of using an encrypted swap dev with one random
key, they seem to have a per-process key and encrypt swap areas of the
process with its key. When a process dies, its key dies with it, so the
swap space it used is considered clean without having to wait for an
overwrite or a reboot.
Another fun project is encrypted hibernation (suspend-to-disk). Once the
kernel contains a stable hibernation option, I'm certainly going to
encrypt it.
Jesse Pollard wrote:
> No - C2 evaluation has not been done for almost 3 years. That makes it
> impossible to get a C2 evaluation.
The people who used to require that still have lists of approved
operating systems. Linux is not on that list.
> And
> "C2 like capability" Linux does just as well as M$. Are the log files as
> pretty as would be desired? No. But they are acceptable for all US usage
> where a UNIX system is acceptable.
"No audit trail" pretty much kills it right from the get-go.
Base Solaris has it. And I'm pretty sure HP-UX 9 did but that was
a while ago...
And real ACLs are only now getting into Linux... how long till someone
certifies that they work is anyone's guess.
> These are also the same people that will not (or should not) accept
> laptops in their environement.
Untrusted users shouldn't be allowed cellphones, PDAs, laptops or
similar.
Next step up is probably full body cavity search to make sure you
haven't hidden a Microdrive somewhere...
Jesse Pollard wrote:
> > However, it'll just give you false sense of security. First of all, its
> > hardware dependent. Second, it won't get wipe in case of a crash (which
> > is likely to happen when They come to take your disk).
>
> It is also not a valid wipe either.
>
> This particular object reuse assumes the hardware is in a secured area. If it
> is in a secured area then you don't need to wipe it. It remains completely
> under the systems control (even during a crash and reboot). The interval
> between crash and reboot is covered by the requirement to be in a secured
> area.
...until the admin walks in, shuts down the system, puts it on a cart
and hauls it out the door. Is he going to wipe the swap area before he
does that? Sure, you can write a procedure that says that's what he does
but he will not follow it (been there done that.)
Chuck Ebbert wrote:
> And real ACLs are only now getting into Linux... how long till someone
>certifies that they work is anyone's guess.
>
IIRC ACLs have been available in linux for years,
but there hasn't been an overwhelming demand -
but yes, linux vendors are only now beginning to
ship that sort of thing and the official process can
take awhile unless someone has a specific need.
Joe
On Tuesday 13 May 2003 08:44, Yoav Weiss wrote:
> > > Until linux gets a real encrypted swap (the kind OpenBSD implements),
> > > you can settle for encrypting your whole swap with one random key that
> > > gets lost on reboot. Encrypted loop dev with a key from /dev/random
> > > easily gives you that.
> >
> > Ahhh not a good idea if you want job restart or suspend/resume. And large
> > systems DO want a job restart... as do laptops. During suspension you can
> > do anything to the disk (as in remove it, insert in another system, read
> > it, then put it back ...)
>
> While I agree with most of what you said in your post, I fail to see the
> problem with this one. My laptop has encrypted swap and it poses no
> problem when suspending. The disk can be taken out and read, but its
> encrypted with a random key that exists only in memory so its harder to
> extract. (and if someone can extract my memory, the swap is the least of
> my concerns).
Not the above - in that one it is obvious that the key can be compromised.
> Maybe you're talking about hibernation rather than suspension. (when
> everything is written to disk and the memory is wiped). In this case,
> again, the encrypted swap's key is the least of your concern since all
> your memory is written to disk plaintext anyway. If hibernation is
> implemented in software, you can have it encrypted too, and require a
> user-supplied key upon restarting. If its implemented by the hardware, I
> guess there isn't much you can do. Just have the kernel do the
> hibernation into an encrypted loopdev and halt the machine.
This one...
Though part of it has to do with large systems and crashes. What is done
on some of these systems is to periodically checkpoint batch jobs. If the
kernel crashes, the job hits a physical memory failure, or a cpu dies (one
out of many...), the system resumes processing (after reboot, or after
removing the memory page from the valid list ... whatever the recovery
method) and then reloads/resumes the processes.
If the random key is lost due to a crash, then reload/resume fails.
On Tuesday 13 May 2003 09:45, Chuck Ebbert wrote:
> Jesse Pollard wrote:
> > > However, it'll just give you false sense of security. First of all,
> > > its hardware dependent. Second, it won't get wipe in case of a crash
> > > (which is likely to happen when They come to take your disk).
> >
> > It is also not a valid wipe either.
> >
> > This particular object reuse assumes the hardware is in a secured area.
> > If it is in a secured area then you don't need to wipe it. It remains
> > completely under the systems control (even during a crash and reboot).
> > The interval between crash and reboot is covered by the requirement to be
> > in a secured area.
>
> ...until the admin walks in, shuts down the system, puts it on a cart
> and hauls it out the door. Is he going to wipe the swap area before he
> does that? Sure, you can write a procedure that says that's what he does
> but he will not follow it (been there done that.)
If you are in that situation, then what keeps him from just pulling the plug?
Again, the swap doesn't get purged.
If you are in a situation where swap must be purged (as I am) then you also
know you can't just walk out the door with the system. There must be property
passes, security passes, AND inventory documents that must also show the
contents of the purged disks... signed off by the information security
officer.
On Tuesday 13 May 2003 09:45, Chuck Ebbert wrote:
> Jesse Pollard wrote:
> > No - C2 evaluation has not been done for almost 3 years. That makes it
> > impossible to get a C2 evaluation.
>
> The people who used to require that still have lists of approved
> operating systems. Linux is not on that list.
Neither are Windows, OS/2, or Mac OS 5/6/7/8/9/10, for that matter.
> > And
> > "C2 like capability" Linux does just as well as M$. Are the log files as
> > pretty as would be desired? No. But they are acceptable for all US usage
> > where a UNIX system is acceptable.
>
> "No audit trail" pretty much kills it right from the get-go.
It does have audit trails... you do have to turn on process accounting. Are
they pretty... no. But it is equivalent to base Solaris (well, before 2). You
also have to turn on logs from every service daemon.
> Base Solaris has it. And I'm pretty sure HP-UX 9 did but that was
> a while ago...
No current OS has C2 certifications - they have an EAL3 or 4. But not C2.
> And real ACLs are only now getting into Linux... how long till someone
> certifies that they work is anyone's guess.
Real ACLs were available about 2-3 years ago. They just were not accepted
for inclusion, and the patch died.
First some organization has to come up with a good bit of $$$. Evaluations
are not cheap. It would take over a year to get an EAL3, and longer yet to
get 4.
> > These are also the same people that will not (or should not) accept
> > laptops in their environment.
>
> Untrusted users shouldn't be allowed cellphones, PDAs, laptops or
> similar.
>
> Next step up is probably full body cavity search to make sure you
> haven't hidden a Microdrive somewhere...
Remember the body scanner in that Mars-based Schwarzenegger movie...
On Tue, 13 May 2003, Jesse Pollard wrote:
> Though part of it has to do with large systems and crash. What is done
> on some of these systems is to periodically checkpoint batch jobs. If the
> kernel crashes, the job has a physical memory failure, a cpu dies (one out
> of many...) the system resumes processing (after reboot, or removing the
> memory page from the valid list ... whatever recovery method) to then
> reload/resume the processes.
>
> If the random key is lost due to a crash, then reload/resume fails.
>
I thought checkpointing usually takes the whole virtual memory of the
process, regardless of what's in swap and what's in real memory, in which
case the encrypted swap key is not an issue. If this isn't the case, I
guess the random key has to be preserved as a part of the checkpointing.
Of course, this defeats the whole purpose of encrypted swap unless
checkpointing is done into an encrypted device too. This device must be
encrypted anyway, regardless of swap, because the whole process image will
be stored there.
Yoav Weiss wrote:
> Masud wrote:
>
>
>>But isn't swap crypting fun ? :-) Running encrypted swap is okay so long
>>as we throw away the key after each session. This can be easily (famous
>>last words) achieved under crypto kernels. I am not certain if such
>>functionality is being contemplated for the Linux kernel along with the
>>new cryptoloop stuff, if there isn't i can volunteer to put something
>>like that in - if we are interested. Are we?
>
>
> See http://loop-aes.sourceforge.net/
> The README already explains how to use it as encrypted swap. I've been
> using it for quite a while without problems.
>
I am familiar with Jari's cryptoloop and related tools and have studied
and am using them for some applications on a few environments.
> If you feel like volunteering for an encrypted swap, I suggest the model
> used by OpenBSD. Instead of using an encrypted swap dev with one random
> key, they seem to have a per-process key and encrypt swap areas of the
> process with its key. When a process dies, its key dies with it, so the
> swap space it used is considered clean without having to wait for an
> override or a reboot.
>
This definitely sounds very interesting. I can start looking at this
problem seriously and see if i can put something together for 2.5.x
since crypto subsystem routines are largely in place.
> Another fun project is encrypted hibernation (suspend-to-disk). Once the
> kernel contains a stable hibernation option, I'm certainly going to
> encrypt it.
>
Yes that too could be a fun thing to do.
Ahmed
On Tue, 13 May 2003, Ahmed Masud wrote:
> This definitely sounds very interesting. I can start looking at this
> problem seriously and see if i can put something together for 2.5.x
> since crypto subsystem routines are largely in place.
Yes, it sounds like an interesting project. Check out OpenBSD's paper
about this: http://www.openbsd.org/papers/swapencrypt.ps
Let me know when you get it rolling. I'll try to help where I can.
I just hope it has a chance to be included.
> > Another fun project is encrypted hibernation (suspend-to-disk). Once the
> > kernel contains a stable hibernation option, I'm certainly going to
> > encrypt it.
> >
>
> Yes that too could be a fun thing to do.
Actually, I forgot that swsusp is now included. I haven't tried it in a
while. Does anyone know if it's stable enough to start playing with
encrypting it?
Yoav Weiss
On Tue, May 13, 2003 at 05:52:17AM -0400, Chuck Ebbert wrote:
> > cat >/sbin/swapoff
> > #!/bin/sh
> > /sbin/swapoff.real
> > /sbin/wipeswap
> > ^D
> > chmod +x /sbin/swapoff
>
> OK...
>
> # rpm --freshen mount-2.11n-12.rpm
>
> swapoff gets silently replaced AFAICT.
Your arguments for swap wiping in the kernel aren't making sense.
It's the distribution that must be made secure, not just the kernel.
And a secure distribution wouldn't nuke its own version of swapoff.
Of course, as others have already noted, you really want encrypted
swap rather than swap wiping at shutdown time.
miket
Jesse Pollard wrote:
>> "No audit trail" pretty much kills it right from the get-go.
>
> It does have audit trails... you do have to turn on process accounting. Are
> they pretty... no. But it is equivalent to base Solaris (well, before 2). You
> also have to turn on logs from every service daemon.
It's almost there but not quite... but there is hope. :)
On Wed, 14 May 2003, Yoav Weiss wrote:
> On Tue, 13 May 2003, Ahmed Masud wrote:
>
> Yes, it sounds like an interesting project. Check out openbsd's paper
> about this: http://www.openbsd.org/papers/swapencrypt.ps
Thank you for this paper, it is a fun read. I do think however that a
few implementation differences should take place:
1. We should not enforce Rijndael as the only choice.
2. Every page should be encrypted iff it is marked with some flag. This gives
a generic enough hook to create a swap_encrypt_policy type function to
determine whether it is desirable to encrypt a particular page or not.
2a. The above flag may also be set or cleared by the page-owner process on
a page-by-page basis (something akin to mlock()).
3. A slightly more sophisticated timeout framework should be created with
the ability to enforce expiry or request expiry extensions (up to some type
of a system hard limit?) on a per-page basis.
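A minimal sketch of what such a flag-driven hook could look like, written as a user-space Python model rather than kernel code. The flag bit PG_ENCRYPT, the Page class and the encrypt_all switch are assumptions made up for illustration; only the name swap_encrypt_policy comes from the list above.

```python
from dataclasses import dataclass

PG_ENCRYPT = 0x1  # hypothetical flag bit, not a real kernel page flag


@dataclass
class Page:
    owner_pid: int
    flags: int = 0


def mark_encrypt(page: Page, on: bool = True) -> None:
    """Set or clear the flag, as the page-owner process might (point 2a)."""
    if on:
        page.flags |= PG_ENCRYPT
    else:
        page.flags &= ~PG_ENCRYPT


def swap_encrypt_policy(page: Page, encrypt_all: bool = False) -> bool:
    """Decide whether a page should be encrypted on its way to swap."""
    return encrypt_all or bool(page.flags & PG_ENCRYPT)
```

Keeping the decision in one function means an "encrypt everything" policy and a "flagged pages only" policy differ by a single argument.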
Please comment.
This is an aside: should we do anything about core dumps?
> Let me know when you get it rolling. I'll try to help where I can.
> I just hope it has a chance to be included.
I will start looking at it seriously within the next couple of days actually.
I looked at the swap stuff in mm code yesterday for the first time and it
seems (eerily) straightforward, and I know I am going to run into an
unseen brick wall :-).
I would suspect that somewhere between the io request generated by
swap_readpage and swap_writepage crypto can be hooked in... haven't yet
determined where/when the key generation should take place.
Cheers,
Ahmed Masud.
On Wed, 14 May 2003, Mike Touloumtzis wrote:
> It's the distribution that must be made secure, not just the kernel.
> And a secure distribution wouldn't nuke its own version of swapoff.
>
This is not a reply in particular to the swapon/swapoff issue that preceded
the above, but a general comment about security since there seem to be
misconceptions about who should be responsible for it.
Level of security is a matter of trust. Should the kernel trust a
distribution provider? No, that is not a reasonable request, because we do
not control their environment and evaluation procedures and there are no
guarantees about what happens between the channel that provides the
operating system and the time it gets installed on a system.
In a secure environment, trusting _any_ user-space application or
combination of user-space applications, is a poor approach to security.
(I refer you to bind, openssl, sendmail, apache and a gazillion other
userspace applications which have exhibited security flaws).
For that matter, trusting the entire kernel also has its flaws (I refer
to the silly ptrace bug found not too long ago in the 2.4.x series).
[Sufficient] security is the state where we can fundamentally guard
against any deliberate sabotage or unintentional mistakes in the
environment.
A reasonable approach to achieving this is to provide a controlled
(single!?) point of evaluation because it is far less likely that issues
with such a controlled point of evaluation for all security will go
unnoticed.
Firewalls are a good example of this. Since we don't trust users on an
internal network to be able to create secure environments individually
w.r.t. the outside world, we create a single point of evaluation into it
through a firewall. Similarly, a good security system will treat anything
coming into it as not particularly trustworthy unless otherwise proven.
(Guilty until proven innocent approach works *teehee*)
Cheers,
Ahmed.
On Wed, 14 May 2003, Ahmed Masud wrote:
> Thank you for this paper, it is a fun read. I do think however that a
> few implementation differences should take place:
>
> 1. We should not enforce Rijndael as the only choice.
I agree.
>
> 2. Every page should be encrypted iff it is marked with some flag. This gives
> a generic enough hook to create a swap_encrypt_policy type function to
> determine whether it is desirable to encrypt a particular page or not.
Good idea. Of course, a policy that only encrypts the secret stuff may
have an impact on security, but it makes sense to let the user decide on a
policy.
>
> 2a. The above flag may also be set or cleared by the page-owner process on
> a page-by-page basis (something akin to mlock()).
And just like mlock(), only root will be able to call it.
>
> 3. A slightly more sophisticated timeout framework should be created with
> the ability to enforce expiry or request expiry extensions (up to some type
> of a system hard limit?) on a per-page basis.
>
Why is this one needed ?
> This is an aside: should we do anything about core dumps?
>
That's a good question. I see four options:
1. Dump the core plaintext. (sucks but convenient for users).
2. In the core, zero the pages that would be encrypted when swapped out.
On some policies where only things like keys are encrypted, the core
will be usable. On others it won't. (Not sure it's really an option).
3. If the core contains pages that should be encrypted, dump it encrypted
with some system-wide (or per-uid) key generated on the first core
dump. The key will be available to the user via some /proc interface.
It's up to the user to be smart and take the core to another host and
decrypt_core(1) it there (or just decrypt_core(1) it to an encrypted
filesystem). In any case, the decrypted core or the system-wide key
are never written to disk.
4. Refuse to dump core of a process that has pages that should be
encrypted.
Do you see more options ?
Anyway, it should probably be policy controlled.
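To make "policy controlled" concrete, here is a rough Python model of a dispatcher over the options listed above. Everything in it (the CorePolicy names, the page representation as (data, sensitive) pairs, the encrypt callable) is an assumption for illustration, not an existing kernel or libc interface.

```python
from enum import Enum


class CorePolicy(Enum):
    PLAINTEXT = 1   # option 1: dump everything as-is
    ZERO = 2        # option 2: zero the pages that would be encrypted
    ENCRYPT = 3     # option 3: encrypt the dump if it holds such pages
    REFUSE = 4      # option 4: no dump at all if it holds such pages


def dump_core(pages, policy, encrypt=None):
    """pages is a list of (data, sensitive) pairs.

    Returns the bytes that would be written, or None if the dump is refused.
    """
    any_sensitive = any(sensitive for _, sensitive in pages)
    if policy is CorePolicy.REFUSE and any_sensitive:
        return None
    if policy is CorePolicy.ZERO:
        # bytes(n) yields n zero bytes, so sensitive pages keep their size.
        return b"".join(bytes(len(d)) if s else d for d, s in pages)
    image = b"".join(d for d, _ in pages)
    if policy is CorePolicy.ENCRYPT and any_sensitive:
        if encrypt is None:
            raise ValueError("ENCRYPT policy needs an encrypt callable")
        return encrypt(image)
    return image
```

A process with no sensitive pages gets an ordinary plaintext core under every policy, which keeps the common case convenient.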
> I looked at the swap stuff in mm code yesterday for the first time and it
> seems (eerily) straightforward, and I know I am going to run into an
> unseen brick wall :-).
I'm not familiar with this code either and somehow got the same feeling
when I looked into it :)
>
> I would suspect that somewhere between the io request generated by
> swap_readpage and swap_writepage crypto can be hooked in... haven't yet
> determined where/when the key generation should take place.
Probably at process initialization during fork. The key must be ready
before the process gets its first chance to allocate pages that may be
swapped out.
>
> Cheers,
>
> Ahmed Masud.
>
>
Bye,
Yoav Weiss
On Tuesday 13 May 2003 17:21, Yoav Weiss wrote:
> On Tue, 13 May 2003, Jesse Pollard wrote:
> > Though part of it has to do with large systems and crash. What is done
> > on some of these systems is to periodically checkpoint batch jobs. If the
> > kernel crashes, the job has a physical memory failure, a cpu dies (one
> > out of many...) the system resumes processing (after reboot, or removing
> > the memory page from the valid list ... whatever recovery method) to then
> > reload/resume the processes.
> >
> > If the random key is lost due to a crash, then reload/resume fails.
>
> I thought checkpointing usually takes the whole virtual memory of the
> process, regardless of whats in swap and whats in real memory, in which
> case the encrypted swap key is not an issue. If this isn't the case, I
> guess the random key has to be preserved as a part of the checkpointing.
> Of course, this beats the whole purpose of encrypted swap unless
> checkpointing is done into an encrypted device too. This device must be
> encrypted anyway, regardless of swap, because the whole process image will
> be stored there.
Depends on the system - I believe Cray used to have the option of
checkpointing to the swap device since otherwise the system would be
oversubscribed and subject to deadlock hangs. Other configurations will
do exactly what you said, and have the same problems that swap does.
On Wed, 14 May 2003, Yoav Weiss wrote:
> On Wed, 14 May 2003, Ahmed Masud wrote:
>
> >
> > 3. A slightly more sophisticated timeout framework should be created with
> > the ability to enforce expiry or request expiry extensions (up to some type
> > of a system hard limit?) on a per-page basis.
> >
>
> Why is this one needed ?
>
Well we definitely need a way to time out keys. The other reason is to be
able to "change your mind" about it while the key is being used. This may
not be a useful thing for now but think of encrypted swaps on the
infamous: oopsies-i-tripped-over-a-wire-and-disconnected-network-file-system
Here we have a situation where we want to not have an expired key with
valid data hanging out there.
Or are we saying that expiration only affects encryption and that the
decryption counterpart sticks around until its reference count goes to
zero? On the surface this seems to be easier, although not sure if it
makes us miss any situation.
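A small Python model of that second reading, where expiry only blocks new encryption and the key material is dropped once the last page using it is gone. The SwapKey class and its method names are hypothetical and only illustrate the reference-counting idea.

```python
class SwapKey:
    """Hypothetical key object: expiry stops new encryption only."""

    def __init__(self, key: bytes):
        self._key = key
        self.expired = False
        self.refcount = 0  # swapped-out pages still encrypted under this key

    def get_for_encrypt(self) -> bytes:
        if self.expired or self._key is None:
            raise KeyError("key expired; new pages need a fresh key")
        self.refcount += 1
        return self._key

    def get_for_decrypt(self) -> bytes:
        if self._key is None:
            raise KeyError("key already destroyed")
        return self._key

    def put(self) -> None:
        """Called when a page encrypted under this key is read back or freed."""
        self.refcount -= 1
        if self.expired and self.refcount == 0:
            self._key = None  # last user gone: drop the key material

    def expire(self) -> None:
        self.expired = True
        if self.refcount == 0:
            self._key = None
```

With this shape an expired key never strands valid data: pages already on disk stay readable until they are released, and nothing new is written under the old key.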
Cheers,
Ahmed.
On Wed, 14 May 2003, Yoav Weiss wrote:
>
> That's a good question. I see four options:
> 1. Dump the core plaintext. (sucks but convenient for users).
> 2. In the core, zero the pages that would be encrypted when swapped out.
> On some policies where only things like keys are encrypted, the core
> will be usable. On others it won't. (Not sure it's really an option).
> 3. If the core contains pages that should be encrypted, dump it encrypted
> with some system-wide (or per-uid) key generated on the first core
> dump. The key will be available to the user via some /proc interface.
> It's up to the user to be smart and take the core to another host and
> decrypt_core(1) it there (or just decrypt_core(1) it to an encrypted
> filesystem). In any case, the decrypted core or the system-wide key
> are never written to disk.
> 4. Refuse to dump core of a process that has pages that should be
> encrypted.
>
> Do you see more options ?
> Anyway, it should probably be policy controlled.
These are all very good options, of course things get hairy don't they :)
Perhaps in the beginning any of 1, 2 and 4 as per a system-wide dump
policy. Maybe even a setrlimit extension and use that as a jump point to
make a per-user policy?
Cheers,
Ahmed.
On Wed, 14 May 2003, Ahmed Masud wrote:
> Well we definitely need a way to timeout keys. The other reason is to be
Why ? We keep a key per-process, and wipe it from memory as soon as the
process dies. It's not time-related.
> able to "change your mind" about it while the key is being used. This may
> not be a useful thing for now but think of encrypted swaps on the
> infamous: oopsies-i-tripped-over-a-wire-and-disconnected-network-file-system
Your swapfile is coming from a remote NFS mount ? If your swap becomes
unavailable due to network problems, the swapped-out processes are
probably doomed anyway.
>
> Here we have a situation where we want to not have an expired key with
> valid data hanging out there.
Data is valid iff the owning process still lives. When do we need to
expire a key while its process is still alive (or keep a key valid for
pages of a dead process) ?
Yoav
> > Do you see more options ?
> > Anyway, it should probably be policy controlled.
>
> These are all very good options, of course things get hairy don't they :)
Certainly. Option 3 certainly doesn't have to be implemented in the first
version :)
In fact, the first version could ignore the core dump issue and setrlimit
will be used to avoid core dumps of sensitive processes. In the future,
it can be handled more gracefully.
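For reference, the setrlimit approach as seen from a process that wants to opt out of core dumps, sketched with Python's standard resource module (Unix only). The helper name disable_core_dumps is just an illustration.

```python
import resource


def disable_core_dumps() -> None:
    """Set both the soft and hard core-size limits to zero for this process."""
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
```

Because the hard limit is lowered as well, an unprivileged process cannot raise it again afterwards, and children inherit the limit across fork and exec.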
> Perhaps in the beginning any of 1, 2 and 4 as per a system-wide dump
> policy. Maybe even a setrlimit extension and use that as a jump point to
> make a per-user policy?
Makes sense. A special /proc interface is required only once 3 is
implemented. For everything else, setrlimit will suffice.
>
> Cheers,
>
> Ahmed.
>
Bye,
Yoav
On Wed, 14 May 2003 06:06:56 -0400, Ahmed Masud wrote:
> On Wed, 14 May 2003, Yoav Weiss wrote:
> > On Tue, 13 May 2003, Ahmed Masud wrote:
> >
> > Yes, it sounds like an interesting project. Check out openbsd's paper
> > about this: http://www.openbsd.org/papers/swapencrypt.ps
>
> Thank you for this paper, it is a fun read. I do think however that a
> few implementation differences should take place:
>
> 1. We should not enforce Rijndael as the only choice.
>
> 2. Every page should be encrypted iff it is marked with some flag. This gives
> a generic enough hook to create a swap_encrypt_policy type function to
> determine whether it is desirable to encrypt a particular page or not.
>
> [...]
Just browsed across the white paper, but this doesn't make a lot of
sense to me.
1. Instead of cryptographic filesystems, you could just encrypt the
block device.
2. The only reason not to do so is security. An attacker could use
known-plaintext attacks, since some parts of the metadata can be
reconstructed or guessed easily.
3. Instead of encrypted swap, you could just encrypt the block device.
4. The only reason not to do so is what?
Sorry, beats me, I cannot see any reason. Is there a possible
known-plaintext attack that is not obvious to everyone, at
least not to me?
Jörn
--
A defeated army first battles and then seeks victory.
-- Sun Tzu
On Wed, 14 May 2003, Jörn Engel wrote:
> sense to me.
>
> 1. Instead of cryptographic filesystems, you could just encrypt the
> block device.
> 2. The only reason not to do so is security. An attacker could use
> known-plaintext attacks, since some parts of the metadata can be
> reconstructed or guessed easily.
> 3. Instead of encrypted swap, you could just encrypt the block device.
> 4. The only reason not to do so is what?
>
The idea is to have encryption keys for the pages to be unique on a
per-uid per-process basis. So one user on the system cannot access (even
if they are root) parts of another's private data. To achieve this,
different parts of swap device need to be encrypted with different keys.
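A toy Python model of the per-process key idea. The cipher here is a stand-in keystream built from SHA-256, chosen only so the sketch runs with the standard library; it is not Rijndael and not something to use for real data. The class EncryptedSwap and its methods are hypothetical names.

```python
import hashlib
import os


def _keystream(key: bytes, slot: int, length: int) -> bytes:
    """Toy keystream: SHA-256 over key, swap slot and a counter."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(
            key + slot.to_bytes(8, "big") + counter.to_bytes(8, "big")
        ).digest()
        counter += 1
    return bytes(out[:length])


class EncryptedSwap:
    def __init__(self):
        self.keys = {}   # pid -> key, held in memory only
        self.slots = {}  # slot -> (pid, ciphertext), standing in for the disk

    def swap_out(self, pid: int, slot: int, page: bytes) -> None:
        key = self.keys.setdefault(pid, os.urandom(32))
        stream = _keystream(key, slot, len(page))
        self.slots[slot] = (pid, bytes(a ^ b for a, b in zip(page, stream)))

    def swap_in(self, slot: int) -> bytes:
        pid, data = self.slots[slot]
        key = self.keys[pid]  # raises KeyError once the process has exited
        stream = _keystream(key, slot, len(data))
        return bytes(a ^ b for a, b in zip(data, stream))

    def process_exit(self, pid: int) -> None:
        # The ciphertext stays on the "disk" but can no longer be decrypted.
        self.keys.pop(pid, None)
```

The point of the model is the key lifetime: a slot left behind by a dead process is just noise to whoever reads the device later, with no wipe pass needed.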
Ahmed.
On Wed, 14 May 2003 12:13:03 -0400, Ahmed Masud wrote:
>
> The idea is to have encryption keys for the pages to be unique on a
> per-uid per-process basis. So one user on the system cannot access (even
> if they are root) parts of another's private data. To achieve this,
> different parts of swap device need to be encrypted with different keys.
How do users *know* that root cannot simply bypass this security?
Root, god, what's the difference? ;-)
Jörn
--
"Error protection by error detection and correction."
-- from a university class
Jörn Engel wrote:
> On Wed, 14 May 2003 12:13:03 -0400, Ahmed Masud wrote:
>
>
> How do users *know* that root cannot simply bypass this security?
>
> Root, god, what's the difference? ;-)
Hahah so true so true ... but all gods fall if their worshipers stop
worshiping them ;-)
So ... there are ways. Root is only given this power because it is
allowed that by the operating system kernel. It is always possible to
shunt root out if the kernel chooses to do so. We can actually
construct environments where security officers != system administrators.
Ahmed.
On Wed, 14 May 2003, Jörn Engel wrote:
> On Wed, 14 May 2003 12:13:03 -0400, Ahmed Masud wrote:
> >
> > The idea is to have encryption keys for the pages to be unique on a
> > per-uid per-process basis. So one user on the system cannot access (even
> > if they are root) parts of another's private data. To achieve this,
> > different parts of swap device need to be encrypted with different keys.
>
> How do users *know* that root cannot simply bypass this security?
>
> Root, god, what's the difference? ;-)
>
> Jörn
Well :-) that's sorta true. In the new world the old gods will fall to
give rise to new ones. Worshippers of root will fade in the echoes of the
past ... Rootshunting is possible if the kernel so chooses. Trusted Linux,
which is my personal favourite focus for Linux, would be an environment
without root.
Ahmed.
On Wed, 14 May 2003, Jörn Engel wrote:
> On Wed, 14 May 2003 12:13:03 -0400, Ahmed Masud wrote:
> >
> > The idea is to have encryption keys for the pages to be unique on a
> > per-uid per-process basis. So one user on the system cannot access (even
> > if they are root) parts of another's private data. To achieve this,
> > different parts of swap device need to be encrypted with different keys.
>
> How do users *know* that root cannot simply bypass this security?
>
> Root, god, what's the difference? ;-)
Aside from what Ahmed said about rootless systems, the per-process
encryption reduces the window of opportunity for attackers who gain root
(or physical access).
Try strings(1) on your swap device. You'll be surprised at what you find.
You'll probably recognize passwords you haven't used for a long time, and
a lot of other stuff you didn't expect. Sometimes you can find whole ssh
sessions there, plaintext. (think xterm scroll buffer).
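For anyone who wants to see what that scan amounts to, here is a rough Python equivalent that works on any byte buffer. It only approximates strings(1) (runs of at least four printable ASCII bytes); the function name printable_runs is made up for this sketch.

```python
import re


def printable_runs(buf: bytes, min_len: int = 4):
    """Return runs of at least min_len printable ASCII bytes found in buf."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(buf)]
```

Fed the raw contents of an unencrypted swap area, this is exactly the kind of pass that turns up old passwords and terminal scrollback.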
With per-process encryption, even if root decides to read the swap at some
point (evil admin or an attacker who 0wn3d the box), the leakage is
limited to processes currently running.
Yoav
By author: "David Schwartz"
In newsgroup: linux.dev.kernel
>
> I pointed out to them that any software mechanism I devised for shutting
> the system down would require that they had control over the system in order
> to invoke the mechanism.
>
> They thought about that for a moment and were about to find that the system
> did not meet the requirements. I pointed out that anyone could pull the plug
> or network cable if needed or shut the system down at the switch and that
> this could be accomplished even if they lost control over the system and
> would certainly stop it from sending any information. They then agreed that
> the system met that requirement.
>
"Sometimes it's possible to do in hardware what's impossible to do in
software." Physical access is a powerful discriminator :)
-hpa
--
"Unix gives you enough rope to shoot yourself in the foot."
Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64
On Wed, May 14, 2003 at 06:34:30AM -0400, Ahmed Masud wrote:
>
> Level of security is a matter of trust. Should the kernel trust a
> distribution provider? No, that is not a reasonable request, because we do
> not control their environment and evaluation procedures and there are no
> guarantees about what happens between the channel that provides the
> operating system and the time it gets installed on a system.
I don't understand why people are willing to base security arguments
on some sort of bizarre adversarial relationship between the kernel and
the system tools.
No Unix (even a "secure" one) is designed to run all security-critical
code in the kernel. That would be a bad design anyway, since it would
run lots of code at an unwarranted privilege level. "login" is not
part of the kernel. "su" is not part of the kernel. The boot loader
is not part of the kernel. And so on.
There is no issue of "trust" between the kernel and the distribution
provider. The distribution provider provides a system, which (like all
Unix-derived systems) is modular and thus has multiple independent
components with security functions. The sum of those parts is what you
should evaluate for security. Yes, the system should include proper
isolation mechanisms to prevent improper privilege escalations. But it
doesn't make sense to even think about what the kernel should do when
the untrusted distribution provides a malicious "/sbin/init".
miket
On Wed, 14 May 2003, Mike Touloumtzis wrote:
> On Wed, May 14, 2003 at 06:34:30AM -0400, Ahmed Masud wrote:
> >
> > Level of security is a matter of trust. Should the kernel trust a
> > distribution provider? No, that is not a reasonable request, because we do
> > not control their environment and evaluation procedures and there are no
> > guarantees about what happens between the channel that provides the
> > operating system and the time it gets installed on a system.
>
> I don't understand why people are willing to base security arguments
> on some sort of bizarre adversarial relationship between the kernel and
> the system tools.
>
> No Unix (even a "secure" one) is designed to run all security-critical
> code in the kernel. That would be a bad design anyway, since it would
> run lots of code at an unwarranted privilege level. "login" is not
> part of the kernel. "su" is not part of the kernel. The boot loader
> is not part of the kernel. And so on.
>
> There is no issue of "trust" between the kernel and the distribution
> provider. The distribution provider provides a system, which (like all
> Unix-derived systems) is modular and thus has multiple independent
> components with security functions. The sum of those parts is what you
> should evaluate for security. Yes, the system should include proper
> isolation mechanisms to prevent improper privilege escalations. But it
> doesn't make sense to even think about what the kernel should do when
> the untrusted distribution provides a malicious "/sbin/init".
Not even malicious. For years, it was accepted that if you had
physical possession of a computing system, you could do anything
with it that it was capable of.
Not so, with the latest Red Hat distribution (9). You can no longer
set init=/bin/bash at the boot prompt.... well you can set it, but
then you get an error about killing init. This caused a neighbor
a lot of trouble when she accidentally put a blank line in the
top of /etc/passwd. Nobody could log-in. I promised to show her
how to "break in", but I wasn't able to. I had to take her hard-disk
to my house, mount it, and fix the password file. All these "attempts"
at so-called security do is make customers pissed.
Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.
> Not so, with the latest Red Hat distribution (9). You can no longer
> set init=/bin/bash at the boot prompt.... well you can set it, but
> then you get an error about killing init. This caused a neighbor
> a lot of trouble when she accidentally put a blank line in the
> top of /etc/passwd. Nobody could log-in. I promised to show her
> how to "break in", but I wasn't able to. I had to take her hard-disk
> to my house, mount it, and fix the password file. All these "attempts"
> at so-called security do is make customers pissed.
>
1. Insert Live-System CD (Knoppix for example)
2. Boot from it.
3. Mount rootfs.
4. Fix things.
5. Remove CD and reboot.
On Thu, 15 May 2003, Yoav Weiss wrote:
> > Not so, with the latest Red Hat distribution (9). You can no longer
> > set init=/bin/bash at the boot prompt.... well you can set it, but
> > then you get an error about killing init. This caused a neighbor
> > a lot of trouble when she accidentally put a blank line in the
> > top of /etc/passwd. Nobody could log-in. I promised to show her
> > how to "break in", but I wasn't able to. I had to take her hard-disk
> > to my house, mount it, and fix the password file. All these "attempts"
> > at so-called security do is make customers pissed.
> >
>
> 1. Insert Live-System CD (Knoppix for example)
> 2. Boot from it.
> 3. Mount rootfs.
> 4. Fix things.
> 5. Remove CD and reboot.
>
Not so easy. Many people have drivers that must be installed using
initrd (SCSI, Firewire, etc.) before their root file-systems are
accessible.
This means that there isn't a general-purpose tool that you can
take with you (except another PC that you can use to mount the
locked disk). NotGood(tm). Sun allows their install CD-ROM to
be used for repair. Other vendors did this also. New "security"
out of Red Hat seems to prevent this. ALT-F2, etc., used to bring
up other virtual terminals. They don't anymore. I think every
OS vendor should be required to spend a few weeks in the field,
preferably in Afghanistan <grin>, before they even consider mucking
with accepted principles.
Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.
Chris Siebenmann wrote:
> | The people who used to require that still have lists of approved
> | operating systems. Linux is not on that list.
>
> To be blunt: and?
> Linux is not and never will be all things to all people. (Some of that
> can be seen in the ongoing discussions over, eg, LSM.)
> There are many features that have far too small a target market to be
> of interest in mainline Unix.
Large organizations like to standardize things wherever possible.
Given an OS that is well suited for a fairly secure environment
that also runs widely-available office software, they will adopt it
for both uses, thus locking out other operating systems.
Many of these decisions are made by Dumb White Guys sitting around
a boardroom table looking at feature lists, and pushed by even dumber
3-letter consulting firms whose technical representatives say things
like "Yes, we will be decrypting all SSL sessions at the firewall to
check for viruses."
So I think Linux needs these 'fringe' features if it's going to
continue to expand its user base in the face of such stupidity.
> Chris Siebenmann wrote:
> Many of these decisions are made by Dumb White Guys sitting around
> a boardroom table looking at feature lists, and pushed by even dumber
> 3-letter consulting firms whose technical representatives say things
> like "Yes, we will be decrypting all SSL sessions at the firewall to
> check for viruses."
> So I think Linux needs these 'fringe' features if it's going to
> continue to expand its user base in the face of such stupidity.
I, for one, completely disagree in the strongest way possible. This whole
argument style rings entirely hollow with me. I'd much rather say, "We don't
do that because it's stupid. We will gladly explain to you why we think it's
stupid, what you really want, and how to get that from us."
Deliberately designing in misfeatures so that dumb people will get what
they think they want is architectural suicide. I hope Linux never moves in
that direction.
It's better to refuse to do things until and unless you're totally
convinced they are the right thing to do. That way, people know that
anything you've done is right.
There will always be some friction between those who want to see Linux run
by as many people and machines as possible and those who think various other
things are more important. I'm squarely in the "make the best possible OS"
camp. If people don't want to run it, that's their loss.
DS
On Wed, 14 May 2003, Mike Touloumtzis wrote:
> On Wed, May 14, 2003 at 06:34:30AM -0400, Ahmed Masud wrote:
> >
> > Level of security is a matter of trust. Should the kernel trust a
> > distribution provider? No, that is not a reasonable request, because we do
> > not control their environment and evaluation procedures and there are no
> > guarantees between the channel that provides the operating system to the
> > time it gets installed on a system.
>
> I don't understand why people are willing to base security arguments
> on some sort of bizarre adversarial relationship between the kernel and
> the system tools.
>
> No Unix (even a "secure" one) is designed to run all security-critical
> code in the kernel. That would be a bad design anyway, since it would
> run lots of code at an unwarranted privilege level. "login" is not
> part of the kernel. "su" is not part of the kernel. The boot loader
> is not part of the kernel. And so on.
There is no relationship between what I said above and what you are
saying here. When I state that you cannot trust the operating environment's
user-space applications, I mean exactly that. It has nothing to do with
running su or login as part of the kernel. On the contrary, I am
actually saying they shouldn't even run with full root privileges on a
system (ref. Linux capabilities interface).
>
> There is no issue of "trust" between the kernel and the distribution
> provider. The distribution provider provides a system, which (like all
> Unix-derived systems) is modular and thus has multiple independent
Well, of course there is a degree of trust as soon as you accept a
particular vendor's distribution. What I am suggesting is a mechanism to
throttle that trust according to the sensitivity of the application, and
the one place you can do that effectively is the kernel.
> components with security functions. The sum of those parts is what you
> should evaluate for security. Yes, the system should include proper
> isolation mechanisms to prevent improper privilege escalations. But it
> doesn't make sense to even think about what the kernel should do when
> the untrusted distribution provides a malicious "/sbin/init".
We don't want to deal only with "improper" escalations; rather, we want to
deal with real-time decisions on what may be the "proper" escalation or
de-escalation along the timeline of a running system.
I am not sure whether what I am attempting to say is clear from the above,
but I am not suggesting moving anything from userspace into kernel space. I
am just suggesting that the kernel should provide non-binary throttling of
the level of trust that may be placed in the various components that
interact with it.
Ahmed.
On Wed, 14 May 2003 21:59:47 +0300, Yoav Weiss wrote:
> On Wed, 14 May 2003, Jörn Engel wrote:
> > On Wed, 14 May 2003 12:13:03 -0400, Ahmed Masud wrote:
> > >
> > > The idea is to have encryption keys for the pages to be unique on a
> > > per-uid per-process basis. So one user on the system cannot access (even
> > > if they are root) parts of another's private data. To achieve this,
> > > different parts of swap device need to be encrypted with different keys.
> >
> > How do users *know* that root cannot simply bypass this security?
> >
> > Root, god, what's the difference? ;-)
>
> Aside from what Ahmed said about rootless systems, the per-process
> encryption reduces the window of opportunity for attackers who gain root
> (or physical access).
>
> Try strings(1) on your swap device. You'll be surprised at what you find.
> You'll probably recognize passwords you haven't used for a long time, and
> a lot of other stuff you didn't expect. Sometimes you can find whole ssh
> sessions there, plaintext. (think xterm scroll buffer).
>
> With per-process encryption, even if root decides to read the swap at some
> point (evil admin or an attacker who 0wn3d the box), the leakage is
> limited to processes currently running.
s/currently running/running now or in the future/
But apart from that, it does really reduce the window, agreed.
An alternative approach would simply zero all freed memory in the
system, with almost identical effects. "Almost" because this approach
misses memory (which isn't cleared on reboot on all systems, ...) and
misses hard disk recovery that can read data already overwritten.
Arguments against this simpler approach?
Jörn
--
Rules of Optimization:
Rule 1: Don't do it.
Rule 2 (for experts only): Don't do it yet.
-- M.A. Jackson
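The strings(1) point quoted above can be illustrated with a minimal sketch. It scans a byte buffer for runs of printable ASCII, which is roughly what strings(1) does by default. The function name `printable_runs` and the sample buffer are made up for illustration; reading a real swap device would need root and is not attempted here.

```python
import re

def printable_runs(data: bytes, min_len: int = 4):
    """Return runs of at least min_len printable ASCII bytes.

    This approximates the default behaviour of strings(1); it is not
    a reimplementation of that tool.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

# Hypothetical stand-in for a page that was swapped out and never wiped.
page = b"\x00\x01secret-passw0rd\x00\xff\x10ls -la\x00ab\x00"
found = printable_runs(page)
print(found)
```

Anything that was plaintext in memory when the page was written out shows up in such a scan, which is why stale swap contents are a concern at all.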
David Schwartz wrote:
>> So I think Linux needs these 'fringe' features if it's going to
>> continue to expand its user base in the face of such stupidity.
>
> I, for one, completely disagree in the strongest way possible. This whole
> argument style rings entirely hollow with me. I'd much rather say, "We don't
> do that because it's stupid. We will gladly explain to you why we think it's
> stupid, what you really want, and how to get that from us."
>
> Deliberately designing in misfeatures so that dumb people will get what
> they think they want is architectural suicide. I hope Linux never moves in
> that direction.
Don't get me wrong -- I don't think high-security options are misfeatures.
I'm just trying to say that such options, even if only rarely used,
are critical to gaining wide acceptance. Just because dumb people require
them on their standard OS doesn't mean the features themselves are stupid...
> s/currently running/running now or in the future/
Hopefully near future only, assuming you have a proper log/alert system
that sends logs to another machine.
(and if you don't and your box stays owned, you have bigger problems
than the swap sniffing).
>
> But apart from that, it does really reduce the window, agreed.
>
> An alternative approach would simply zero all freed memory in the
> system, with almost identical effects. Almost means you are missing
> memory (that isn't cleared on reboot on all systems, ...) and this is
> missing hard disk recovery that can read data already overwritten.
>
> Arguments against this simpler approach?
The performance impact, for one. My systems often have processes taking
hundreds of megabytes in swap. If we had to wipe all this space on the
disk whenever such a process dies, the system would be unusable.
Second, see previous posts on this thread re hardware issues when writing
zero blocks to some disks. In short, you're never sure it's really clean.
Third, in case of a crash (i.e. when someone pulls the plug and steals the
server), the system has no chance to clean the swap.
Encrypted swap solves all that.
>
> Jörn
>
> --
> Rules of Optimization:
> Rule 1: Don't do it.
> Rule 2 (for experts only): Don't do it yet.
> -- M.A. Jackson
>
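The crash argument above rests on the keys living only in RAM. Below is a toy model of that idea, assuming one random key per (uid, pid) pair; the class `SwapKeyring` and its methods are hypothetical names for illustration and do not correspond to any kernel interface or existing patch.

```python
import os

class SwapKeyring:
    """Toy model: one random key per (uid, pid), held only in memory."""

    def __init__(self):
        self._keys = {}  # (uid, pid) -> 32-byte key, never written to disk

    def key_for(self, uid: int, pid: int) -> bytes:
        ident = (uid, pid)
        if ident not in self._keys:
            self._keys[ident] = os.urandom(32)
        return self._keys[ident]

    def process_exited(self, uid: int, pid: int) -> None:
        # Dropping the key makes that process's swapped pages unreadable
        # without having to overwrite them on disk.
        self._keys.pop((uid, pid), None)

ring = SwapKeyring()
k1 = ring.key_for(1000, 42)
k2 = ring.key_for(1000, 43)
ring.process_exited(1000, 42)
k1_again = ring.key_for(1000, 42)  # a reused pid gets a fresh key
```

In this model, process exit or power loss discards the key instead of wiping the disk, so the cost does not grow with the amount of swapped data.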
On Wednesday 14 May 2003 16:32, Richard B. Johnson wrote:
> On Wed, 14 May 2003, Mike Touloumtzis wrote:
> > On Wed, May 14, 2003 at 06:34:30AM -0400, Ahmed Masud wrote:
> > > Level of security is a matter of trust. Should the kernel trust a
> > > distribution provider? No, that is not a reasonable request, because we
> > > do not control their environment and evaluation procedures and there
> > > are no guarantees between the channel that provides the operating
> > > system to the time it gets installed on a system.
> >
> > I don't understand why people are willing to base security arguments
> > on some sort of bizarre adversarial relationship between the kernel and
> > the system tools.
> >
> > No Unix (even a "secure" one) is designed to run all security-critical
> > code in the kernel. That would be a bad design anyway, since it would
> > run lots of code at an unwarranted privilege level. "login" is not
> > part of the kernel. "su" is not part of the kernel. The boot loader
> > is not part of the kernel. And so on.
> >
> > There is no issue of "trust" between the kernel and the distribution
> > provider. The distribution provider provides a system, which (like all
> > Unix-derived systems) is modular and thus has multiple independent
> > components with security functions. The sum of those parts is what you
> > should evaluate for security. Yes, the system should include proper
> > isolation mechanisms to prevent improper privilege escalations. But it
> > doesn't make sense to even think about what the kernel should do when
> > the untrusted distribution provides a malicious "/sbin/init".
>
> Not even malicious. For years, it was accepted that if you had
> physical possession of a computing system, you could do anything
> with it that it was capable of.
>
> Not so, with the latest Red Hat distribution (9). You can no longer
> set init=/bin/bash at the boot prompt.... well you can set it, but
> then you get an error about killing init. This caused a neighbor
> a lot of trouble when she accidentally put a blank line in the
> top of /etc/passwd. Nobody could log-in. I promised to show her
> how to "break in", but I wasn't able to. I had to take her hard-disk
> to my house, mount it, and fix the password file. All these "attempts"
> at so-called security do is make customers pissed.
I fix those errors by booting the Slackware CD with the live
filesystem...
No dependencies on any of the regular disks - then I can fix anything within
reason (haven't tried md raids though).
On Thu, 15 May 2003, Jesse Pollard wrote:
> On Wednesday 14 May 2003 16:32, Richard B. Johnson wrote:
> > Not so, with the latest Red Hat distribution (9). You can no longer
> > set init=/bin/bash at the boot prompt.... well you can set it, but
> > then you get an error about killing init. This caused a neighbor
> > a lot of trouble when she accidentally put a blank line in the
> > top of /etc/passwd. Nobody could log-in. I promised to show her
> > how to "break in", but I wasn't able to. I had to take her hard-disk
> > to my house, mount it, and fix the password file. All these "attempts"
> > at so-called security do is make customers pissed.
>
> I fix those errors by booting the Slackware CD with the live
> filesystem...
>
> No dependencies on any of the regular disks - then I can fix anything within
> reason (haven't tried md raids though).
You don't have to do that. Richard is mis-informed. Any of the following
still work on Red Hat Linux 9:
init=/bin/bash # drops you straight to a bash shell
init 1 # runs runlevel 1 SysV init scripts and rc.sysinit
init single # runs rc.sysinit, but not runlevel 1
init emergency # runs a shell
all without going to rescue media.
later,
chris
On Thu, 15 May 2003, Richard B. Johnson wrote:
> > You don't have to do that. Richard is mis-informed. Any of the following
> > still work on Red Hat Linux 9:
> >
> > init=/bin/bash # drops you straight to a bash shell
> > init 1 # runs runlevel 1 SysV init scripts and rc.sysinit
> > init single # runs rc.sysinit, but not runlevel 1
> > init emergency # runs a shell
> >
> > all without going to rescue media.
> >
>
> Bullshit. Try it.
I just did. It works.
If it's not working for you, it's because *you* did something (like, say,
password your boot loader). By default, it's still possible on Red Hat Linux
9 out of the box.
later,
chris
On Thu, 15 May 2003, Chris Ricker wrote:
> On Thu, 15 May 2003, Jesse Pollard wrote:
>
> > On Wednesday 14 May 2003 16:32, Richard B. Johnson wrote:
> > > Not so, with the latest Red Hat distribution (9). You can no longer
> > > set init=/bin/bash at the boot prompt.... well you can set it, but
> > > then you get an error about killing init. This caused a neighbor
> > > a lot of trouble when she accidentally put a blank line in the
> > > top of /etc/passwd. Nobody could log-in. I promised to show her
> > > how to "break in", but I wasn't able to. I had to take her hard-disk
> > > to my house, mount it, and fix the password file. All these "attempts"
> > > at so-called security do is make customers pissed.
> >
> > I fix those errors by booting the Slackware CD with the live
> > filesystem...
> >
> > No dependencies on any of the regular disks - then I can fix anything within
> > reason (haven't tried md raids though).
>
> You don't have to do that. Richard is mis-informed. Any of the following
> still work on Red Hat Linux 9:
>
> init=/bin/bash # drops you straight to a bash shell
> init 1 # runs runlevel 1 SysV init scripts and rc.sysinit
> init single # runs rc.sysinit, but not runlevel 1
> init emergency # runs a shell
>
> all without going to rescue media.
>
Bullshit. Try it. So-called Linux 9/Professional did **NOT** allow
anybody to break in. Maybe you didn't try it, or maybe it wasn't
tested on a machine that uses initrd to make the hard disk accessible,
but it absolutely, positively fails to run any 'init' when the LILO boot
is interrupted and we enter, in addition to the linux OS label, the
parameter init=/bin/bash. Over two days and probably nearly a hundred
boots, we even tried various things like init=/bin/csh and various other
possible shells, plus init=/sbin/init 1, etc. Any time 'init' was
defined on the boot command-line, the machine would panic with
'attempting to kill init'.
Also, when booting from the distribution media, the ALT-F2...F5 keys
no longer function, so you can't access a shell. Try it before you
claim I'm "mis-informed". I wasted an entire weekend until I took
the damn hard disk out, brought it home, and "fixed" it.
Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.
On Thu, 15 May 2003, Chris Ricker wrote:
> On Thu, 15 May 2003, Richard B. Johnson wrote:
>
> > > You don't have to do that. Richard is mis-informed. Any of the following
> > > still work on Red Hat Linux 9:
> > >
> > > init=/bin/bash # drops you straight to a bash shell
> > > init 1 # runs runlevel 1 SysV init scripts and rc.sysinit
> > > init single # runs rc.sysinit, but not runlevel 1
> > > init emergency # runs a shell
> > >
> > > all without going to rescue media.
> > >
> >
> > Bullshit. Try it.
>
> I just did. It works.
>
> If it's not working for you, it's because *you* did something (like, say,
> password your boot loader). By default, it's still possible on Red Hat Linux
> 9 out of the box.
>
> later,
> chris
Still bullshit. I did nothing except to try to help a neighbor
who got locked out of her machine. I spent most of Saturday
and all of Sunday trying to break in. The LILO command prompt
readily took "parameters". However, any parameter passed on
the command-line resulted in a try-to-kill init error. Her
machine uses an Adaptec SCSI controller which needs to be
loaded via initrd to make the root file-system available.
I tried Red-Hat 8.0 here at work. It works as you described.
There is no problem with it and LILO and initrd. However
Red Hat 9/Professional does not allow break-in, at least on
the machine tested...and I have several pissed off witnesses.
Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.
On Thu, 15 May 2003, Richard B. Johnson wrote:
> On Thu, 15 May 2003, Chris Ricker wrote:
>
> > On Thu, 15 May 2003, Richard B. Johnson wrote:
> >
> > > > You don't have to do that. Richard is mis-informed. Any of the following
> > > > still work on Red Hat Linux 9:
> > > >
> > > > init=/bin/bash # drops you straight to a bash shell
> > > > init 1 # runs runlevel 1 SysV init scripts and rc.sysinit
> > > > init single # runs rc.sysinit, but not runlevel 1
> > > > init emergency # runs a shell
> > > >
> > > > all without going to rescue media.
> > > >
> > >
>
> Still bullshit. I did nothing except to try to help a neighbor
> who got locked out of her machine. I spent most of Saturday
> and all of Sunday trying to break in. The LILO command prompt
> readily took "parameters". However, any parameter passed on
> the command-line resulted in a try-to-kill init error. Her
> machine uses an Adaptec SCSI controller which needs to be
> loaded via initrd to make the root file-system available.
>
> I tried Red-Hat 8.0 here at work. It works as you described.
> There is no problem with it and LILO and initrd. However
> Red Hat 9/Professional does not allow break-in, at least on
> the machine tested...and I have several pissed off witnesses.
>
This is an issue to be taken up directly with the manufacturer and is a
bit off topic here. Red Hat 9 Professional can conceivably disallow
parameter passing. The fact that a system does not, out of the box, allow
for break-ins is a good thing(tm) for the most part... manually editing
password files is a bad thing(tm), which is why command line tools such as
useradd are provided.
Anyhow, I hope you got it all fixed.
Cheers,
Ahmed.
Mike Touloumtzis wrote:
> No Unix (even a "secure" one) is designed to run all security-critical
> code in the kernel. That would be a bad design anyway, since it would
> run lots of code at an unwarranted privilege level. "login" is not
> part of the kernel. "su" is not part of the kernel.
Yes, but "elsewhere" I can audit the system and see which programs
and subsystems are authorized to logon users and the authentication
methods they can use. Note that the below output is not from some
"security enhanced" or "server" version of the OS but rather from
a $179 upgrade I bought at the local Staples store:
2003-05-16 05:06:25 Security Success Audit System Event 515 NT AUTHORITY\SYSTEM
"A trusted logon process has registered with the Local Security Authority.
This logon process will be trusted to submit logon requests.
Logon Process Name: Protected Storage Service "
2003-05-16 05:06:24 Security Success Audit System Event 515 NT AUTHORITY\SYSTEM
"A trusted logon process has registered with the Local Security Authority.
This logon process will be trusted to submit logon requests.
Logon Process Name: LAN Manager Workstation Service "
2003-05-16 05:06:17 Security Success Audit System Event 518 NT AUTHORITY\SYSTEM
"An notification package has been loaded by the Security Account Manager.
This package will be notified of any account or password changes.
Notification Package Name: scecli "
2003-05-16 05:06:17 Security Success Audit System Event 515 NT AUTHORITY\SYSTEM
"A trusted logon process has registered with the Local Security Authority.
This logon process will be trusted to submit logon requests.
Logon Process Name: Service Control Manager "
2003-05-16 05:06:17 Security Success Audit System Event 515 NT AUTHORITY\SYSTEM
"A trusted logon process has registered with the Local Security Authority.
This logon process will be trusted to submit logon requests.
Logon Process Name: Winlogon\MSGina "
2003-05-16 05:06:17 Security Success Audit System Event 515 NT AUTHORITY\SYSTEM
"A trusted logon process has registered with the Local Security Authority.
This logon process will be trusted to submit logon requests.
Logon Process Name: KSecDD "
2003-05-16 05:06:17 Security Success Audit System Event 514 NT AUTHORITY\SYSTEM
"An authentication package has been loaded by the Local Security Authority.
This authentication package will be used to authenticate logon attempts.
Authentication Package Name: D:\WINNT\system32\msv1_0.dll : MICROSOFT_AUTHENTICATION_PACKAGE_V1_0 "
2003-05-16 05:06:17 Security Success Audit System Event 514 NT AUTHORITY\SYSTEM
"An authentication package has been loaded by the Local Security Authority.
This authentication package will be used to authenticate logon attempts.
Authentication Package Name: D:\WINNT\system32\schannel.dll : Microsoft Unified Security Protocol Provider "
2003-05-16 05:06:17 Security Success Audit System Event 514 NT AUTHORITY\SYSTEM
"An authentication package has been loaded by the Local Security Authority.
This authentication package will be used to authenticate logon attempts.
Authentication Package Name: D:\WINNT\system32\msv1_0.dll : NTLM "
2003-05-16 05:06:17 Security Success Audit System Event 514 NT AUTHORITY\SYSTEM
"An authentication package has been loaded by the Local Security Authority.
This authentication package will be used to authenticate logon attempts.
Authentication Package Name: D:\WINNT\system32\kerberos.dll : Kerberos "
2003-05-16 05:06:17 Security Success Audit System Event 514 NT AUTHORITY\SYSTEM
"An authentication package has been loaded by the Local Security Authority.
This authentication package will be used to authenticate logon attempts.
Authentication Package Name: D:\WINNT\system32\LSASRV.dll : Negotiate "
Hi.
On Wed, 2003-05-14 at 11:58, Yoav Weiss wrote:
> Actually, I forgot that swsusp is now included. I haven't tried it in a
> while. Does anyone know if it's stable enough to start playing with encrypting
> it ?
Sorry for the slow response - I guess Pavel didn't notice your question
either. In its current form in the 2.5 kernel, swsusp is stable enough
to try encrypting the data. However you might want to wait as the 2.4
version is nearly at its 1.0 release, and the plan is for me to then
start submitting a whole swag of patches that will make the code much
more feature complete. The 2.4 code includes support for compressing the
image; I guess we'd want to hook encryption in at the same point (it
will use BIO calls, not the swap read/write routines).
Regards,
Nigel
--
Nigel Cunningham
495 St Georges Road South, Hastings 4201, New Zealand
You see, at just the right time, when we were still powerless,
Christ died for the ungodly.
-- Romans 5:6, NIV.
On Fri, 13 Jun 2003, Nigel Cunningham wrote:
> Hi.
>
> On Wed, 2003-05-14 at 11:58, Yoav Weiss wrote:
> > Actually, I forgot that swsusp is now included. I haven't tried it in a
> > while. Does anyone know if it's stable enough to start playing with encrypting
> > it ?
>
> Sorry for the slow response - I guess Pavel didn't notice your question
> either. In its current form in the 2.5 kernel, swsusp is stable enough
> to try encrypting the data. However you might want to wait as the 2.4
> version is nearly at its 1.0 release, and the plan is for me to then
> start submitting a whole swag of patches that will make the code much
> more feature complete. The 2.4 code includes support for compressing the
> image; I guess we'd want to hook encryption in at the same point (it
> will use BIO calls, not the swap read/write routines).
>
Sounds great. I'll wait until the patches are submitted before
introducing any changes. And the compression hooks sound like the right
place to add encryption. Just be sure to encrypt AFTER compression and
not vice versa.
Yoav Weiss
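A minimal sketch of why the order matters: good ciphertext looks random, so compressing it afterwards gains nothing, while compressing first still shrinks the image. The cipher here is a toy SHA-256 counter keystream written only for this illustration (it is NOT secure and is not what swsusp or any kernel code uses); the key and the all-zero "image" are made-up stand-ins.

```python
import hashlib
import zlib

def toy_stream_encrypt(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream. Illustration only, NOT secure.

    XOR is symmetric, so calling this again with the same key decrypts.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)

image = b"\x00" * (4096 * 16)   # hypothetical suspend image full of zero pages
key = b"hypothetical-key"

compress_then_encrypt = toy_stream_encrypt(key, zlib.compress(image))
encrypt_then_compress = zlib.compress(toy_stream_encrypt(key, image))

print(len(image), len(compress_then_encrypt), len(encrypt_then_compress))
```

With the order reversed, the compression step still runs but saves essentially nothing, so the image written to disk stays about as large as the uncompressed one.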