2019-10-25 09:38:31

by Paul Menzel

Subject: File system for scratch space (in HPC cluster)

Dear Linux folks,


In our cluster, we offer scratch space for temporary files. As
these files are temporary, we do not need any durability
guarantees – especially not across a system crash or
shutdown. So, for example, no `sync` is needed.

Are there file systems catering to this need? I couldn’t find
any. Maybe I missed some options for existing file systems.


Kind regards,

Paul



2019-10-25 17:58:02

by Theodore Ts'o

Subject: Re: File system for scratch space (in HPC cluster)

On Thu, Oct 24, 2019 at 12:43:40PM +0200, Paul Menzel wrote:
>
> In our cluster, we offer scratch space for temporary files. As
> these files are temporary, we do not need any durability
> guarantees – especially not across a system crash or
> shutdown. So, for example, no `sync` is needed.
>
> Are there file systems catering to this need? I couldn’t find
> any. Maybe I missed some options for existing file systems.

You could use ext4 in nojournal mode. If you want to make sure that
fsync() doesn't force a cache flush, you can mount with the nobarrier
mount option.
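
For illustration only, a rough sketch of how such a mount could be
done from a small helper program (device and mount point are made-up
names; the file system would have been created without a journal
beforehand, e.g. mkfs.ext4 -O ^has_journal):

/* Sketch only: mount an ext4 scratch partition with write barriers
 * disabled, so fsync() does not force a device cache flush.
 * /dev/sdb1 and /scratch are placeholder names; needs root. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	if (mount("/dev/sdb1", "/scratch", "ext4", 0, "nobarrier") != 0) {
		perror("mount");
		return 1;
	}
	return 0;
}

From a script this is of course just `mount -o nobarrier /dev/sdb1
/scratch`.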

- Ted

2019-10-25 17:58:02

by Boaz Harrosh

Subject: Re: File system for scratch space (in HPC cluster)

On 24/10/2019 17:55, Theodore Y. Ts'o wrote:
> On Thu, Oct 24, 2019 at 12:43:40PM +0200, Paul Menzel wrote:
>>
>> In our cluster, we offer scratch space for temporary files. As
>> these files are temporary, we do not need any durability
>> guarantees – especially not across a system crash or
>> shutdown. So, for example, no `sync` is needed.
>>
>> Are there file systems catering to this need? I couldn’t find
>> any. Maybe I missed some options for existing file systems.
>
> You could use ext4 in nojournal mode. If you want to make sure that
> fsync() doesn't force a cache flush, you can mount with the nobarrier
> mount option.
>

And open the file with O_TMPFILE|O_EXCL so that there is no metadata either.

I think xfs with O_TMPFILE|O_EXCL does not do any fsync, but I'm
not sure.
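
Roughly something like this (untested sketch; /scratch is just an
example mount point):

/* Sketch: create an unnamed temporary file on the scratch file
 * system. O_TMPFILE takes a directory, not a file name; O_EXCL
 * additionally forbids ever linking the file into the namespace. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/scratch", O_TMPFILE | O_EXCL | O_RDWR, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* ... write scratch data through fd ... */
	close(fd);	/* space is released on the last close */
	return 0;
}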

> - Ted
>

2019-10-25 18:49:50

by Andreas Dilger

Subject: Re: File system for scratch space (in HPC cluster)

On Oct 24, 2019, at 4:43 AM, Paul Menzel <[email protected]> wrote:
>
> Dear Linux folks,
>
>
> In our cluster, we offer scratch space for temporary files. As
> these files are temporary, we do not need any durability
> guarantees – especially not across a system crash or
> shutdown. So, for example, no `sync` is needed.
>
> Are there file systems catering to this need? I couldn’t find
> any. Maybe I missed some options for existing file systems.

How big do you need the scratch filesystem to be? Is it local
to the node or does it need to be shared between nodes? If it
needs to be large and shared between nodes then Lustre is typically
used for this. If it is local and relatively small you could
consider using tmpfs backed by swap on an NVMe flash device
(M.2 or U.2, Optane if you can afford it) inside the node.

That way you get RAM-like performance for many files, with a
larger capacity than RAM when needed (tmpfs can use swap).

You might consider mounting a new tmpfs filesystem per job (no
formatting is needed for tmpfs), and then unmounting it when the job
is done, so that the old files are automatically cleaned up.
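
As a sketch only (mount point and size are made-up; a job
prolog/epilog script calling mount(8)/umount(8) would do the same):

/* Sketch: per-job tmpfs, mounted before the job starts and torn
 * down afterwards. Path and size are placeholders; needs root. */
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>

int main(void)
{
	mkdir("/scratch/job-1234", 0700);

	/* Fresh tmpfs for this job, backed by RAM plus swap. */
	if (mount("tmpfs", "/scratch/job-1234", "tmpfs", 0,
		  "size=64g,mode=0700") != 0) {
		perror("mount");
		return 1;
	}

	/* ... job runs and fills /scratch/job-1234 ... */

	/* Unmounting throws away everything the job left behind. */
	if (umount("/scratch/job-1234") != 0)
		perror("umount");
	return 0;
}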

Cheers, Andreas


2019-10-25 19:05:48

by Theodore Ts'o

Subject: Re: File system for scratch space (in HPC cluster)

On Thu, Oct 24, 2019 at 06:01:05PM +0300, Boaz Harrosh wrote:
> > You could use ext4 in nojournal mode. If you want to make sure that
> > fsync() doesn't force a cache flush, you can mount with the nobarrier
> > mount option.
>
> And open the file with O_TMPFILE|O_EXCL so that there is no metadata either.

O_TMPFILE means that there is no directory entry created. The
pathname passed to the open system call is the directory specifying
the file system where the temporary file will be created.

This may or may not be what the original poster wanted, depending on
whether by "scratch file" he meant a file which could be opened by
pathname by another, subsequent process or not.
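
If a visible name is wanted, a file opened with O_TMPFILE (without
O_EXCL) can be given one later; roughly (untested sketch, paths are
just examples):

/* Sketch: give an O_TMPFILE file a name afterwards, so a later
 * process can open it by pathname. Works only without O_EXCL. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char proc_path[64];
	int fd = open("/scratch", O_TMPFILE | O_RDWR, 0600);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Link the still-unnamed file into the directory tree. */
	snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
	if (linkat(AT_FDCWD, proc_path, AT_FDCWD, "/scratch/result.dat",
		   AT_SYMLINK_FOLLOW) != 0)
		perror("linkat");
	close(fd);
	return 0;
}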

- Ted

2019-10-25 19:30:32

by Paul Menzel

Subject: Re: File system for scratch space (in HPC cluster)

Dear Andreas,


On 2019-10-24 19:51, Andreas Dilger wrote:
> On Oct 24, 2019, at 4:43 AM, Paul Menzel <[email protected]>
> wrote:

>> In our cluster, we offer scratch space for temporary files. As
>> these files are temporary, we do not need any durability
>> guarantees – especially not across a system crash or
>> shutdown. So, for example, no `sync` is needed.
>>
>> Are there file systems catering to this need? I couldn’t find
>> any. Maybe I missed some options for existing file systems.
>
> How big do you need the scratch filesystem to be? Is it local to
> the node or does it need to be shared between nodes?

In this case local.

> If it needs to be large and shared between nodes then Lustre is
> typically used for this. If it is local and relatively small you
> could consider using tmpfs backed by swap on an NVMe flash device
> (M.2 or U.2, Optane if you can afford it) inside the node.
>
> That way you get RAM-like performance for many files, with a larger
> capacity than RAM when needed (tmpfs can use swap).
>
> You might consider mounting a new tmpfs filesystem per job (no
> formatting is needed for tmpfs), and then unmounting it when the job
> is done, so that the old files are automatically cleaned up.

That is a good idea, but probably not practical for 10 TB. Out of
curiosity, what is the limit for “relatively small” in your
experience?


Kind regards,

Paul



2019-10-25 22:16:59

by Paul Menzel

Subject: Re: File system for scratch space (in HPC cluster)

Dear Boaz, dear Theodore,


Thank you for your replies.

On 2019-10-24 22:34, Theodore Y. Ts'o wrote:
> On Thu, Oct 24, 2019 at 06:01:05PM +0300, Boaz Harrosh wrote:
>>> You could use ext4 in nojournal mode. If you want to make sure that
>>> fsync() doesn't force a cache flush, you can mount with the nobarrier
>>> mount option.

Yeah, those are the settings we currently use.

>> And open the file with O_TMPFILE|O_EXCL so that there is no metadata either.
>
> O_TMPFILE means that there is no directory entry created. The
> pathname passed to the open system call is the directory specifying
> the file system where the temporary file will be created.

Interesting.

The main problem is that we can’t control what the users submit to the
cluster, so a mount option is needed.

> This may or may not be what the original poster wanted, depending on
> whether by "scratch file" he meant a file which could be opened by
> pathname by another, subsequent process or not.

Yeah, the scientists often send scripts where the files are accessed
by subsequent processes.


Kind regards,

Paul

