File systems and storage

Overview

The HPC offers a variety of systems for storage and archiving of data. With a few noted exceptions, all systems below are network-attached, meaning they are accessible from all compute and login nodes.

Data Storage Systems

Below is a summary of the primary locations in the cluster file system that users will interact with.

Active work spaces

The following spaces are suitable for use for active data analysis.

/home/FCAM/<user>

Quota: 25GB. Generally cannot be expanded.

All users will have a home directory when their account is created.

This location is primarily intended for storage of small files such as scripts, source code, configuration files, and software installations. It can also accommodate a small amount of data.

Use other spaces for analysis of large datasets.

Note

Users are not permitted to open permissions on their home directories. If you need to share files with other users, use one of the directories below that you have access to. If you open permissions on your home directory, it may be locked down.
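
A quick way to confirm that your home directory has not been opened up is to check its permission bits. The sketch below assumes GNU stat (as found on most Linux systems). The helper name is_private_dir is hypothetical and is not part of any cluster tooling.

```shell
#!/usr/bin/env bash
# is_private_dir (hypothetical helper): succeed only if the directory grants
# no permissions at all to group or others.
is_private_dir() {
    local dir="$1"
    local mode
    mode=$(stat -c '%a' "$dir") || return 2   # octal mode, e.g. 700 or 755
    # Mask off everything except the group and other bits (octal 077).
    (( (8#$mode & 8#077) == 0 ))
}

# Example:
# is_private_dir "$HOME" && echo "home directory is private"
```

If the check fails, running chmod go-rwx "$HOME" removes group and other access.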

/labs

Quota: 2TB. Can be increased upon request.

Individual users will always be associated with “lab” groups. The PI of a lab group can request a space in /labs.

This location is intended for storage of active shared project data to facilitate collaboration among members of a PI’s lab. You will need to be part of the PI’s permission group to access a lab directory.

/projects

Quota: 2TB. Can be increased upon request.

This can be created at the request of one or more PIs.

This location is intended for storage of active shared project data to facilitate collaboration among multiple groups on a project.

/scratch

Quota: Shared space. No individual or group quotas.

Anyone can create any number of directories here and set permissions to whatever they choose.

This location is intended to serve as temporary storage. It is an ideal place for intermediate files produced during analysis. As of last update, /scratch had 84TB of space. Please be considerate of other users of this shared resource.

Warning

In /scratch unused files are subject to deletion without notice after 90 days.
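
To see which of your files might be at risk, you can list files that have not changed in more than 90 days. This is only a sketch: the policy above does not say whether "unused" is judged by modification or access time, so the use of modification time here is an assumption, and the helper name list_stale_files and the example path are hypothetical.

```shell
#!/usr/bin/env bash
# list_stale_files (hypothetical helper): print regular files under a
# directory that were last modified more than N days ago (default 90).
# Note: find rounds file ages down to whole days before comparing.
list_stale_files() {
    local dir="$1"
    local days="${2:-90}"
    find "$dir" -type f -mtime +"$days" -print
}

# Example (the directory name is an assumption; use your own):
# list_stale_files "/scratch/$USER"
```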

/sandbox

Quota: Shared space. No individual or group quotas.

This system has identical policies to /scratch.

Anyone can create any number of directories here and set permissions to whatever they choose.

This location is intended to serve as temporary storage. It is an ideal place for intermediate files produced during analysis. Please be considerate of other users of this shared resource.

Warning

In /sandbox unused files are subject to deletion without notice after 90 days.

/seqdata

Quota: Shared space. No individual or group quotas. Users do not have write access.

This location is used for storing raw sequencing data and is intended to alleviate strain on quotas in other locations and to discourage data duplication. Data are stored as read-only. You may request that raw data be stored here (all data generated by CGI is stored here by default), but CBC will have to move it there for you, as users do not have write access.

If you would like to move existing raw data to this location or store data from an external sequencing center, please contact us at cbcsupport@helpspotmail.com.

Raw sequence data will be stored here for a period of 2 years. After this time, the data will be moved to /tapearchive/seqdata. See below for more information on /tapearchive.

Do not copy data from this directory. For convenience, consider making a symlink to files stored here. By avoiding unnecessary copying, accidental data corruption or deletion can be avoided and space will be used more efficiently on the cluster and within directories you own.

Tip

Instead of copying data from /seqdata, create a symlink.

A symlink, or symbolic link, is a pointer to a file or directory that behaves like the real thing.

If you want to use data from /seqdata as if it were in one of your project directories, you can create a symlink like this:

ln -s /seqdata/CGI/Fastq_Files/AwesomeWGS_May2025/ /home/FCAM/<user>/AwesomeProject/rawData/

Here, /seqdata/CGI/Fastq_Files/AwesomeWGS_May2025 is the raw data directory and /home/FCAM/<user>/AwesomeProject/rawData/ is the destination. You will then have a symlink, /home/FCAM/<user>/AwesomeProject/rawData/AwesomeWGS_May2025, that behaves just as if the files had been copied, without taking up the extra space.
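
After creating a link, it is worth confirming that it resolves to the intended data. The sketch below wraps the ln -s step and prints where the new link points. It assumes a readlink that supports -f (GNU coreutils does), and the helper name link_dataset is hypothetical.

```shell
#!/usr/bin/env bash
# link_dataset (hypothetical helper): symlink a dataset directory into a
# destination directory, then print the real path the new link resolves to.
link_dataset() {
    local src="${1%/}"       # drop a trailing slash, if present
    local dest_dir="$2"
    mkdir -p "$dest_dir" || return 1
    ln -s "$src" "$dest_dir/" || return 1
    # The link is named after the last component of the source path.
    readlink -f "$dest_dir/$(basename "$src")"
}

# Example, using the paths from the text above:
# link_dataset /seqdata/CGI/Fastq_Files/AwesomeWGS_May2025/ "$HOME/AwesomeProject/rawData"
```

Running ls -l on the destination directory shows the same information, with an arrow from the link name to its target.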

/tmp

Quota: Shared space. No individual or group quotas.

Each node has its own local /tmp directory, which can be used for temporary storage of data while actively running analyses. This can be desirable when an analysis would otherwise be limited by I/O operations across the network.

/tmp is a shared space. Once an analysis is done, nothing should be left behind here.

Tip

On our system, /tmp is pretty small and often fills up. Many programs quietly write to /tmp by default without telling users (they usually quietly clean up afterward as well). If /tmp fills up, "No space left on device" errors frequently result, confusing users who know they still have plenty of space in their storage quota. If you are running a program that does this, say:

genomeAssembler -i mySequences.fastq.gz -o myGenome.fasta

You can usually get it to write to another temporary directory by setting and exporting the variable TMPDIR to somewhere else before running it:

export TMPDIR=myNewTmpDir # this directory must exist or an error will result

genomeAssembler -i mySequences.fastq.gz -o myGenome.fasta

You can create your own temporary directory where you are doing the analysis, or you can use /local.
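
One way to combine these steps is to create a fresh temporary directory, point TMPDIR at it for a single command, and remove it afterward. The following is a minimal sketch, not an official cluster utility: the helper name with_private_tmpdir is hypothetical, and it does not clean up if the job is killed by a signal.

```shell
#!/usr/bin/env bash
# with_private_tmpdir (hypothetical helper): run a command with TMPDIR set to
# a newly created directory under a base location, then delete that directory.
#   $1      base location, e.g. /local or your analysis directory
#   $2...   the command to run and its arguments
with_private_tmpdir() {
    local base="$1"; shift
    local tmp status=0
    tmp=$(mktemp -d "$base/tmp.XXXXXX") || return 1
    # Use a subshell so TMPDIR is changed only for this one command.
    ( export TMPDIR="$tmp"; "$@" ) || status=$?
    rm -rf -- "$tmp"     # leave nothing behind, even if the command failed
    return "$status"
}

# Example (genomeAssembler is the placeholder program from the text above):
# with_private_tmpdir /local genomeAssembler -i mySequences.fastq.gz -o myGenome.fasta
```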

/local

Quota: Shared space. No individual or group quotas.

This is another space local to each node and shared among users. It is larger than /tmp but files should be removed immediately after analysis.
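
Because both /tmp and /local are shared and node-local, it can help to check how much space is free on the node before starting a job. The sketch below reads the "Available" column from df in its POSIX output format. The helper name free_kb is hypothetical.

```shell
#!/usr/bin/env bash
# free_kb (hypothetical helper): print the space available to you, in KiB, on
# the file system that holds the given path.
free_kb() {
    # -P: POSIX one-line-per-file-system format; -k: 1024-byte units.
    # Line 1 is the header; column 4 of line 2 is the available space.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Example: compare the node-local spaces on the node where the job runs.
# for d in /tmp /local; do printf '%s\t%s KiB free\n' "$d" "$(free_kb "$d")"; done
```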

Archival spaces

These spaces are meant for archiving data for varying durations. Users can request the creation of directories they or a group of users will have read/write access to.

Each of the archival spaces has the same directory structure, containing the following directories:

users Owned by users
labs Shared by users of a PI’s lab
projects Shared by users belonging to multiple lab groups collaborating on a specific project
departments Shared by users of a department

/archive

Quota: No current quota policy.

This system is intended for medium term storage of data that is no longer being used for analysis, but which may be needed within a year or so. /archive gives relatively fast access (though not as fast as the above directories). The data are securely stored, being geospread across four data centers. As such, it is relatively expensive for us to store data here. Users should not plan on permanently archiving data here.

Eventually, data stored here will be moved to tape storage.

/tapebackup

Quota: No current quota policy.

This system is intended for short term storage (< 1 year). Think of it as a temporary storage space for files you don’t currently need, but are not ready to delete.

This system stores data on magnetic tape. Data stored here goes on a single magnetic tape and is not backed up, though magnetic tape is highly stable.

Access to magnetic tape is slow. Depending on how busy the system is, it may take hours or days for data to be written or retrieved. It is, however, inexpensive.

/tapearchive

Quota: No current quota policy.

This system is intended for long term storage. Completed projects and datasets that are no longer currently active can be stored here.

It also stores data on magnetic tape, but as two redundant copies in the same physical location. As with /tapebackup, access is slow.

This system also includes a seqdata directory where data from /seqdata are moved after a period of 2 years. Users may request that data be moved back to /seqdata for a period of time if they need to actively use the data again for analyses.

Summary table

| Directory | Purpose | Access | Quota/Capacity | Notes |
|---|---|---|---|---|
| /home/FCAM/<user> | Personal home directory for code, config, small data | Created for each user | 25GB | Permissions must remain private. For sharing, use /labs or /projects. |
| /labs | Active shared data for a PI's lab | Members of the PI's permission group | 2TB | Requested by the PI. Quota can be increased upon request. |
| /projects | Active shared data for multi-group projects | Created at the request of one or more PIs | 2TB | Quota can be increased upon request. |
| /scratch | Temporary storage for active analyses | Open to all users | 84TB (shared) | Unused files may be deleted after 90 days. |
| /sandbox | Temporary space | Same as /scratch | Bigger than /scratch | Unused files may be deleted after 90 days. |
| /seqdata | Central store for raw/original sequencing data | Read-only for users | NA | Use symlinks instead of copying. Contact CBC to store new data. |
| /tmp | Node-local temporary space | Shared per node | Small | May fill up unexpectedly. Set TMPDIR to redirect temp files. |
| /local | Larger node-local temp space | Shared per node | Larger than /tmp | Clean up after use. |
| /archive | Medium-term storage (up to ~1 year) | Request via CBC, access varies | NA | Slower than active spaces, geospread, more expensive. |
| /tapebackup | Short-term storage (< 1 year) | Request via CBC, access varies | NA | Stored on a single magnetic tape, not redundant, very slow access. |
| /tapearchive | Long-term storage | Request via CBC, access varies | NA | Redundant magnetic tape storage, very slow access. |

Quotas

To check current usage against quotas for /labs and /projects directories visit this link. You must be connected to the CAM VPN to load the site.

Restoring lost data

A snapshot of systems used for active work (not archival systems) is made each day. Snapshots are preserved for 10 days. You can access this backup to restore files and directories that were lost or corrupted.

To recover a file or directory from a snapshot:

  1. Navigate to the parent directory which contained the file or directory that you want to recover.
  2. Enter the snapshots directory with cd .snapshot.
  3. See the available snapshots with ls -l.
  4. Choose the desired snapshot date and cd <name>.
  5. Copy the file you wish to recover to a location outside of the snapshot with cp <file> <destination>.

Note

The .snapshot directory will not be visible in the output of the ls command.
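
The recovery steps above can be sketched as a single function. This is an illustration under stated assumptions, not an official tool: the helper name restore_from_snapshot is hypothetical, and the snapshot name must be one that actually appears when you list the .snapshot directory (the name daily.0 in the example is made up).

```shell
#!/usr/bin/env bash
# restore_from_snapshot (hypothetical helper): copy one file or directory out
# of a named snapshot of a parent directory.
#   $1 parent directory that contained the lost item
#   $2 snapshot name, as listed by: ls -l "$1/.snapshot"
#   $3 name of the lost file or directory
#   $4 destination outside the snapshot
restore_from_snapshot() {
    local parent="$1" snap="$2" item="$3" dest="$4"
    local src="$parent/.snapshot/$snap/$item"
    if [[ ! -e "$src" ]]; then
        echo "not found in snapshot: $src" >&2
        return 1
    fi
    # -a recurses into directories and preserves timestamps and permissions.
    cp -a -- "$src" "$dest"
}

# Example (daily.0 is a made-up snapshot name):
# restore_from_snapshot "$HOME/AwesomeProject" daily.0 notes.txt "$HOME/AwesomeProject/notes.txt"
```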