File systems and storage
Overview
The HPC offers a variety of systems for storage and archiving of data. With a few noted exceptions, all systems below are network-attached, meaning they are accessible from all compute and login nodes.
Data Storage Systems
Below is a summary of the primary locations in the cluster file system that users will interact with.
Active work spaces
The following spaces are suitable for use for active data analysis.
/home/FCAM/<user>
Quota: 25GB. Generally cannot be expanded.
All users will have a home directory when their account is created.
This location is primarily intended for storage of small files such as scripts, source code, configuration, software installations, and can accommodate a small amount of data.
Use other spaces for analysis of large datasets.
Users are not permitted to open permissions on their home directories. If you need to share files with other users, try any of the below directories you have access to. If you open permissions on your home directory, it may be locked down.
/labs
Quota: 2TB. Can be increased upon request.
Individual users will always be associated with “lab” groups. PI of the lab group can request a space in /labs
.
This location is intended for storage of active shared project data to facilitate collaboration among members of a PI’s lab. You will need to be part of the PI’s permission group to access a lab directory.
/projects
Quota: 2TB. Can be increased upon request.
This can be created at the request of one or more PIs.
This location is intended for storage of active shared project data to facilitate collaboration among multiple groups on a project.
/scratch
Quota: Shared space. No individual or group quotas.
Anyone can create any number of directories here and set permissions to whatever they choose.
This location is intended to serve as temporary storage. This would be an ideal place for the output of intermediate files during analysis. As of last update, /scratch
had 84TB of space. Please be considerate of other users of this shared resource.
In /scratch
unused files are subject to deletion without notice after 90 days.
/sandbox
Quota: Shared space. No individual or group quotas.
This system has identical policies to /scratch
.
Anyone can create any number of directories here and set permissions to whatever they choose.
This location is intended to serve as temporary storage. This would be an ideal place for the output of intermediate files during analysis. Please be considerate of other users of this shared resource.
In /sandbox
unused files are subject to deletion without notice after 90 days.
/seqdata
Quota: Shared space. No individual or group quotas. Users do not have write access.
This location is used for storing raw sequencing data and is intended to alleviate strain on quotas in other locations and to discourage data duplication. Data are stored as read-only. You may request that raw data be stored here (all data generated by CGI is stored here by default), but CBC will have to move it there for you, as users do not have write access.
If you would like to move existing raw data to this location or store data from an external sequencing center, please contact us at cbcsupport@helpspotmail.com
.
Raw sequence data will be stored here for a period of 2 years. After this time, the data will be moved to /tapearchive/seqdata
. See below for more information on /tapearchive
.
Do not copy data from this directory. For convenience, consider making a symlink to files stored here. By avoiding unnecessary copying, accidental data corruption or deletion can be avoided and space will be used more efficiently on the cluster and within directories you own.
Instead of copying data from /seqdata
, create a symlink.
A symlink, or symbolic link is a pointer to a file or directory that behaves like the real thing.
If you want to use data from /seqdata
as if it were in one of your project directories you can create a symlink like this:
ln -s /seqdata/CGI/Fastq_Files/AwesomeWGS_May2025/ /home/FCAM/<user>/AwesomeProject/rawData/
Where /seqdata/CGI/Fastq_Files/AwesomeWGS_May2025
is the raw data directory and /home/FCAM/<user>/AwesomeProject/rawData/
is the destination. You will then have a symlink, /home/FCAM/<user>/AwesomeProject/rawData/AwesomeWGS_May2025
that behaves just as if the files contained had been copied without taking up all the space.
/tmp
Quota: Shared space. No individual or group quotas.
Each node has it’s own local /tmp
directory which can be used for temporary storage of data while actively running analyses. This can be desirable when an analysis would be limited by I/O operations across the network.
/tmp
is a shared space. Once an analysis is done, nothing should be left behind here.
On our system /tmp
is pretty small and often fills up. Many programs quietly write to /tmp
by default without telling users (they usually quietly clean up afterward as well). If /tmp
fills up, no space left on device
errors frequently result, confusing users who know they still have plenty of space in their storage quota. If you are running a program that does this, say:
genomeAssembler -i mySequences.fastq.gz -o myGenome.fasta
You can usually get it to write to another temporary directory by setting and exporting the variable TMPDIR
to somewhere else before running it:
export TMPDIR=myNewTmpDir # this directory must exist or an error will result
genomeAssembler -i mySequences.fastq.gz -o myGenome.fasta
You can create your own temporary directory where you are doing the analysis, or you can use /local
.
/local
Quota: Shared space. No individual or group quotas.
This is another space local to each node and shared among users. It is larger than /tmp
but files should be removed immediately after analysis.
Archival spaces
These spaces are meant for archiving data for varying durations. Users can request the creation of directories they or a group of users will have read/write access to.
Each of the archival spaces have the same directory structure containing the following directories:
users | Owned by users |
labs | Shared by users of a PI’s lab |
projects | Shared by users belonging to multiple lab groups collaborating on a specific project |
departments | Shared by users of a department |
/archive
Quota: No current quota policy.
This system is intended for medium term storage of data that is no longer being used for analysis, but which may be needed within a year or so. /archive
gives relatively fast access (though not as fast as the above directories). The data are securely stored, being geospread across four data centers. As such, it is relatively expensive for us to store data here. Users should not plan on permanently archiving data here.
Eventually, data stored here will be moved to tape storage.
/tapebackup
Quota: No current quota policy.
This system is intended for short term storage (< 1 year). Think of it as a temporary storage space for files you don’t currently need, but are not ready to delete.
This system stores data on magnetic tape. Data stored here goes on a single magnetic tape and is not backed up, though magnetic tape is highly stable.
Access to magnetic tape is slow. Depending on how busy the system is, it may take hours or days for data to be written or retrieved. It is, however, inexpensive.
/tapearchive
Quota: No current quota policy.
This system is intended for long term storage. Completed projects and datasets that are no longer currently active can be stored here.
It also stores data on magnetic tape, but on two redundant copies in the same physical location. Like with /tapebackup
, access is slow.
This system also includes a seqdata
directory where data from /seqdata
are moved after a period of 2 years. Users may request that data be moved back to /seqdata
for a period of time if they need to actively the the data again for anlyses.
Summary table
Directory | Purpose | Access | Quota/Capacity | Notes |
---|---|---|---|---|
/home/FCAM/<user> |
Personal home directory for code, config, small data | Created for each user | 25GB | Permissions must remain private. For sharing, use /labs or /projects . |
/scratch |
Temporary storage for active analyses | Open to all users | 84TB (shared) | Files may be deleted after 90 days. |
/sandbox |
Temporary space | Same as /scratch |
Bigger than scratch | Files may be deleted after 90 days. |
/seqdata |
Central store for raw/original sequencing data | Read-only for users | NA | Use symlinks instead of copying. Contact CBC to store new data. |
/tmp |
Node-local temporary space | Shared per node | Small | May fill up unexpectedly. Set TMPDIR to redirect temp files. |
/local |
Larger node-local temp space | Shared per node | Larger than /tmp |
Clean up after use. |
/archive |
Medium-term storage (up to ~1 year) | Request via CBC, access varies | NA | Slower than active spaces, geospread, more expensive. |
/tapebackup |
Long-term storage (1+ years) | Request via CBC, access varies | NA | Stored on single magnetic tape, not redundant, very slow access. |
/tapearchive |
Very long-term or permanent storage | Request via CBC, access varies | NA | Redundant magnetic tape storage, very slow access. |
Quotas
To check current usage against quotas for /labs
and /projects
directories visit this link. You must be connected to the CAM VPN to load the site.
Restoring lost data
A snapshot of systems used for active work (not archival systems) is made each day. Snapshots are preserved for 10 days. You can access this backup to restore files and directories that were lost or corrupted.
To recover a file or directory from a snapshot:
- Navigate to the parent directory which contained the file or directory that you want to recover.
- Enter the snapshots directory with
cd .snapshot
. - See the available snapshots with
ls -l
. - Choose the desired snapshot date and
cd <name>
. - Copy the file you wish to recover to a location outside of the snapshot
cp <file> <destination>
.
The .snapshot
directory will not be visible in the output of the ls
command.