File systems and storage
Overview
The HPC offers a variety of systems for storage and archiving of data. With a few noted exceptions, all systems below are network-attached, meaning they are accessible from all compute and login nodes.
Data Storage Systems
Below is a summary of the primary locations in the cluster file system that users will interact with.
Active work spaces
The following spaces are suitable for use for active data analysis.
/home/FCAM/<user>
Quota: 25GB. Generally cannot be expanded.
All users will have a home directory when their account is created.
This location is primarily intended for storage of small files such as scripts, source code, configuration, and software installations. It can also accommodate a small amount of data.
Use other spaces for analysis of large datasets.
Users are not permitted to open permissions on their home directories. If you need to share files with other users, use one of the directories below that you have access to. If you open permissions on your home directory, it may be locked down.
/labs
Quota: 2TB. Can be increased upon request.
Individual users will always be associated with “lab” groups. The PI of a lab group can request a space in /labs.
This location is intended for storage of active shared project data to facilitate collaboration among members of a PI’s lab. You will need to be part of the PI’s permission group to access a lab directory.
/projects
Quota: 2TB, can be increased upon request.
A directory here can be created at the request of one or more PIs.
This location is intended for storage of active shared project data to facilitate collaboration among multiple groups on a project.
/scratch
Quota: Shared space. No individual or group quotas.
Anyone can create any number of directories here and set permissions to whatever they choose.
This location is intended to serve as temporary storage. It is an ideal place for intermediate files produced during analysis. As of last update, /scratch had 84TB of space. Please be considerate of other users of this shared resource.
In /scratch, unused files are subject to deletion without notice after 90 days.
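To see which of your files may be approaching the 90-day limit, something like the following can help. This is a minimal sketch under stated assumptions: the policy above says "unused" without specifying whether access time or modification time is checked, so using modification time here is a guess, and the 80-day default and the helper name list_stale_files are made up for illustration.

```shell
#!/usr/bin/env bash
# List your own regular files under a directory that have not been modified
# in more than a given number of days.
# Assumption: "unused" is approximated by modification time; the real purge
# criterion on /scratch may differ.
list_stale_files() {
    local dir="$1"
    local days="${2:-80}"   # hypothetical warning threshold, below the 90-day limit
    find "$dir" -type f -user "$(id -un)" -mtime +"$days" -print
}

# Example (hypothetical path):
#   list_stale_files /scratch/myProject 80
```

Anything this prints is worth moving to a lab, project, or archival space if you still need it.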
/sandbox
Quota: Shared space. No individual or group quotas.
This system has the same policies as /scratch. It is in development and, at the time of writing (7/25), was not yet available.
/seqdata
Quota: Shared space. No individual or group quotas. Users do not have write access.
This location is used for storing raw sequencing data and is intended to alleviate strain on quotas in other locations and to discourage data duplication. Data are stored as read-only. You may request that raw data be stored here (all data generated by CGI is stored here by default), but CBC will have to move it there for you, as users do not have write access.
If you would like to move existing raw data to this location or store data from an external sequencing center, please contact us at cbcsupport@helpspotmail.com.
Do not copy data from this directory. Instead, consider making a symlink to the files stored here. Avoiding unnecessary copies prevents accidental data corruption or deletion and uses space more efficiently, both on the cluster and within directories you own.
Instead of copying data from /seqdata, create a symlink.
A symlink, or symbolic link, is a pointer to a file or directory that behaves like the real thing.
If you want to use data from /seqdata as if it were in one of your project directories, you can create a symlink like this:
ln -s /seqdata/CGI/Fastq_Files/AwesomeWGS_May2025/ /home/FCAM/<user>/AwesomeProject/rawData/
Where /seqdata/CGI/Fastq_Files/AwesomeWGS_May2025 is the raw data directory and /home/FCAM/<user>/AwesomeProject/rawData/ is the destination. You will then have a symlink, /home/FCAM/<user>/AwesomeProject/rawData/AwesomeWGS_May2025, that behaves just as if the files had been copied, without taking up all the space.
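If you want to confirm that a link points where you expect, readlink will show its target. The sketch below uses throwaway directories created with mktemp as stand-ins for the real /seqdata and project paths, so it can be run anywhere; the file name sample_R1.fastq is made up for illustration.

```shell
#!/usr/bin/env bash
# Stand-ins for the raw data directory and the project directory (hypothetical).
src_root=$(mktemp -d)
proj_root=$(mktemp -d)
mkdir -p "$src_root/AwesomeWGS_May2025" "$proj_root/rawData"
echo "reads" > "$src_root/AwesomeWGS_May2025/sample_R1.fastq"

# Create the link inside rawData/; the destination directory must already exist.
ln -s "$src_root/AwesomeWGS_May2025" "$proj_root/rawData/"

# Show where the link points, then read a file through it.
link_target=$(readlink "$proj_root/rawData/AwesomeWGS_May2025")
echo "link points to: $link_target"
cat "$proj_root/rawData/AwesomeWGS_May2025/sample_R1.fastq"
```

Removing the link itself with rm (no trailing slash on the link name) deletes only the pointer, not the data it points to.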
/tmp
Quota: Shared space. No individual or group quotas.
Each node has its own local /tmp directory, which can be used for temporary storage of data while actively running analyses. This can be desirable when an analysis would be limited by I/O operations across the network.
/tmp is a shared space. Once an analysis is done, nothing should be left behind here.
On our system /tmp is pretty small and often fills up. Many programs quietly write to /tmp by default without telling users (they usually quietly clean up afterward as well). If /tmp fills up, "no space left on device" errors frequently result, confusing users who know they still have plenty of space in their storage quota. If you are running a program that does this, say:
genomeAssembler -i mySequences.fastq.gz -o myGenome.fasta
You can usually get it to write to another temporary directory by setting and exporting the variable TMPDIR to somewhere else before running it:
export TMPDIR=myNewTmpDir # this directory must exist or an error will result
genomeAssembler -i mySequences.fastq.gz -o myGenome.fasta
You can create your own temporary directory where you are doing the analysis, or you can use /local.
The system will remove contents of this directory after TODO: how long or under what conditions?
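One way to make sure the temporary directory exists before the program runs, and is removed afterward, is to create it with mktemp and clean it up with a trap. This is only a sketch: make_job_tmpdir and JOB_TMPDIR are hypothetical names, and genomeAssembler is the same placeholder program used above.

```shell
#!/usr/bin/env bash
# make_job_tmpdir is a hypothetical helper: it creates a private temporary
# directory under the given base directory and exports TMPDIR to point at it.
make_job_tmpdir() {
    local base="$1"
    JOB_TMPDIR=$(mktemp -d "$base/tmp.XXXXXX") || return 1
    export TMPDIR="$JOB_TMPDIR"
    # Remove the directory when the shell or script exits, even after a failure.
    trap 'rm -rf "$JOB_TMPDIR"' EXIT
}

# Example usage in a job script:
#   make_job_tmpdir "$PWD"     # or a directory you created under /local
#   genomeAssembler -i mySequences.fastq.gz -o myGenome.fasta
```

Because mktemp generates a unique name, two jobs started from the same directory will not write into each other's temporary files.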
/local
Quota: Shared space. No individual or group quotas.
This is another space local to each node and shared among users. It is larger than /tmp, but files should be removed immediately after analysis.
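A common way to use node-local space is to copy inputs in, run the analysis there, copy results back to network storage, and then delete the local copy. The sketch below illustrates that pattern only; stage_and_run is a hypothetical helper, the line count stands in for a real analysis command, and the example paths are made up.

```shell
#!/usr/bin/env bash
# stage_and_run is a hypothetical helper showing the stage-in / stage-out pattern.
# Arguments: node-local base directory, input file, directory for results.
stage_and_run() {
    local base="$1" input="$2" outdir="$3"
    local work
    work=$(mktemp -d "$base/job.XXXXXX") || return 1

    # Stage in: copy the input to node-local storage.
    cp "$input" "$work/" || { rm -rf "$work"; return 1; }

    # Placeholder analysis: count lines in the input. Replace with a real command.
    wc -l < "$work/$(basename "$input")" > "$work/result.txt"

    # Stage out: copy results to network storage, then remove the local copy.
    mkdir -p "$outdir" && cp "$work/result.txt" "$outdir/"
    local status=$?
    rm -rf "$work"
    return "$status"
}

# Example (hypothetical paths):
#   stage_and_run /local /labs/MyLab/data/reads.txt /labs/MyLab/results
```

The working directory is removed whether or not the copy back succeeds, so nothing is left behind on the node.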
Archival spaces
These spaces are meant for archiving data for varying durations. Users can request the creation of directories they have read/write access to.
/archive
Quota: No current quota policy.
This system is intended for medium-term storage of data that is no longer being used for analysis, but which may be needed within a year or so. /archive gives relatively fast access (though not as fast as the above directories). The data are securely stored, being geospread across four data centers. As such, it is relatively expensive for us to store data here. Users should not plan on permanently archiving data here.
PIs and users can request storage space here, to which they have read/write access, in a directory structure that mirrors /home/FCAM, /projects and /labs.
Eventually, data stored here will be moved to tape storage. TODO: Say a bit more about the goal of archive and tape. How safe is data here? Does it need to also be stored elsewhere such as SRA?
/tapebackup
Quota: No current quota policy.
This system is intended for short term storage (< 1 year). Think of it as a temporary storage space for files you don’t currently need, but are not ready to delete.
This system stores data on magnetic tape. Data stored here goes on a single magnetic tape and is not backed up, though magnetic tape is highly stable.
Access to magnetic tape is slow. Depending on how busy the system is, it may take hours or days for data to be written or retrieved. It is, however, inexpensive.
The directory structure is similar to /archive. Users can request space here to which they have read/write access.
/tapearchive
Quota: No current quota policy.
This system is intended for long term storage. Completed projects and datasets that are no longer currently active can be stored here.
It also stores data on magnetic tape, but on two redundant systems. Like /tapebackup, access is slow.
User access and directory structure are the same as /archive and /tapebackup.
Summary table
Directory | Purpose | Access | Quota/Capacity | Notes |
---|---|---|---|---|
/home/FCAM/<user> | Personal home directory for code, config, small data | Created for each user | 25GB | Permissions must remain private. For sharing, use /labs or /projects. |
/labs | Shared lab project space | PI request, lab group members | 2TB (increasable) | For intra-lab collaboration. |
/projects | Shared inter-lab project space | PI request, project group members | 2TB (increasable) | For collaboration across labs. |
/scratch | Temporary storage for active analyses | Open to all users | 84TB (shared) | Unused files may be deleted after 90 days. |
/sandbox | Temporary space (in development) | Same as /scratch | TBD | Not yet active as of 7/25. |
/seqdata | Central store for raw/original sequencing data | Read-only for users | NA | Use symlinks instead of copying. Contact CBC to store new data. |
/tmp | Node-local temporary space | Shared per node | Small | May fill up unexpectedly. Set TMPDIR to redirect temp files. |
/local | Larger node-local temp space | Shared per node | Larger than /tmp | Clean up after use. |
/archive | Medium-term storage (up to ~1 year) | Request via CBC, access varies | NA | Slower than active spaces, geospread, more expensive. |
/tapebackup | Short-term storage (< 1 year) | Request via CBC, access varies | NA | Stored on single magnetic tape, not redundant, very slow access. |
/tapearchive | Long-term storage | Request via CBC, access varies | NA | Redundant magnetic tape storage, very slow access. |
Quotas
To check current usage against quotas for /labs and /projects directories, visit this link. You must be connected to the CAM VPN to load the site.
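If you are close to a quota and want to find out what is taking up the space, du can give a rough breakdown from the command line. This is a sketch rather than an official quota tool: largest_subdirs is a hypothetical helper, the numbers du reports may not match the quota accounting exactly, and it can be slow on directories with many files.

```shell
#!/usr/bin/env bash
# largest_subdirs is a hypothetical helper: it prints the size in kilobytes of
# a directory and each of its immediate subdirectories, largest first.
# The first line is the total for the directory itself.
largest_subdirs() {
    local dir="$1"
    local count="${2:-10}"
    du -k --max-depth=1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Example (hypothetical path):
#   largest_subdirs /labs/MyLab 10
```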
Restoring lost data
A snapshot of systems used for active work (not archival systems) is made each day. Snapshots are preserved for 10 days. You can access this backup to restore files and directories that were lost or corrupted.
To recover a file or directory from a snapshot:
- Navigate to the parent directory which contained the file or directory that you want to recover.
- Enter the snapshots directory with cd .snapshot.
- See the available snapshots with ls -l.
- Choose the desired snapshot date and cd <name>.
- Copy the file you wish to recover to a location outside of the snapshot with cp <file> <destination>.
The .snapshot directory will not be visible in the output of the ls command.
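The recovery steps above can also be wrapped in a small function. This is a minimal sketch: restore_from_snapshot is a hypothetical helper, it assumes the layout described above (a .snapshot directory inside the parent directory, with one subdirectory per snapshot), and the snapshot name in the example is made up.

```shell
#!/usr/bin/env bash
# restore_from_snapshot is a hypothetical helper wrapping the manual steps.
# Arguments: parent directory, snapshot name, file or directory name, destination.
restore_from_snapshot() {
    local parent="$1" snap="$2" name="$3" dest="$4"
    local src="$parent/.snapshot/$snap/$name"
    if [ ! -e "$src" ]; then
        echo "not found in snapshot: $src" >&2
        return 1
    fi
    # -a copies directories recursively and preserves timestamps and permissions.
    cp -a "$src" "$dest"
}

# Example (use a snapshot name shown by ls -l inside .snapshot; this one is made up):
#   restore_from_snapshot /labs/MyLab/project daily_2025-07-01 results.csv ~/restored/
```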
Connecting to the file system
Users typically interact with the file systems by connecting to the Mantis cluster and using the command line. You can also map any of the file systems to your local computer and access them through your usual graphical user interface. You can drag-and-drop files to move or copy them or open them to edit in any software you use locally. One great use-case for this is to inspect BAM files using IGV (without having to download them).
To do this, you must first connect to the CAM VPN (instructions). Then:
- From a Mac, select Go:Connect to Server from the top menu bar in Finder and enter smb://cfs09.cam.uchc.edu/home/FCAM/<username>. When prompted, enter your CAM credentials (not your netID/password).
- From Windows, you can map a network file system using these directions and an address formatted like this: \\cfs09.cam.uchc.edu\home\FCAM\<username>
The above instructions are for user home directories, which are located on file system cfs09. Other directories, such as /scratch and /seqdata, are on separate file systems. You will need to update the above address accordingly. To list the file systems that serve the top-level directories, run the following while connected via SSH to Mantis:
df -h
Filesystem Size Used Avail Use% Mounted on
cfs12:/core 2.1P 1.5P 593T 72% /core
cfs09:/labs 2.3P 1.9P 453T 81% /labs
cfs08:/ifs/scratch 84T 54T 29T 66% /scratch
cfs09:/isg 82T 76T 6.5T 93% /isg
cfs09:/home/FCAM 2.3P 1.9P 453T 81% /home/FCAM
cfs15:/seqdata 728T 722T 6.1T 100% /seqdata
Some, such as /seqdata, only mount when you visit them and may not immediately appear in the output of df.
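To look up the file system for one such directory, you can visit it first and then ask df about that path only. The helper name show_mount below is hypothetical; the sketch simply lists the directory so that an on-demand mount is triggered, then runs df on it.

```shell
#!/usr/bin/env bash
# show_mount is a hypothetical helper: it lists a directory first so that an
# on-demand file system gets mounted, then reports which file system serves it.
show_mount() {
    local dir="$1"
    ls "$dir" > /dev/null || return 1
    df -h "$dir"
}

# Example:
#   show_mount /seqdata
```

The server name in the Filesystem column (for example, cfs08 for /scratch in the listing above) is presumably the part of the address to substitute for cfs09 when mapping that directory.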