File systems and storage

Overview

The HPC offers a variety of systems for storage and archiving of data. With a few noted exceptions, all systems below are network-attached, meaning they are accessible from all compute and login nodes.

Data Storage Systems

Below is a summary of the primary locations in the cluster file system that users will interact with.

Active work spaces

The following spaces are suitable for active data analysis.

/home/FCAM/<user>

Quota: 25GB. Generally cannot be expanded.

All users will have a home directory when their account is created.

This location is primarily intended for storage of small files such as scripts, source code, configuration files, and software installations, and it can accommodate a small amount of data.

Use other spaces for analysis of large datasets.
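
For a quick estimate of how much of the 25GB quota you are using, standard tools work (du may take a moment on directories containing many files):

du -sh ~    # summarize total usage of your home directory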

Note

Users are not permitted to open permissions on their home directories. If you need to share files with other users, use one of the directories below that you have access to. If you do open permissions on your home directory, it may be locked down.

/labs

Quota: 2TB, can be increased upon request.

Individual users will always be associated with “lab” groups. The PI of a lab group can request a space in /labs.

This location is intended for storage of active shared project data to facilitate collaboration among members of a PI’s lab. You will need to be part of the PI’s permission group to access a lab directory.
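
To check whether you can access a particular lab directory, you can compare your groups against the directory’s group owner (the lab path below is a placeholder):

groups                  # list the permission groups your account belongs to
ls -ld /labs/<pi_lab>   # the directory's group owner should be among them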

/projects

Quota: 2TB, can be increased upon request.

This can be created at the request of one or more PIs.

This location is intended for storage of active shared project data to facilitate collaboration among multiple groups on a project.

/scratch

Quota: Shared space. No individual or group quotas.

Anyone can create any number of directories here and set permissions to whatever they choose.
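
For example, a common pattern (the directory name is up to you) is to create a personal working area and restrict its permissions:

mkdir -p /scratch/<user>    # your own working directory in scratch
chmod 700 /scratch/<user>   # optional: make it private to you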

This location is intended to serve as temporary storage. This would be an ideal place for the output of intermediate files during analysis. As of last update, /scratch had 84TB of space. Please be considerate of other users of this shared resource.

Warning

In /scratch unused files are subject to deletion without notice after 90 days.
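
To see which of your files may be approaching the cutoff, find can help (a sketch assuming the policy keys on modification time; the actual criterion may differ):

find /scratch/<user> -type f -mtime +80    # files not modified in over 80 days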

/sandbox

Quota: Shared space. No individual or group quotas.

This system has policies identical to /scratch. It is in development and, at the time of writing (7/25), was not yet available.

/seqdata

Quota: Shared space. No individual or group quotas. Users do not have write access.

This location is used for storing raw sequencing data and is intended to alleviate strain on quotas in other locations and to discourage data duplication. Data are stored as read-only. You may request that raw data be stored here (all data generated by CGI is stored here by default), but CBC will have to move it there for you, as users do not have write access.

If you would like to move existing raw data to this location or store data from an external sequencing center, please contact us at cbcsupport@helpspotmail.com.

Do not copy data from this directory. For convenience, make a symlink to files stored here instead. Avoiding unnecessary copies reduces the risk of accidental data corruption or deletion and makes more efficient use of space on the cluster and within directories you own.

Tip

Instead of copying data from /seqdata, create a symlink.

A symlink, or symbolic link, is a pointer to a file or directory that behaves like the real thing.

If you want to use data from /seqdata as if it were in one of your project directories you can create a symlink like this:

ln -s /seqdata/CGI/Fastq_Files/AwesomeWGS_May2025/ /home/FCAM/<user>/AwesomeProject/rawData/

Here /seqdata/CGI/Fastq_Files/AwesomeWGS_May2025 is the raw data directory and /home/FCAM/<user>/AwesomeProject/rawData/ is the destination. You will then have a symlink, /home/FCAM/<user>/AwesomeProject/rawData/AwesomeWGS_May2025, that behaves just as if the files it contains had been copied there, without taking up the space.
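
You can confirm that the link was created and see where it points with ls -l; the arrow in the output shows the target (the output below is illustrative):

ls -l /home/FCAM/<user>/AwesomeProject/rawData/
# lrwxrwxrwx ... AwesomeWGS_May2025 -> /seqdata/CGI/Fastq_Files/AwesomeWGS_May2025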

/tmp

Quota: Shared space. No individual or group quotas.

Each node has its own local /tmp directory, which can be used for temporary storage of data while actively running analyses. This can be desirable when an analysis would otherwise be limited by I/O operations across the network.

/tmp is a shared space. Once an analysis is done, nothing should be left behind here.

Tip

On our system /tmp is fairly small and often fills up. Many programs quietly write to /tmp by default without telling users (they usually clean up quietly afterward as well). If /tmp fills up, “No space left on device” errors frequently result, confusing users who know they still have plenty of space in their storage quota. If you are running a program that does this, say:

genomeAssembler -i mySequences.fastq.gz -o myGenome.fasta

You can usually get it to write to another temporary directory by setting and exporting the variable TMPDIR to somewhere else before running it:

export TMPDIR=myNewTmpDir # this directory must exist or an error will result

genomeAssembler -i mySequences.fastq.gz -o myGenome.fasta

You can create your own temporary directory where you are doing the analysis, or you can use /local.
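
For example, a minimal sketch using mktemp to create a uniquely named temporary directory under /local (any location you can write to works) and clean it up afterward:

export TMPDIR=$(mktemp -d /local/<user>_tmp.XXXXXX)   # create the directory and point TMPDIR at it
genomeAssembler -i mySequences.fastq.gz -o myGenome.fasta
rm -rf "$TMPDIR"                                      # remove it when the run finishes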

The system will remove the contents of /tmp after TODO: how long or under what conditions?

/local

Quota: Shared space. No individual or group quotas.

This is another space local to each node and shared among users. It is larger than /tmp, but files should be removed immediately after analysis.
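
A typical staging pattern for node-local disk (the paths, group names, and program are placeholders) looks like this:

mkdir -p /local/<user>/job1                                  # node-local working space
cp /labs/<pi_lab>/mySequences.fastq.gz /local/<user>/job1/   # stage input onto the node
cd /local/<user>/job1
genomeAssembler -i mySequences.fastq.gz -o myGenome.fasta    # run against local disk
cp myGenome.fasta /labs/<pi_lab>/results/                    # copy results back to network storage
cd && rm -rf /local/<user>/job1                              # leave nothing behind in /local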

Archival spaces

These spaces are meant for archiving data for varying durations. Users can request the creation of directories they have read/write access to.

/archive

Quota: No current quota policy.

This system is intended for medium-term storage of data that is no longer being used for analysis but may be needed within a year or so. /archive gives relatively fast access (though not as fast as the directories above). The data are securely stored, geospread across four data centers. As such, it is relatively expensive for us to store data here, and users should not plan on permanently archiving data here.

PIs and users can request storage space here, in a directory structure that mirrors /home/FCAM, /projects, and /labs, to which they have read/write access.

Eventually, data stored here will be moved to tape storage. TODO: Say a bit more about the goal of archive and tape. How safe is data here? Does it need to also be stored elsewhere such as SRA?

/tapebackup

Quota: No current quota policy.

This system is intended for short-term storage (< 1 year). Think of it as a temporary storage space for files you don’t currently need but are not ready to delete.

This system stores data on magnetic tape. Data stored here goes on a single magnetic tape and is not backed up, though magnetic tape is highly stable.

Access to magnetic tape is slow. Depending on how busy the system is, it may take hours or days for data to be written or retrieved. It is, however, inexpensive.

The directory structure is similar to /archive. Users can request space here to which they have read/write access.

/tapearchive

Quota: No current quota policy.

This system is intended for long-term storage. Completed projects and datasets that are no longer active can be stored here.

It also stores data on magnetic tape, but on two redundant systems. Like /tapebackup, access is slow.

User access and directory structure are the same as /archive and /tapebackup.

Summary table

| Directory | Purpose | Access | Quota/Capacity | Notes |
|---|---|---|---|---|
| /home/FCAM/<user> | Personal home directory for code, config, small data | Created for each user | 25GB | Permissions must remain private. For sharing, use /labs or /projects. |
| /labs | Shared lab project space | PI request, lab group members | 2TB (increasable) | For intra-lab collaboration. |
| /projects | Shared inter-lab project space | PI request, project group members | 2TB (increasable) | For collaboration across labs. |
| /scratch | Temporary storage for active analyses | Open to all users | 84TB (shared) | Files may be deleted after 90 days. |
| /sandbox | Temporary space (in development) | Same as /scratch | TBD | Not yet active as of 7/25. |
| /seqdata | Central store for raw/original sequencing data | Read-only for users | NA | Use symlinks instead of copying. Contact CBC to store new data. |
| /tmp | Node-local temporary space | Shared per node | Small | May fill up unexpectedly. Set TMPDIR to redirect temp files. |
| /local | Larger node-local temp space | Shared per node | Larger than /tmp | Clean up after use. |
| /archive | Medium-term storage (up to ~1 year) | Request via CBC, access varies | NA | Slower than active spaces, geospread, more expensive. |
| /tapebackup | Short-term storage (< 1 year) | Request via CBC, access varies | NA | Stored on a single magnetic tape, not redundant, very slow access. |
| /tapearchive | Long-term or permanent storage | Request via CBC, access varies | NA | Redundant magnetic tape storage, very slow access. |

Quotas

To check current usage against quotas for /labs and /projects directories, visit this link. You must be connected to the CAM VPN to load the site.
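
If you just need a rough number from the command line, du reports current usage (this can be slow on large directories and is not the official quota accounting):

du -sh /labs/<pi_lab>    # summarize total usage for a lab directory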

Restoring lost data

A snapshot of systems used for active work (not archival systems) is made each day. Snapshots are preserved for 10 days. You can access this backup to restore files and directories that were lost or corrupted.

To recover a file or directory from a snapshot:

  1. Navigate to the parent directory that contained the file or directory you want to recover.
  2. Enter the snapshots directory with cd .snapshot.
  3. See the available snapshots with ls -l.
  4. Choose the desired snapshot date and enter it with cd <name>.
  5. Copy the file you wish to recover to a location outside of the snapshot: cp <file> <destination>. A worked example follows below.
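
Putting the steps together (the directory, snapshot, and file names here are hypothetical; yours will differ):

cd /home/FCAM/<user>/AwesomeProject                 # 1. parent directory of the lost file
cd .snapshot                                        # 2. enter the hidden snapshot directory
ls -l                                               # 3. list the available snapshots
cd daily_2025-07-20                                 # 4. pick a snapshot (names will vary)
cp myScript.sh /home/FCAM/<user>/AwesomeProject/    # 5. copy the file back out
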
Note

The .snapshot directory will not be visible in the output of the ls command.

Connecting to the file system

Users typically interact with the file systems by connecting to the Mantis cluster and using the command line. You can also map any of the file systems to your local computer and access them through your usual graphical user interface. You can drag and drop files to move or copy them, or open them for editing in any software you use locally. One great use case for this is inspecting BAM files with IGV (without having to download them).

To do this, you must first connect to the CAM VPN (instructions). Then:

  • From a Mac, select Go > Connect to Server from the Finder menu bar and enter smb://cfs09.cam.uchc.edu/home/FCAM/<username>. When prompted, enter your CAM credentials (not your netID/password).
  • From Windows, you can map a network file system using these directions, with the address formatted like this: \\cfs09.cam.uchc.edu\home\FCAM\<username>

The above instructions are for user home directories, which are located on file system cfs09. Other directories, such as /scratch and /seqdata, are on separate file systems, so you will need to update the address accordingly. To list the file systems backing the root directories, run the following while connected to Mantis via SSH:

df -h
Filesystem               Size  Used Avail Use% Mounted on
cfs12:/core              2.1P  1.5P  593T  72% /core
cfs09:/labs              2.3P  1.9P  453T  81% /labs
cfs08:/ifs/scratch        84T   54T   29T  66% /scratch
cfs09:/isg                82T   76T  6.5T  93% /isg
cfs09:/home/FCAM         2.3P  1.9P  453T  81% /home/FCAM
cfs15:/seqdata           728T  722T  6.1T 100% /seqdata

Some file systems, such as /seqdata, only mount when visited and may not immediately appear in the df output.
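
If a file system you expect is missing from the df listing, visiting the path first (for example with ls) should trigger the mount:

ls /seqdata > /dev/null    # visiting the path triggers the mount
df -h /seqdata             # the file system should now appear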