Researcher Spotlight
Learn about notable advances in our single-cell science research with the Boston Children’s Hospital (BCH) community and beyond. We are committed to advancing single-cell science by making it accessible and empowering other scientists to dig deeper into their data, one cell at a time.
- cdnadmin
Collecting Metadata In Large-Scale Projects
A case study of the HCA integrated gut cell atlas
Introduction
Human experiments can have complex designs and many clinical covariates that affect analysis. Collecting these covariates can be difficult because they span multiple experimental levels, from experimental and analytical methods to donor and sampling information. This is especially true in single-cell experiments, where thousands of cells are collected per sample, and sometimes multiple samples are collected per individual.
Here, we use the term metadata to refer to any variable related to the dataset, participant, sample, cell or gene except the core cell x gene expression matrix. Organizing these metadata becomes critical as complexity increases, and if we want to bring in published data for comparison, we need to create or follow a standard metadata format. When analyzing published data, required metadata fields are sometimes unavailable, so it is necessary to reach out to the authors. When reaching out, providing an online standardized metadata sheet improves accuracy, skips redundant data wrangling steps, and enhances collaboration.
Here I will explain the methods I have used to make metadata collection and organization feasible at the atlas level using Google Sheets. We use this collection process in the HCA integrated gut cell atlas project.
Design of Metadata
The HCA provides a required metadata schema that also includes the metadata fields required by CELLxGENE. The HCA metadata guidelines are highly detailed (https://data.humancellatlas.org/metadata). CELLxGENE follows similar guidelines, which are described more simply here (https://cellxgene.cziscience.com/docs/032__Contribute%20and%20Publish%20Data) in terms of single-cell data analysis. We adhere to both schemas to make dataset upload to the portals easier. The following portals increase public access to data, and each offers its own tools:
- CZI: CELLxGENE.cziscience.com: A highly optimized/efficient app for working with single-cell data
- HCA: data.humancellatlas.org: A platform for storing data for use in meta-atlases
- CAP: celltype.info: An analysis app similar to CELLxGENE with a focus on crowd-sourcing cell type annotations
We create separate Google Sheets for defining and collecting tier 1 (common and public) and tier 2 (patient-protected and tissue-specific) metadata. The Google Sheets for tier 2 metadata can then be downloaded with their formatting preserved, so they can be filled in offline in a secure location. Tier 1 metadata is collected directly in Google Sheets to make collection more collaborative.
First, a metadata definitions file is created with detailed descriptions of each of the metadata fields, as well as dropdown menu options for fields that can be restricted to categories. Here is an example sheet:
https://docs.google.com/spreadsheets/d/1Jz02P7ZnqaigofvXVOxrYJh9bhvHa7nY-4EdJrt4BRA/edit?gid=0#gid=0
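To illustrate what a definitions file encodes, here is a minimal sketch of field definitions with optional dropdown categories, plus a check of one filled-in row against them. The field names, descriptions, and allowed values below are hypothetical placeholders, not the contents of the example sheet or of the HCA schema.

```python
from dataclasses import dataclass, field


@dataclass
class FieldDefinition:
    """One metadata field: its name, a description, and optional dropdown options."""
    name: str
    description: str
    allowed_values: list = field(default_factory=list)  # empty list means free text

    def accepts(self, value):
        """Return True if the value is acceptable for this field."""
        if not self.allowed_values:
            # Free-text field: only require that something was entered.
            return value.strip() != ""
        return value in self.allowed_values


# Hypothetical definitions, for illustration only.
DEFINITIONS = [
    FieldDefinition("sex", "Reported sex of the donor", ["female", "male", "unknown"]),
    FieldDefinition("sample_id", "Unique identifier for the sample"),
]


def invalid_entries(row, definitions):
    """Return the names of fields in one row that fail their definition."""
    return [d.name for d in definitions if not d.accepts(row.get(d.name, ""))]


print(invalid_entries({"sex": "F", "sample_id": "S1"}, DEFINITIONS))
```

Restricting a field to a fixed list of options is what lets a dropdown menu prevent free-text variants such as "F" or "Female" from entering the sheet in the first place.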
We then use a custom Python library (available on our GitHub: https://github.com/CellDiscoveryNetwork/MetaManager) to generate empty Google Sheets with the custom formatting. This library contains functions for generating Google Sheets and for reading whatever is entered into these sheets into Python data frames.
Once the Google Sheets are made, we share them with all of the members of the experiment, because the people who gathered the donor-level information may not be the same people who decided on the experimental or analytical methods (such as alignment parameters). We can then start setting up the tracking method.
With the Python library, we can use pattern matching or other validation methods in Python to determine whether the metadata was filled out correctly. We then visualize the results on a heatmap, where each row corresponds to a dataset or experimental protocol and each column corresponds to a metadata field.
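For readers who want to see the general idea of reading a sheet without the library, the sketch below is one possible approach using only the Python standard library. It is not the MetaManager API. It assumes the sheet is shared so that anyone with the link can view it, and that Google's CSV export URL for the sheet is reachable; the function names are made up for this example.

```python
import csv
import io
import re
import urllib.request

SHEET_ID_PATTERN = re.compile(r"/spreadsheets/d/([A-Za-z0-9_-]+)")
GID_PATTERN = re.compile(r"[?#&]gid=(\d+)")


def csv_export_url(sheet_url):
    """Turn a Google Sheets edit URL into its CSV export URL."""
    id_match = SHEET_ID_PATTERN.search(sheet_url)
    if id_match is None:
        raise ValueError(f"Not a Google Sheets URL: {sheet_url}")
    gid_match = GID_PATTERN.search(sheet_url)
    gid = gid_match.group(1) if gid_match else "0"  # default to the first tab
    return (
        "https://docs.google.com/spreadsheets/d/"
        f"{id_match.group(1)}/export?format=csv&gid={gid}"
    )


def parse_csv_text(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def read_sheet(sheet_url):
    """Download one tab of a link-shared sheet and return its rows."""
    with urllib.request.urlopen(csv_export_url(sheet_url)) as response:
        return parse_csv_text(response.read().decode("utf-8"))
```

The list of row dicts returned by `read_sheet` can be passed directly to `pandas.DataFrame(...)` if a data frame is preferred. Private sheets, such as those holding tier 2 metadata, would need authenticated access instead and are out of scope for this sketch.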
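The validation step can be sketched as follows. This is an illustration of pattern matching in general, not the checks used in the project: the field names and regular expressions are hypothetical, and scoring each dataset by the fraction of valid rows per field is an assumption about how the heatmap values could be computed.

```python
import re

# Hypothetical fields and patterns, for illustration only.
FIELD_PATTERNS = {
    "donor_id": re.compile(r"[A-Za-z0-9_.-]+"),
    "donor_age_years": re.compile(r"\d{1,3}"),
    "tissue_ontology_term_id": re.compile(r"UBERON:\d{7}"),
}


def is_valid(field_name, value):
    """Return True if the whole value matches the pattern for its field."""
    return FIELD_PATTERNS[field_name].fullmatch(value.strip()) is not None


def correctness_matrix(rows_by_dataset):
    """Map each dataset to {field: fraction of its rows that pass validation}.

    rows_by_dataset maps a dataset name to a list of row dicts.
    Missing fields count as invalid; a dataset with no rows scores 0.0.
    """
    matrix = {}
    for dataset, rows in rows_by_dataset.items():
        scores = {}
        for field_name in FIELD_PATTERNS:
            valid = sum(is_valid(field_name, row.get(field_name, "")) for row in rows)
            scores[field_name] = valid / len(rows) if rows else 0.0
        matrix[dataset] = scores
    return matrix
```

The resulting nested dict has one entry per dataset and one score per field, which is the same shape as the heatmap: datasets as rows and metadata fields as columns.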
One heatmap is made per metadata level.
In the Gut Cell Atlas project, we publish these heatmaps on a public web page hosted on Google Cloud (https://hca_gut_cell_atlas.storage.googleapis.com/metadata_correctness.html), which we then share with all of our contributors. We have set up a server that automatically downloads the metadata from the Google Sheets and updates these heatmaps every couple of minutes, so that we can give feedback on metadata entry in near real time.
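As a rough idea of how such a page could be produced, the sketch below renders a correctness matrix as a static HTML table with colored cells. It is not the code behind the project page. The input shape (dataset mapped to per-field scores between 0 and 1), the color scale, and the function names are all assumptions made for this example, and uploading the file to cloud storage is left out.

```python
import html


def cell_color(score):
    """Map a score in [0, 1] to a red-to-green CSS color."""
    score = min(max(score, 0.0), 1.0)  # clamp out-of-range values
    red = round(255 * (1 - score))
    green = round(255 * score)
    return f"rgb({red},{green},80)"


def render_heatmap_html(matrix, fields):
    """Render {dataset: {field: score}} as an HTML table, one row per dataset."""
    header = "".join(f"<th>{html.escape(name)}</th>" for name in fields)
    body_rows = []
    for dataset, scores in matrix.items():
        cells = []
        for name in fields:
            score = scores.get(name, 0.0)  # a missing field is shown as 0%
            color = cell_color(score)
            cells.append(f'<td style="background:{color}">{score:.0%}</td>')
        label = html.escape(dataset)
        body_rows.append(f"<tr><th>{label}</th>{''.join(cells)}</tr>")
    return (
        "<table>\n"
        f"<tr><th>dataset</th>{header}</tr>\n"
        + "\n".join(body_rows)
        + "\n</table>"
    )


def write_page(matrix, fields, path):
    """Write the rendered table to an HTML file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_heatmap_html(matrix, fields))
```

Rerunning `write_page` on a schedule, for example from a cron job every few minutes, would keep the published page in step with the sheets.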
In conclusion, we have created restricted metadata entry sheets that can be filled in collaboratively, and we have a method for checking whether they were filled out correctly. Google Sheets also has built-in version history, so if anything goes awry, earlier versions can be restored. Because the sheets are online and available to all collaborators, they help prevent siloing of bioinformaticians, medical staff, and experimentalists, and they cut down on lengthy email exchanges about correcting metadata. For meta-analyses, they also streamline large-scale collection of metadata from collaborators.