This section provides a step-by-step guide to data sharing, along with the services available to support it and their respective strengths. Most are available to Neuro members at no cost and require minimal implementation effort or technical knowledge. A summary table and example use cases complete the section.
1. Identify what to share and how
First, decide what data to share and under what degree of restriction. Researchers should decide whether to share raw or processed data based on what is needed to reproduce published findings, what would be most useful to other researchers, and practical considerations (e.g., data size).
Additionally, ethical considerations are paramount in the case of data from human participants. Generally, data may only be shared with participants’ consent, as approved by the Research Ethics Board overseeing the project. The consent forms and approved protocols will provide further information regarding how the data may be accessed. The data will fall in one of three access categories:
- Fully Open Access: no restrictions and minimal conditions are set on accessing the data.
- Registered Access: enables the owner to define a set of conditions, such as licenses the requester must agree to. This may, for example, include anonymized, non-identifiable clinical data.
- Controlled Access: the strictest tier, used most often for sensitive human data (e.g., genomics). Requests are reviewed by a committee before sharing.
A single dataset may contain different data types made available under different modes of access. The mode(s) of access needed for your data will affect your choice of a data repository (Step 2).
2. Identify a repository to share the data
Identifying an appropriate repository can be done using the list below, with the access type determined in Step 1 and based on the specifics of the dataset (e.g., size).
The focus here is on services available to Neuro researchers, with servers located in Canada, as is often required for sharing human data. When ethical and privacy considerations are not a concern, for instance in the case of data from animal models, several additional options are available (Zenodo, Figshare, Dryad, etc.).
Resources to learn more about data repositories: Re3data.org, FAIRsharing and The Harvard data management repository comparison matrix.
- The Canadian Open Neuroscience Platform (CONP) is a Brain Canada-funded platform created to support the sharing of neuroscience datasets.
- The Federated Research Data Repository (FRDR) is the Digital Research Alliance of Canada's data-sharing and preservation platform.
- The McGill University Dataverse is a McGill Library service for data sharing, based on the open-source Dataverse project and connected with other Canadian university repositories through Borealis.
- The Clinical Biospecimen, Imaging and Genetic repository (C-BIG) is a platform and patient registry offering biobanking services, equipment samples and patient-information processing and management.
The following table presents each service in greater detail, including storage size and the types of data and datasets accepted.

| Service | Data size | Dataset characteristics | Access | Additional information |
| --- | --- | --- | --- | --- |
| CONP [request] | 50 GB, up to a few TB with approval | Static, publication-ready | Open | Can also share tools and pipelines; integrated with LORIS; DataLad¹ support. Ideal for larger datasets deposited by users with technical knowledge. |
| FRDR [request] | 1 TB, up to a few TB with approval | Static, publication-ready | Open | Allows “collections” for grouping datasets; Globus² support; DOI generation. Great for larger datasets deposited by users with no technical knowledge. |
| Dataverse [request] | Default of 20 GB | Static, publication-ready | Open and Registered | User-friendly sharing platform; DOI generation. Excellent for sharing smaller datasets with the widest audience. |
| C-BIG | Project specific | Dynamic, ongoing cohorts, sharing while collecting | Any | Biobanking; ready-made consent form and ethics approval. Optimal for sharing patient-derived data from ongoing cohorts during the course of the project, in a tiered-access model. |

¹ DataLad is a framework that allows a dataset to be distributed (i.e., aggregated from multiple places) but shared through a single link.
² Globus is software that lets non-technical users perform easy, optimized data transfers between supported entities such as the Digital Research Alliance of Canada (DRAC).
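As a rough illustration of how Steps 1 and 2 fit together, the table above can be encoded as data and filtered by dataset size, access mode and dataset type. This is only a sketch: the `Repository` class and `shortlist` function are hypothetical names, 1 TB is taken as 1000 GB, and the "with approval" size increases are deliberately not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Repository:
    name: str
    default_limit_gb: Optional[float]  # None means "project specific"
    access_modes: frozenset            # subset of {"open", "registered", "controlled"}
    dataset_kind: str                  # "static" or "ongoing"


# Values transcribed from the table above (assumption: 1 TB = 1000 GB).
REPOSITORIES = [
    Repository("CONP", 50, frozenset({"open"}), "static"),
    Repository("FRDR", 1000, frozenset({"open"}), "static"),
    Repository("Dataverse", 20, frozenset({"open", "registered"}), "static"),
    Repository("C-BIG", None, frozenset({"open", "registered", "controlled"}), "ongoing"),
]


def shortlist(size_gb, access, dataset_kind="static"):
    """Return the names of repositories whose default terms fit the dataset."""
    matches = []
    for repo in REPOSITORIES:
        if access not in repo.access_modes:
            continue  # repository does not offer this access mode
        if dataset_kind != repo.dataset_kind:
            continue  # static vs. ongoing-cohort mismatch
        if repo.default_limit_gb is not None and size_gb > repo.default_limit_gb:
            continue  # larger than the default quota
        matches.append(repo.name)
    return matches


print(shortlist(30, "open"))        # static, open, 30 GB dataset
print(shortlist(10, "registered"))  # static, registered-access dataset
```

An empty result only means that no default entry in the table matches; it does not account for quota increases granted with approval.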
3. Prepare the dataset
Dataset preparation involves multiple steps that will vary depending on the context of the research. Generally speaking, these will include:
- De-identify the data: Data from human participants should be stripped of any direct identifiers and of potentially identifying features (e.g., the face in MRI scans). Indirect identifiers, especially when present in combination, might lead to re-identification. They should be evaluated carefully, and modified or removed if needed. For further information on de-identification, see CONP’s guide.
- Add information about the data (meta-data) by creating data descriptors and documentation: data dictionary, “README” file, etc.
- Organize and convert data files according to modality-specific data standards, when applicable (e.g., Brain Imaging Data Structure). Some harmonization tools exist to make this process easier (e.g., Neurobagel for neuroimaging and clinical data).
- In line with the FAIR principles, shared data should be in open formats as much as possible. In some cases, this will include providing data in multiple formats, so that it can be processed by computers and used by people.
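For tabular data, some of the preparation steps above could look like the following minimal sketch. It assumes a small table held in memory as a list of dictionaries; the column names in `DIRECT_IDENTIFIERS` and all function names are hypothetical. It removes direct identifiers only. Indirect identifiers (such as age combined with other fields) still need the case-by-case evaluation described above.

```python
import csv
import io

# Hypothetical column names; the real list depends on the study and its ethics protocol.
DIRECT_IDENTIFIERS = {"name", "email", "health_card_number"}


def strip_direct_identifiers(rows, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the rows without the listed direct-identifier columns."""
    return [{k: v for k, v in row.items() if k not in identifiers} for row in rows]


def draft_data_dictionary(rows):
    """Build a skeleton data dictionary: one entry per column, to be filled in by hand."""
    columns = list(rows[0].keys()) if rows else []
    return [{"variable": col, "description": "", "units": ""} for col in columns]


def to_csv(rows):
    """Serialize the rows as CSV text, an open and machine-readable format."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up example record, for illustration only.
raw = [{"participant_id": "sub-01", "name": "Jane Doe", "email": "jane@example.org", "age": "34"}]
clean = strip_direct_identifiers(raw)
print(to_csv(clean))
print(draft_data_dictionary(clean))
```

The empty `description` and `units` fields are intentional: a script can list the variables, but only the research team can document what they mean.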
4. Deposit the data
Depositing data in your chosen repository is generally straightforward, although the process varies between repositories. Common steps are:
- Entering additional high-level meta-data, which enhances findability.
- Choosing a license for your data, which clarifies what others can and cannot do with it. Depending on the details of the participant consent form and the approved research ethics protocol, you may need to write a custom license with a specific set of conditions.
The most commonly used licenses in the context of open research data are the Creative Commons licenses:
- CC0: This public domain dedication waives copyright, so others can distribute, remix, adapt, and build upon your work, even commercially, without conditions.
- CC BY: This license lets others distribute, remix, adapt, and build upon your work, even commercially, as long as they credit you for the original creation.
- CC BY-NC: This license lets others remix, adapt, and build upon your work non-commercially, as long as they credit you for the original creation. Derivative works, however, may be distributed under different terms.
- CC BY-SA: This license lets others remix, adapt, and build upon your work, even for commercial purposes, as long as they credit you and license their new creations under identical terms.
- CC BY-NC-SA: This license lets others remix, adapt, and build upon your work non-commercially, as long as they credit you and license their new creations under identical terms.
Creative Commons has created a tool to support users new to licenses. For more information about Creative Commons licenses, contact McGill’s Copyright and Digital Collections Librarian (copyright [at] mcgill.ca).
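The relationship between the five options listed above can be summarized as three yes/no questions: must reusers credit you, may they use the data commercially, and must derivatives carry the same license. The sketch below encodes only that logic; `choose_cc_license` is a hypothetical helper, not the Creative Commons tool, and it does not replace checking the consent form and ethics protocol.

```python
def choose_cc_license(require_attribution, allow_commercial=True, require_share_alike=False):
    """Map three requirements onto one of the five options listed above."""
    if not require_attribution:
        # CC0 sets no conditions, so it cannot be combined with NC or SA.
        if not allow_commercial or require_share_alike:
            raise ValueError("NC and SA conditions are only available with attribution (CC BY-...)")
        return "CC0"
    suffix = ""
    if not allow_commercial:
        suffix += "-NC"
    if require_share_alike:
        suffix += "-SA"
    return "CC BY" + suffix


print(choose_cc_license(require_attribution=True, allow_commercial=False))
```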
5. Cite your data and get cited
Research data should be recognized as a valuable output and cited appropriately, much like a publication. Proper data citation ensures you receive credit for your work and acknowledges others when reusing open data. When sharing a dataset supporting a publication, cite it directly where relevant in the publication, include it in the references list, and link to it in the Data Availability Statement (if applicable). Citing data in the references and including a persistent identifier ensures proper aggregation of citations.
When reusing open datasets, always cite the dataset itself in the reference list, including its persistent identifier (e.g., DOI). Avoid citing only the publication describing the dataset; cite the dataset directly.
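A dataset reference typically combines the creators, year, title, repository and persistent identifier. The helper below is a minimal sketch of one author-year layout; the function name, the layout and every value in the example (including the DOI) are placeholders, and the target journal's reference style takes precedence.

```python
def format_dataset_citation(creators, year, title, repository, doi, version=None):
    """Assemble a dataset reference that ends with the DOI as a resolvable link."""
    authors = "; ".join(creators)
    version_part = f" (Version {version})" if version else ""
    return (
        f"{authors} ({year}). {title}{version_part} [Data set]. "
        f"{repository}. https://doi.org/{doi}"
    )


# All values below are made up for illustration.
print(format_dataset_citation(
    creators=["Doe, J.", "Roe, R."],
    year=2024,
    title="Example imaging dataset",
    repository="Example Repository",
    doi="10.1234/example",
))
```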
Data sharing: the don'ts
- Including a statement such as “Data available upon reasonable request” in a published article is not considered a data-sharing best practice. Such requests are rarely successful (see Gabelica, 2022), and the availability of the data remains tied to the researchers or labs that generated it.
- Sharing data through links to common cloud services that are not made for long-term storage and sharing (e.g., Google Drive, Dropbox) is not best practice. These services do not provide a persistent identifier for datasets and are not indexed by data search engines. Datasets stored that way can be moved, deleted or modified at any time by the data owner, breaking the link and cutting off access.
- GitHub is great for collaborating on code and software but is not meant for sharing data. It does not provide a persistent identifier.
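The points above can be turned into a quick self-check of a draft Data Availability Statement. The sketch below is a simple text heuristic under stated assumptions: `review_statement`, the host list and the DOI pattern are illustrative choices, and passing the check does not by itself mean the data are properly shared.

```python
import re

# Hosts named in the list above as lacking persistent identifiers (illustrative list).
NON_PERSISTENT_HOSTS = ("drive.google.com", "dropbox.com", "github.com")

# Loose pattern for a DOI such as 10.1234/example (a heuristic, not a validator).
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/\S+")


def review_statement(text):
    """Return a list of potential problems found in a data availability statement."""
    lowered = text.lower()
    issues = []
    if "upon reasonable request" in lowered or "on reasonable request" in lowered:
        issues.append("relies on requests to the authors")
    for host in NON_PERSISTENT_HOSTS:
        if host in lowered:
            issues.append(f"links to {host}, which provides no persistent identifier")
    if not DOI_PATTERN.search(text):
        issues.append("no DOI found")
    return issues


print(review_statement("Data available upon reasonable request."))
```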
About this document
Unless otherwise indicated, all content on these pages is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Please attribute it to TOSI (the Tanenbaum Open Science Institute), this web page, and the contributors listed below.