Use Cases for Samples
The following use cases were developed by the Samples group.
- Biobanks should be able to crawl the BioSamples database to identify all the published (and searchable) datasets derived from samples they have provided.
- Public archives should be able to crawl biobank websites to identify samples that both have public accessions in the BioSamples database and can be made publicly available, thereby linking public samples to a provider (“where can I get more of this sample?”).
- Where privacy or consent considerations apply, only the biobank should know which specific samples are connected to publicly available datasets.
- Public archives should be able to crawl biobank websites to identify ‘sanitised’ sample metadata descriptions (again, where confidentiality or consent considerations apply). Biobanks remain responsible for ensuring that only authorised metadata is visible, and can control access to restricted samples.
- Each sample provided by a biobank has an opaque, pseudonymous identifier that the biobank assigns to identify that specific sample (referred to hereafter as the “sample name”).
- Each sample reported in a public archive or used to generate a public dataset has a public BioSamples database accession (hereafter called the “sample identifier”).
- In some cases, a biobank may issue different sample names when providing the same sample to different projects. This may result in duplicated sample accessions in the BioSamples database.
Given these use cases and assumptions, we will use Bioschemas to describe sample links. The main challenge is therefore identifying the links between sample names (within biobanks) and sample identifiers (accessions in the BioSamples database). This is not always possible without considerable additional curation effort. However, of the 5 million samples in the BioSamples database, over 4 million declare a ‘synonym’, ‘sample source name’ or ‘source name’ attribute, which is frequently used to encode the original biobank sample name. Exposing these attributes in a structured manner through the BioSamples database would allow biobanks to crawl and analyse this content, matching the samples they recognise to their own internal identifiers.
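The matching step described above could be sketched roughly as follows. This is a minimal illustration, not an actual BioSamples or Bioschemas API: the record layout (a dictionary with an `accession` and a dictionary of `attributes`), the function name `link_samples`, and all accessions and sample names are assumptions invented for the example. It also assumes exact string matching, whereas real attribute values would probably need normalisation and curation.

```python
from collections import defaultdict

# Attribute names that, according to the text, frequently carry the
# original biobank sample name.
NAME_ATTRIBUTES = ("synonym", "sample source name", "source name")


def link_samples(records, biobank_names):
    """Map each recognised biobank sample name to the accessions that mention it.

    `records` is a hypothetical, simplified view of crawled sample records;
    it is not the real BioSamples response format.
    """
    known = set(biobank_names)
    links = defaultdict(set)
    for record in records:
        attributes = record.get("attributes", {})
        for attribute in NAME_ATTRIBUTES:
            for value in attributes.get(attribute, []):
                # Assumption: exact string equality is enough to recognise a name.
                if value in known:
                    links[value].add(record["accession"])
    # Sort accessions so the result is deterministic.
    return {name: sorted(accessions) for name, accessions in links.items()}


# Invented records for illustration only.
records = [
    {"accession": "SAMEA0000001", "attributes": {"synonym": ["BB-42"]}},
    {"accession": "SAMEA0000002", "attributes": {"source name": ["BB-42"]}},
    {"accession": "SAMEA0000003", "attributes": {"sample source name": ["BB-99"]}},
    {"accession": "SAMEA0000004", "attributes": {"organism": ["Homo sapiens"]}},
]

# The biobank only recognises the names it issued itself.
links = link_samples(records, ["BB-42", "BB-7"])
print(links)
```

Because the lookup runs on the biobank side against its own list of names, the mapping between sample names and public accessions never has to leave the biobank, which fits the privacy and consent use case above. A name that maps to several accessions may also point at the duplicated accessions mentioned in the assumptions.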