What is BioCompute?
Tremendous insights can be found in genome data, and many of these insights are being used to drive personalized medicine. But the hundreds of millions of reads that come from a gene sequencer represent small, nearly random fragments of the genome that’s being sequenced, and there are countless ways in which that data can be transformed to yield insights into cancer, ancestry, microbiome dynamics, metagenomics, and many other areas of interest.
Because there are so many different platforms, scripts, and tools for analyzing genome data, there is a great need to standardize the way in which analysis steps are communicated. The more steps an analysis has and the more complicated the pipeline, the greater the need for a standardized mechanism of communication. The BioCompute standard brings clarity to an analysis, making it transparent and reproducible.
A BioCompute Object (BCO) is an instance of the BioCompute standard, and is a computational record of a bioinformatics pipeline. A BCO is not an analysis, but is a record of which analyses were executed and in exactly which ways. In this way, a BCO acts as an interface for existing standards. A BCO contains all of the necessary information to repeat an entire pipeline from FASTQ to result, and includes additional metadata to identify provenance and usage.
How is BioCompute Different from Workflow Languages?
BioCompute is not just a workflow language. The independent nature of laboratories means that data and data processing are very difficult to share or reproduce, substantially limiting their utility. More than 250 “workflow languages” have been written to help bridge the gaps between computational environments, including WDL, Snakemake, Nextflow, and Common Workflow Language (CWL). These languages are great resources for managing a workflow for reimplementation, and have been recognized as critical components for turning data into scientific insights.
While BioCompute, using workflow languages, does retain computational environment, dependency, and parametric data in machine-readable form to ease re-implementation, it is primarily designed to fill the descriptive space using the Usability Domain and additional key components such as the Error Domain and Verification Kit. BioCompute Objects (“BCOs”: computational analysis reports written according to the BioCompute standard) are meant to be both human and machine readable. They include a mechanism to describe the computational steps, a way to assign attribution to analysis authors, editors, reviewers, and others, and a free text field for describing purpose and context. So when a computational analysis is submitted to a journal or regulatory authority, it is a standalone object that can be understood in its entirety and, using the Verification Kit, can also be run.
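To make the "human and machine readable" idea concrete, here is a minimal Python sketch of a BCO-like record. It is an illustration only: the key names (`usability_domain`, `provenance_domain`, `description_domain`, `pipeline_steps`, `role`) and the `summarize` helper are assumptions chosen to mirror the components described above, not the official BioCompute schema.

```python
import json

# Hypothetical, heavily simplified BCO-like record. The key names below are
# illustrative assumptions, not the official BioCompute schema.
example_bco = {
    "usability_domain": [
        "Free-text statement of purpose: call variants in a tumor sample."
    ],
    "provenance_domain": {
        "name": "Example variant-calling pipeline",
        "contributors": [
            {"name": "A. Author", "role": "author"},
            {"name": "R. Reviewer", "role": "reviewer"},
        ],
    },
    "description_domain": {
        "pipeline_steps": [
            {"step_number": 1, "name": "align reads to reference"},
            {"step_number": 2, "name": "call variants"},
        ]
    },
}


def summarize(bco):
    """Return a short human-readable summary of a BCO-like dict."""
    name = bco["provenance_domain"]["name"]
    people = ", ".join(
        f"{c['name']} ({c['role']})"
        for c in bco["provenance_domain"]["contributors"]
    )
    steps = " -> ".join(
        s["name"] for s in bco["description_domain"]["pipeline_steps"]
    )
    purpose = bco["usability_domain"][0]
    return f"{name}\nPurpose: {purpose}\nPeople: {people}\nSteps: {steps}"


# Machine-readable form: plain JSON that other tools can parse.
serialized = json.dumps(example_bco, indent=2)

# Human-readable form: derived from the same record.
print(summarize(example_bco))
```

The point of the sketch is that one record serves both audiences: the JSON can be parsed by software, while the purpose, attribution, and step list can be rendered for a reviewer without reading the underlying workflow code.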
BCOs bridge the gap between computational instructions and human understanding. Every workflow has inherent organizational and language idiosyncrasies or customs, and BCOs make conceptual understanding much easier without getting lost in these details. Because BCOs are also machine readable, the data presentation can be easily formatted to focus on relevant information, compare two BCOs, evaluate inherent error in pipelines, and more. For this reason, BCOs also substantially help to reduce organizational burden.
BCOs have a built-in “Error Domain” for recording the limits of the pipeline as part of the “Verification Kit.” The Verification Kit makes it very easy to check that input data is within the range that the pipeline is capable of working with, and is another way that BCOs make it easier to understand the overall logical flow.
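The kind of range check described above can be sketched in a few lines of Python. This is a hedged illustration, not the BioCompute implementation: the `check_inputs` function, the metric names, and the numeric limits are all hypothetical, standing in for whatever limits a real Error Domain records.

```python
def check_inputs(limits, observed):
    """Compare observed input metrics against recorded pipeline limits.

    limits:   dict mapping metric name -> (low, high), inclusive bounds
    observed: dict mapping metric name -> measured value
    Returns a list of problem descriptions; an empty list means every
    metric is present and within range.
    """
    problems = []
    for metric, (low, high) in limits.items():
        if metric not in observed:
            problems.append(f"{metric}: no value provided")
            continue
        value = observed[metric]
        if not (low <= value <= high):
            problems.append(f"{metric}: {value} outside [{low}, {high}]")
    return problems


# Hypothetical limits, as an Error Domain might record them.
limits = {
    "mean_read_length": (50, 300),
    "mean_base_quality": (20, 45),
}

good_sample = {"mean_read_length": 150, "mean_base_quality": 30}
short_reads = {"mean_read_length": 25, "mean_base_quality": 30}

print(check_inputs(limits, good_sample))
print(check_inputs(limits, short_reads))
```

Returning a list of problems rather than a single true/false value keeps the result readable by people while still being easy for software to act on.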
To demonstrate the power that a descriptive component can add, BioCompute partnered with other standards to create examples of joint reports using Common Workflow Language and Nextflow/Research Objects (a tutorial is available for the latter).
BioCompute was built through a collaboration between The George Washington University and the FDA to improve communication of bioinformatics pipelines, and has since been expanded and refined through the participation of hundreds of collaborators from throughout the public and private sectors. While we welcome interest and membership from anyone, most users will fall into one of three categories:
The BioCompute standard can help substantially improve replicability, making it possible to repeat a pipeline on a different sample with high fidelity and high confidence.
As BioCompute Objects become tested and validated, they can be applied in the clinic to identify risk factors, flag pharmacogenetic information, and much more.
Pharma, Biotech and Regulatory Pipeline
Protracted communications with the FDA can extend the review process by months. A standardized method of communicating high-throughput sequencing (HTS) data may make it possible to repeat results more quickly and without the need for additional communication.
Research, clinical, and regulatory groups are key drivers of personalized medicine based on next-generation sequencing, but there are barriers between these groups. BioCompute reduces these hurdles and brings transparency to the workflow, making it clearer what was done and clearly delineating expectations for data sharing. The BioCompute specification can be layered with other privacy and security protocols to guard sensitive data, or be made open source, depending on the needs of the user.
The BioCompute project has generated two publications, three workshops, FDA funding, contributions from over 300 participants, and FDA submissions. The project has worked with individuals from NIH, Harvard, several biotech and pharma companies, EMBL-EBI, Galaxy Project, and many more, and can be integrated with any existing standard for HTS data. The project is expected to be both an IEEE and ISO recognized standard within 8-10 months.
More information about the current BioCompute standard can be found on the Open Science Foundation website (where the standard is developed and maintained), the HIVE website, and the Research Objects discussion of BioCompute.
Milestones in the BioCompute Program
The major milestones of the BioCompute Partnership and its future goals are paving the way for a consensus-driven, widely adopted standard. The FDA’s Genomics Working Group (GWG) originally articulated the challenges of communicating genomic analysis pipelines in a regulatory context in 2013. Since then, the project has accumulated tremendous momentum, a testament to the GWG’s efforts in describing communication challenges. More recently, the second BioCompute publication has been published, the 4th Workshop has been scheduled, and the next major goal is the formal launch of the BioCompute Public Private Partnership. The Executive Committee will formalize the future roadmap beyond these goals.