Thank you for Subscribing to CIO Applications Weekly Brief
Best Practices of Clinical Bioinformatics and Data Science
Sarah Wang, Director, Bioinformatics & Data Science, CareDx Inc.
The core of any Bioinformatic engine for diagnostic purposes is in the clinical utility algorithm built with public or proprietary tools. The prime value of an algorithm is to minister the level of accuracy, sensitivity and robustness that is tailored to the specific need of a diagnostic test. Research and Development grade Bioinformatics pipelines can be considered successful with the core value being delivered at the level of an individual laboratory. On the other hand, clinical-grade Bioinformatics requires transparency, repeatability, reproducibility and robustness so as to ensure interoperability across multiple clinical laboratories.
Because of the regulated nature of clinical diagnostics, additional and critical features should be designed into the Bioinformatic system. These features are built around the core of the algorithm and jointly offer the immanent values of clinical-grade Bioinformatics as a whole. Some of the qualities to consider when designing the peripheral system of the clinical computation edifice are validation efficiency, operational robustness, computation scalability, utilization and platform adaptability, rapid results turnaround, and integration of effective quality control procedures. The skin that forms the shield and roots into the clinical Bioinformatics engine is the privacy, security and compliance component. Moreover, clinical Bioinformatic processes usually encompass or directly connect to the LIMS system, and the data storage and retrieval apparatus. We should set forth to design a system that keeps the operational cost-effective, not only for the capital investment of instruments and infrastructure, but also contemplating the maintenance and associated labor cost. The infrastructure should meet the short-term business goal but also should be sustainable to support specific longer-term business goals. Establishing a holistic view and evaluating benefits and risk factors of alternatives before development starts will save many detours and inefficient back tracks as a winning strategy for the long run.
Usually multiple features jointly deliver one quality value mentioned above and each feature simultaneously contributes to multiple quality values.
Though well recognized, to what extent and how the modularity can be designed into the Bioinformatics system could vary drastically among Bioinformatics professionals. For a Bioinformatics pipeline built on an HPC, all secondary analyses could be built into one module and all tertiary analyses into another module, then multiple threading is used to process all the samples within each sequencing run. For a pipeline in the cloud, it could be beneficial to break the secondary and tertiary modules into finer step-wise components, with each step as one module and assign a cloud instance for each sample. In this way, the success or result turnarounds of one sample is not dependent on other samples within the run.Further, processing time for each sample can be accelerated with multiple threads and/or higher configuration of a cloud instance as needed. To reduce the operational cost, the unused compute time of the designated cloud instances can be recycled. The implementation and validation process for upgrading tools, modifying parameters or algorithms, adding new product pipelines, debugging, and routine operation calibrations would be swift with minimal need of human capital. Another example is incorporating operational QC procedures into Bioinformatics processes, such as checking Cluster Densities in early sequencing cycles of NGS instruments to allow the earliest possible detection of experimental failures to allow early termination and revamp of experiments, thus saving both time and cost.
With AI/ML entering the clinical domain amalgamated with bioinformatics, transparent, modularized and deterministic algorithms should be focused on to secure sustainable early wins
While the validation efficiency can be achieved within an organization with thoughtful designs, there is undoubtedly great redundancy in validations across organizations: the same tools are often used by different organizations to perform the same tasks, same variants are being tackled by different organizations with different accuracy and sensitivity claims, and so on. There is great value in leveraging community efforts to improve the community-wide validation efficiency. Benchmarking variants, an initiative by NIST GIAB is a good example in this area. However, much more improvement is needed. For example, the orthogonality, comparability and quality of the benchmarking and proficiency data should be carefully evaluated. If there are three sets of data used, but two of them were from the same sequencing platform, the intrinsic errors in the redundant platform will contribute to the false positives and false negatives. Directionally synchronized, the FDA launched precision FDA (https://precision.fda.gov/) in 2016, a platform wherein the community can use for“NGS assay evaluation and regulatory science exploration”. But benchmarking Bioinformatics tools and pipelines is still lacking. A standardized set of data, optimally being synthesized from empirical data that has removed or balanced platform or chemistry biases from orthogonal resources, should be used to measure computational performance independent of assays. A set of community-accepted computation processes could be established as the standard for validation on common processes across the community. This endeavor would not only benefit research and operation for the members but also would assist regulatory agencies in formalizing oversight.
The regulated nature of clinical diagnostics sets high-level requirements on transparency and reproducibility on Bioinformatic software. This could be challenging with AI (machine learning) entering the clinical domain amalgamated with Bioinformatics. Certain principles need be taken to secure early sustainable wins. Fully transparent, finely modularized and deterministic algorithms should be the focus on the first pass.
The uniqueness of Bioinformatics resides in its multiple disciplinary integration, powerful signal extraction from massive amounts of noisy data and integration of data science. The clinical diagnostic Bioinformatic engine exists within a regulated environment. With both Bioinformatics and regulatory sciences evolving, it will take more than just traditional domain knowledge of one field to move the area forward at a timely pace to benefit patients in near future.