November 13th, 2020
Topic: Application of Genomics Big Data on the Cancer Cloud: Making use of difficult data
Presenter: Dr. John Torcivia, Director of AI Deployment, Clarifai, Inc., Department of Biochemistry, George Washington University
In many genomics applications, large data sets are created and lightly used before being shared with other researchers (ideally) or simply left to sit on hard drives. The Cancer Cloud project has enabled some of this very large data to be shared among qualified researchers in order to facilitate a greater understanding of oncogenesis. One issue that comes up repeatedly, however, is that simply using the data requires specialized skills outside of the biological realm. A blend of computer science and biology is required to access the data and run computations on it appropriately once it grows too large for standard approaches to scale. This presentation goes over an application on the ISB Cancer Cloud in which whole genome sequencing (WGS) data was used to generate variant calls for downstream research. Because of the size of the whole genome sequences, doing this on lab computers was cost-prohibitive, so the work had to be done in the cloud. Also because of the size of the data, custom processes had to be put in place to manage and queue the computations, and to parallelize them and reconstruct the results properly. This workflow has been made available as open source for adaptation to other pipelines, and the WGS variant data is being made available to qualified researchers in the Cancer Cloud.
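As a rough illustration of the "parallelize and reconstruct" idea mentioned in the abstract, the sketch below splits each chromosome into fixed-size windows, processes the windows concurrently, and reassembles the per-window results in genomic order. It is not the presenter's workflow: the function names, the window size, and the placeholder `call_variants` (which stands in for a real variant caller) are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_regions(chrom_lengths, window):
    """Scatter step: cut each chromosome into half-open [start, end) windows."""
    regions = []
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, window):
            regions.append((chrom, start, min(start + window, length)))
    return regions


def call_variants(region):
    """Hypothetical stand-in for running a real variant caller on one region.

    It returns a single placeholder record per window so the gather step
    has something to reassemble.
    """
    chrom, start, end = region
    return [(chrom, start, end)]


def scatter_gather(chrom_lengths, window, workers=4):
    """Run the placeholder caller on every window and merge results in order."""
    regions = split_regions(chrom_lengths, window)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, even if later windows
        # finish first, so the merged output stays in genomic order.
        per_region = list(pool.map(call_variants, regions))
    # Gather step: flatten the per-window lists into one ordered list.
    return [record for records in per_region for record in records]


if __name__ == "__main__":
    # Toy chromosome lengths, assumed for illustration only.
    for record in scatter_gather({"chr1": 250, "chr2": 100}, window=100):
        print(record)
```

In a real cloud setting each window would more likely be submitted as a separate job to a queue or batch service rather than to a local thread pool, but the scatter, process, and ordered-gather structure would stay the same.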