Mango User Guide

Introduction

Mango is a distributed tool for visualizing genomic data, built on top of Apache Spark.

The Mango/Big Data Genomics Ecosystem

Mango builds on the open-source Apache Spark, Apache Avro, and Apache Parquet projects. It can be deployed for both interactive and production workflows on a variety of platforms, including Docker, Google Cloud, and Amazon EMR.

Installation

API

Mango PySpark API Documentation
- bdgenomics.mango Package
  - Alignments
  - Coverage
  - Features
  - Variants
  - Genotypes

Jupyter Widget API Documentation
- bdgenomics.mango.pileup Package

Supported File Types

Supported Files

Running with Docker

Running Mango from Docker

Google Cloud

Amazon EMR

Development

Development notes for the Mango Browser
- Debugging the Mango browser frontend
