
Efficient Chromosome Parallelization for Precision Medicine Genomic Workflows

Nov 20, 2025 · 7:21
Distributed, Parallel, and Cluster Computing · Artificial Intelligence · Machine Learning · Genomics

Abstract

Large-scale genomic workflows used in precision medicine can process datasets spanning tens to hundreds of gigabytes per sample, leading to high memory spikes, intensive disk I/O, and task failures due to out-of-memory errors. Simple static resource allocation methods struggle to handle the variability in per-chromosome RAM demands, resulting in poor resource utilization and long runtimes. In this work, we propose multiple mechanisms for adaptive, RAM-efficient parallelization of chromosome-level bioinformatics workflows. First, we develop a symbolic regression model that estimates per-chromosome memory consumption for a given task and introduces an interpolating bias to conservatively minimize over-allocation. Second, we present a dynamic scheduler that adaptively predicts RAM usage with a polynomial regression model, treating task packing as a Knapsack problem to optimally batch jobs based on predicted memory requirements. Additionally, we present a static scheduler that optimizes chromosome processing order to minimize peak memory while preserving throughput. Our proposed methods, evaluated on simulations and real-world genomic pipelines, provide new mechanisms to reduce memory overruns and balance load across threads. We thereby achieve faster end-to-end execution, showcasing the potential to optimize large-scale genomic workflows.

Summary

This paper addresses the challenge of managing memory efficiently in large-scale genomic workflows used in precision medicine. These workflows often process datasets that reach hundreds of gigabytes per sample, which leads to memory overruns, intensive disk I/O, and task failures. The authors propose a suite of methods for adaptive, RAM-efficient parallelization of chromosome-level bioinformatics tasks, aiming to improve resource utilization and reduce overall execution time. The core idea is to divide the genomic workload by chromosome and process the chromosomes in parallel across compute resources, so that the entire genome never has to be loaded into memory at once.

The authors present three complementary systems: a static scheduler that optimizes chromosome processing order to minimize peak memory consumption under a fixed concurrency budget; a dynamic scheduler that adaptively predicts RAM usage with polynomial regression and uses a Knapsack-style algorithm to batch jobs; and a RAM prediction module based on symbolic regression that estimates per-chromosome memory consumption from input characteristics. The symbolic regression model is distilled from an ensemble of tree-based regressors, which makes it lightweight and interpretable.

The methods are evaluated on simulations and real-world genomic pipelines, where they balance load, curb memory overruns, and reduce end-to-end execution time. Integrating these approaches into a clinical polygenic risk score (PRS) pipeline significantly reduced wall-clock time and compute costs.
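To make the static scheduler's objective concrete, the following is a minimal sketch of how processing order affects peak memory under a fixed concurrency budget. It is not the paper's optimizer: the per-chromosome RAM values are toy numbers, run time is assumed to be proportional to RAM, and the exhaustive search in the hypothetical `best_order` helper only works for a handful of tasks.

```python
import heapq
from itertools import permutations


def peak_ram(order, ram, duration, k):
    """Simulate k workers pulling tasks in `order`; return the peak total RAM."""
    running = []  # min-heap of (finish_time, ram_held)
    now = 0
    current = 0
    peak = 0
    for task in order:
        if len(running) == k:
            # All slots are busy: advance time to the earliest finish.
            now = running[0][0]
        # Release every task that has finished by `now`.
        while running and running[0][0] <= now:
            _, freed = heapq.heappop(running)
            current -= freed
        heapq.heappush(running, (now + duration[task], ram[task]))
        current += ram[task]
        peak = max(peak, current)
    return peak


def best_order(ram, duration, k):
    """Exhaustive search over orders; only feasible for a few tasks."""
    return min(permutations(ram), key=lambda order: peak_ram(order, ram, duration, k))


# Hypothetical per-chromosome RAM estimates in GB (toy values, not from the paper).
ram = {"chr1": 8, "chr2": 7, "chr3": 6, "chr4": 3, "chr5": 2, "chr6": 1}
# Assumption: run time is proportional to RAM, in arbitrary time units.
duration = dict(ram)
K = 2

sequential = tuple(ram)
optimized = best_order(ram, duration, K)
print("sequential peak:", peak_ram(sequential, ram, duration, K))
print("optimized order:", optimized, "peak:", peak_ram(optimized, ram, duration, K))
```

In this toy setup, reordering lowers the peak because large tasks end up overlapping with small ones instead of with each other. For a full set of chromosomes, exhaustive search is not feasible and a heuristic or solver would be needed.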

Key Insights

  • The static scheduler, by optimizing chromosome processing order, reduced peak RAM usage by up to 40% compared to sequential processing, particularly at low concurrency values (K).
  • The dynamic scheduler's Knapsack packing strategy consistently outperformed a Greedy algorithm in minimizing makespan, which highlights the importance of maximizing RAM utilization rather than simply the number of jobs packed.
  • Adding a percentile bias to the polynomial regression (LR) predictor in the dynamic scheduler reduced overcommitments by 38% on average while also slightly decreasing the makespan, demonstrating the effectiveness of conservative memory allocation.
  • Initializing the dynamic scheduler's predictor with the "Smallest First" chromosome processing order led to the lowest overall makespan compared to other initialization strategies, suggesting that quickly completing the initial sequential phase is more beneficial than having a more accurate initial predictor.
  • Incorporating prior information about RAM usage, even if noisy, into the dynamic scheduler eliminated the need for sequential predictor initialization, resulting in a significant decrease in makespan, especially for task sizes below 50% of total RAM.
  • The symbolic regression model, distilled from tree-based ensembles, achieved a Pearson correlation of 0.85 with ground-truth RAM usage, only slightly lower than the 0.92 achieved by the tree-based ensembles themselves, while being much easier to deploy. The paper provides the distilled model's equation.
  • Conformal prediction, applied on top of the symbolic regression model, provided a conservative estimate of RAM usage, enabling safer job scheduling without overcommitments.
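The contrast between Knapsack and Greedy packing noted above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's scheduler: predicted RAM is given in whole GB, the knapsack value of a job equals its predicted RAM, and the greedy baseline is assumed to admit the smallest jobs first. The job names, sizes, and the helpers `knapsack_batch` and `greedy_batch` are hypothetical.

```python
def knapsack_batch(pred_ram, capacity):
    """0/1 knapsack with value equal to weight: maximize RAM used within capacity."""
    # best[c] = (total RAM, chosen jobs) using at most c GB.
    best = [(0, ())] * (capacity + 1)
    for job, need in pred_ram.items():
        # Iterate capacities downward so each job is used at most once.
        for c in range(capacity, need - 1, -1):
            total, chosen = best[c - need]
            if total + need > best[c][0]:
                best[c] = (total + need, chosen + (job,))
    return best[capacity]


def greedy_batch(pred_ram, capacity):
    """Assumed baseline: admit the smallest jobs first to fit as many as possible."""
    total, chosen = 0, ()
    for job, need in sorted(pred_ram.items(), key=lambda item: item[1]):
        if total + need <= capacity:
            total, chosen = total + need, chosen + (job,)
    return total, chosen


# Hypothetical predicted RAM per pending job, in GB (not from the paper).
pending = {"chr1": 30, "chr2": 28, "chr7": 20, "chr13": 14, "chr21": 6, "chr22": 5}
free_ram = 64

knap_total, knap_jobs = knapsack_batch(pending, free_ram)
greedy_total, greedy_jobs = greedy_batch(pending, free_ram)
print("knapsack:", knap_jobs, knap_total, "GB")
print("greedy:  ", greedy_jobs, greedy_total, "GB")
```

Here the greedy baseline launches more jobs but leaves RAM idle, while the knapsack batch fills the available memory. That mirrors the trade-off between RAM utilization and job count described in the insight above.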

Practical Implications

  • These methods have direct applications in precision medicine workflows, enabling more efficient and cost-effective analysis of large-scale genomic data.
  • Bioinformatics engineers and researchers can leverage these techniques to optimize resource allocation in their pipelines, reducing memory overruns, improving CPU utilization, and decreasing overall execution time.
  • The symbolic regression approach provides a practical way to deploy RAM prediction models without the overhead of complex machine learning infrastructure. The closed-form equation generated can be easily integrated into existing workflow management systems.
  • The findings suggest that dynamic schedulers with adaptive memory prediction and Knapsack packing strategies are more effective than static or greedy approaches for managing memory in genomic workflows.
  • Future research could explore more fine-grained parallelization strategies, such as simultaneously processing subsequences within a chromosome, to further improve resource utilization and reduce execution time.
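The conservative memory prediction that these schedulers rely on can be illustrated with a small sketch: fit a polynomial to observed (input size, RAM) pairs, then shift predictions upward by a high percentile of the residuals. How the paper computes its percentile bias or conformal margin is not specified in this summary, so the nearest-rank percentile on training residuals, the quadratic degree, and all data values below are assumptions. The fit is written in pure Python to keep the example self-contained.

```python
import math


def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations."""
    n = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs  # coeffs[i] multiplies x**i


def polyval(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))


def fit_conservative(xs, ys, degree=2, quantile=0.9):
    """Fit a polynomial, then derive an upward bias from the residual quantile."""
    coeffs = polyfit(xs, ys, degree)
    residuals = sorted(y - polyval(coeffs, x) for x, y in zip(xs, ys))
    rank = math.ceil(quantile * len(residuals))  # nearest-rank percentile
    bias = max(0.0, residuals[rank - 1])
    return coeffs, bias


def predict(coeffs, bias, x):
    """Conservative prediction: fitted value plus the percentile bias."""
    return polyval(coeffs, x) + bias


# Hypothetical observations: input size (GB) vs. peak RAM (GB), not from the paper.
sizes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
peak = [3.1, 4.9, 7.4, 9.2, 12.8, 15.1, 19.5, 22.0, 27.3, 30.9, 36.8, 41.0]

coeffs, bias = fit_conservative(sizes, peak)
covered = sum(predict(coeffs, bias, x) >= y for x, y in zip(sizes, peak)) / len(sizes)
print("bias (GB):", round(bias, 3), "fraction covered:", round(covered, 3))
```

Using residuals from a held-out calibration set instead of the training data would bring this closer to split conformal prediction, at the cost of needing more completed jobs before the scheduler can rely on the margin.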

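Authors

Daniel Mas Montserrat, Ray Verma, Míriam Barrabés, Francisco M. de la Vega, Carlos D. Bustamante, Alexander G. Ioannidis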

Cite This Paper

Year: 2025
Category: cs.DC
APA

Montserrat, D. M., Verma, R., Barrabés, M., de la Vega, F. M., Bustamante, C. D., & Ioannidis, A. G. (2025). Efficient Chromosome Parallelization for Precision Medicine Genomic Workflows. arXiv preprint arXiv:2511.15977.

MLA

Daniel Mas Montserrat, Ray Verma, Míriam Barrabés, Francisco M. de la Vega, Carlos D. Bustamante, and Alexander G. Ioannidis. "Efficient Chromosome Parallelization for Precision Medicine Genomic Workflows." arXiv preprint arXiv:2511.15977 (2025).