the bioinformatics chat

70 Episodes


#70 Prioritizing drug target genes with Marie Sadler

2023-12-21 52:20

In this episode, Marie Sadler talks about her recent Cell Genomics paper, Multi-layered genetic approaches to identify approved drug targets. Previous studies have found that the drugs that target a gene linked to the disease are more likely to be approved. Yet there are many ways to define what it means for a gene to be linked to the disease. Perhaps the most straightforward approach is to rely on the genome-wide association studies (GWAS) data, but that data can also be integrated with quantitative trait loci (eQTL or pQTL) information to establish less obvious links between genetic variants (which often lie outside of genes) and genes. Finally, there’s exome sequencing, which, unlike GWAS, captures rare genetic variants. So in this paper, Marie and her colleagues set out to benchmark these different methods against one another. Listen to the episode to find out how these methods work, which ones work better, and how network propagation can improve the prediction accuracy. Links: Multi-layered genetic approaches to identify approved drug targets (Marie C. Sadler, Chiara Auwerx, Patrick Deelen, Zoltán Kutalik) Marie on GitHub Interview with Mariana Mamonova, the Ukrainian marine infantry combat medic who spent 6 months in russian captivity while pregnant Thank you to Jake Yeung, Michael Weinstein, and other Patreon members for supporting this episode.

#69 Suffix arrays in optimal compressed space and δ-SA with Tomasz Kociumaka and Dominik Kempa

2023-09-29 56:46

Today on the podcast we have Tomasz Kociumaka and Dominik Kempa, the authors of the preprint Collapsing the Hierarchy of Compressed Data Structures: Suffix Arrays in Optimal Compressed Space. The suffix array is one of the foundational data structures in bioinformatics, serving as an index that allows fast substring searches in a large text. However, in its raw form, the suffix array occupies the space proportional to (and several times larger than) the original text. In their paper, Tomasz and Dominik construct a new index, δ-SA, which on the one hand can be used in the same way (answer the same queries) as the suffix array and the inverse suffix array, and on the other hand, occupies the space roughly proportional to the gzip’ed text (or, more precisely, to the measure δ that they define — hence the name). Moreover, they mathematically prove that this index is optimal, in the sense that any index that supports these queries — or even much weaker queries, such as simply accessing the i-th character of the text — cannot be significantly smaller (as a function of δ) than δ-SA. Links: Collapsing the Hierarchy of Compressed Data Structures: Suffix Arrays in Optimal Compressed Space (Dominik Kempa, Tomasz Kociumaka) Thank you to Jake Yeung and other Patreon members for supporting this episode.
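To make the idea of the plain (uncompressed) suffix array concrete, here is a minimal Python sketch. It is not δ-SA and not the authors' construction; the function names are hypothetical, and the naive sorting-based construction is chosen only for readability.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix itself.
    # Fine for illustration; real tools use linear-time algorithms.
    return sorted(range(len(text)), key=lambda i: text[i:])


def inverse_suffix_array(sa):
    # isa[i] is the rank of the suffix starting at position i.
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return isa


def find_occurrences(text, sa, pattern):
    # All suffixes starting with `pattern` form one contiguous block
    # of the suffix array, so two binary searches locate it.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)                                  # [5, 3, 1, 0, 4, 2]
print(inverse_suffix_array(sa))            # [3, 2, 5, 1, 4, 0]
print(find_occurrences(text, sa, "ana"))   # [1, 3]
```

The array stores one integer per text position, which is why the raw index is several times larger than the text it indexes.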

#68 Phylogenetic inference from raw reads and Read2Tree with David Dylus

2023-08-28 49:11

In this episode, David Dylus talks about Read2Tree, a tool that builds alignment matrices and phylogenetic trees from raw sequencing reads. By leveraging the database of orthologous genes called OMA, Read2Tree bypasses traditional, time-consuming steps such as genome assembly, annotation and all-versus-all sequence comparisons. Links: Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree (David Dylus, Adrian Altenhoff, Sina Majidian, Fritz J. Sedlazeck, Christophe Dessimoz) Background story Read2Tree on GitHub OMA browser The Guardian’s podcast about Victoria Amelina and Volodymyr Vakulenko If you enjoyed this episode, please consider supporting the podcast on Patreon.

#67 AlphaFold and variant effect prediction with Amelie Stein

2023-07-29 35:25

This is the third and final episode in the AlphaFold series, originally recorded on February 23, 2022, with Amelie Stein, now an associate professor at the University of Copenhagen. In the episode, Amelie explains what 𝛥𝛥G is, how it informs us whether a particular protein mutation affects its stability, and how AlphaFold 2 helps in this analysis. A note from Amelie: Something that has happened in the meantime is the publication of methods that predict 𝛥𝛥G with ML methods, so much faster than Rosetta. One of them, RaSP, is from our group, while ddMut is from another subset of authors of the AF2 community assessment paper. Other links: A structural biology community assessment of AlphaFold2 applications (Mehmet Akdel, Douglas E. V. Pires, Eduard Porta Pardo, Jürgen Jänes, Arthur O. Zalevsky, Bálint Mészáros, Patrick Bryant, Lydia L. Good, Roman A. Laskowski, Gabriele Pozzati, Aditi Shenoy, Wensi Zhu, Petras Kundrotas, Victoria Ruiz Serra, Carlos H. M. Rodrigues, Alistair S. Dunham, David Burke, Neera Borkakoti, Sameer Velankar, Adam Frost, Jérôme Basquin, Kresten Lindorff-Larsen, Alex Bateman, Andrey V. Kajava, Alfonso Valencia, Sergey Ovchinnikov, Janani Durairaj, David B. Ascher, Janet M. Thornton, Norman E. Davey, Amelie Stein, Arne Elofsson, Tristan I. Croll & Pedro Beltrao) A crime in the making: Russia’s atrocities — the podcast episode about the Olenivka prison massacre If you enjoyed this episode, please consider supporting the podcast on Patreon.

#66 AlphaFold and shape-mers with Janani Durairaj

2023-07-10 20:51

This is the second episode in the AlphaFold series, originally recorded on February 14, 2022, with Janani Durairaj, a postdoctoral researcher at the University of Basel. Janani talks about how she used shape-mers and topic modelling to discover classes of proteins assembled by AlphaFold 2 that were absent from the Protein Data Bank (PDB). The bioinformatics discussion starts at 03:35. Links: A structural biology community assessment of AlphaFold2 applications (Mehmet Akdel, Douglas E. V. Pires, Eduard Porta Pardo, Jürgen Jänes, Arthur O. Zalevsky, Bálint Mészáros, Patrick Bryant, Lydia L. Good, Roman A. Laskowski, Gabriele Pozzati, Aditi Shenoy, Wensi Zhu, Petras Kundrotas, Victoria Ruiz Serra, Carlos H. M. Rodrigues, Alistair S. Dunham, David Burke, Neera Borkakoti, Sameer Velankar, Adam Frost, Jérôme Basquin, Kresten Lindorff-Larsen, Alex Bateman, Andrey V. Kajava, Alfonso Valencia, Sergey Ovchinnikov, Janani Durairaj, David B. Ascher, Janet M. Thornton, Norman E. Davey, Amelie Stein, Arne Elofsson, Tristan I. Croll & Pedro Beltrao) The Protein Universe Atlas What is hidden in the darkness? Deep-learning assisted large-scale protein family curation uncovers novel protein families and folds (Janani Durairaj, Andrew M. Waterhouse, Toomas Mets, Tetiana Brodiazhenko, Minhal Abdullah, Gabriel Studer, Mehmet Akdel, Antonina Andreeva, Alex Bateman, Tanel Tenson, Vasili Hauryliuk, Torsten Schwede, Joana Pereira) Geometricus: Protein Structures as Shape-mers derived from Moment Invariants on GitHub The group page The Folded Weekly newsletter A New York Times article about the Kramatorsk missile strike. The Instagram video, part of which you can hear at the beginning of the episode, appears to have been deleted. If you enjoyed this episode, please consider supporting the podcast on Patreon.

#65 AlphaFold and protein interactions with Pedro Beltrao

2023-06-21 52:23

In this episode, originally recorded on February 9, 2022, Roman talks to Pedro Beltrao about AlphaFold, the software developed by DeepMind that predicts a protein’s 3D structure from its amino acid sequence. Pedro is an associate professor at ETH Zurich and the coordinator of the structural biology community assessment of AlphaFold2 applications project, which involved over 30 scientists from different institutions. Pedro talks about the origins of the project, its main findings, the importance of the confidence metric that AlphaFold assigns to its predictions, and Pedro’s own area of interest — predicting pockets in proteins and protein-protein interactions. Links: A structural biology community assessment of AlphaFold2 applications (Mehmet Akdel, Douglas E. V. Pires, Eduard Porta Pardo, Jürgen Jänes, Arthur O. Zalevsky, Bálint Mészáros, Patrick Bryant, Lydia L. Good, Roman A. Laskowski, Gabriele Pozzati, Aditi Shenoy, Wensi Zhu, Petras Kundrotas, Victoria Ruiz Serra, Carlos H. M. Rodrigues, Alistair S. Dunham, David Burke, Neera Borkakoti, Sameer Velankar, Adam Frost, Jérôme Basquin, Kresten Lindorff-Larsen, Alex Bateman, Andrey V. Kajava, Alfonso Valencia, Sergey Ovchinnikov, Janani Durairaj, David B. Ascher, Janet M. Thornton, Norman E. Davey, Amelie Stein, Arne Elofsson, Tristan I. Croll & Pedro Beltrao) Pedro’s group at ETH Zurich If you enjoyed this episode, please consider supporting the podcast on Patreon.

#64 Enformer: predicting gene expression from sequence with Žiga Avsec

2021-11-09 59:41

In this episode, Jacob Schreiber interviews Žiga Avsec about a recently released model, Enformer. Their discussion begins with the differences between academia and industry, specifically how research is conducted in the two settings. Then, they discuss the Enformer model, how it builds on previous work, and the potential that models like it have for genomics research in the future. Finally, they have a high-level discussion on the state of modern deep learning libraries and which ones they use in their day-to-day development. Links: Effective gene expression prediction from sequence by integrating long-range interactions (Žiga Avsec, Vikram Agarwal, Daniel Visentin, Joseph R. Ledsam, Agnieszka Grabska-Barwinska, Kyle R. Taylor, Yannis Assael, John Jumper, Pushmeet Kohli & David R. Kelley) DeepMind Blog Post (Žiga Avsec) If you enjoyed this episode, please consider supporting the podcast on Patreon.

#63 Bioinformatics Contest 2021 with Maksym Kovalchuk and James Matthew Holt

2021-09-27 01:00:47

The Bioinformatics Contest is back this year, and we are back to discuss it! This year’s contest winners Maksym Kovalchuk (1st prize) and Matt Holt (2nd prize) talk about how they approach participating in the contest and what strategies have earned them the top scores. Timestamps and links for the individual problems: 00:10:36 Genotype Imputation 00:21:26 Causative Mutation 00:30:27 Superspreaders 00:37:22 Minor Haplotype 00:46:37 Isoform Matching Links: Matt’s solutions Max’s solutions If you enjoyed this episode, please consider supporting the podcast on Patreon.

#62 Steady states of metabolic networks and Dingo with Apostolos Chalkis

2021-07-28 38:25

In this episode, Apostolos Chalkis presents sampling steady states of metabolic networks as an alternative to the widely used flux balance analysis (FBA). We also discuss dingo, a Python package written by Apostolos that employs geometric random walks to sample steady states. You can see dingo in action here. Links: Dingo on GitHub Searching for COVID-19 treatments using metabolic networks Tweag open source fellowships This episode was originally published on the Compositional podcast. If you enjoyed this episode, please consider supporting the podcast on Patreon.

#61 3D genome organization and GRiNCH with Da-Inn Erika Lee

2021-06-23 01:09:41

In this episode, Jacob Schreiber interviews Da-Inn Erika Lee about data and computational methods for making sense of 3D genome structure. They begin their discussion by talking about 3D genome structure at a high level and the challenges in working with such data. Then, they discuss a method recently developed by Erika, named GRiNCH, that mines this data to identify spans of the genome that cluster together in 3D space and potentially help control gene regulation. Links: GRiNCH: simultaneous smoothing and detection of topological units of genome organization from sparse chromatin contact count matrices with matrix factorization (Da-Inn Lee and Sushmita Roy) GRiNCH Project Page In silico prediction of high-resolution Hi-C interaction matrices (Shilu Zhang, Deborah Chasman, Sara Knaack, and Sushmita Roy) If you enjoyed this episode, please consider supporting the podcast on Patreon.

#60 Differential gene expression and DESeq2 with Michael Love

2021-05-12 01:31:15

In this episode, Michael Love joins us to talk about differential gene expression analysis from bulk RNA-Seq data. We talk about the history of Mike’s own differential expression package, DESeq2, as well as other packages in this space, like edgeR and limma, and the theory they are based upon. Mike also shares his experience of being the author and maintainer of a popular bioinformatics package.

Links:
Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 (Love, M.I., Huber, W. & Anders, S.)
DESeq2 on Bioconductor
Chan Zuckerberg Initiative: Ensuring Reproducible Transcriptomic Analysis with DESeq2 and tximeta

And a more comprehensive set of links from Mike himself:

limma, the original paper and limma-voom:
https://pubmed.ncbi.nlm.nih.gov/16646809/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053721/

edgeR papers:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2796818/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3378882/

The recent manuscript mentioned from the Kendziorski lab, which has a Gamma-Poisson hierarchical structure, although it does not in general reduce to the Negative Binomial:
https://doi.org/10.1101/2020.10.28.359901

We talk about robust steps for estimating the middle of the dispersion prior distribution; references are Anders and Huber 2010 (DESeq), Eling et al 2018 (one of the BASiCS papers), and Phipson et al 2016:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3218662/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6167088/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5373812/

The Stan software:
https://mc-stan.org/

We talk about using publicly available data as a prior; references I mention are the McCall et al paper using publicly available data to ask if a gene is expressed, and a new manuscript from my lab that compares splicing in a sample to GTEx as a reference panel:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3013751/
https://doi.org/10.1101/856401

Regarding estimating the width of the dispersion prior, references are the Robinson and Smyth 2007 paper, McCarthy et al 2012 (edgeR), and Wu et al 2013 (DSS):
https://pubmed.ncbi.nlm.nih.gov/17881408/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3378882/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3590927/

Schurch et al 2016, an RNA-seq dataset with many replicates, helpful for benchmarking:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4878611/

Stephens paper on the false sign rate (ash):
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5379932/

Heavy-tailed distributions for effect sizes, Zhu et al 2018:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6581436/

I credit Kevin Blighe and Alexander Toenges, who help to answer lots of DESeq2 questions on the support site:
https://www.biostars.org/u/41557/
https://www.biostars.org/u/25721/

The EOSS award, which has funded vizWithSCE by Kwame Forbes, and nullranges by Wancen Mu and Eric Davis:
https://chanzuckerberg.com/eoss/proposals/ensuring-reproducible-transcriptomic-analysis-with-deseq2-and-tximeta/
https://kwameforbes.github.io/vizWithSCE/
https://nullranges.github.io/nullranges/

One of the recent papers from my lab, MRLocus for eQTL and GWAS integration:
https://mikelove.github.io/mrlocus/

If you enjoyed this episode, please consider supporting the podcast on Patreon.

#59 Proteomics calibration with Lindsay Pino

2021-04-21 48:26

In this episode, Lindsay Pino discusses the challenges of making quantitative measurements in the field of proteomics. Specifically, she discusses the difficulties of comparing measurements across different samples, potentially acquired in different labs, as well as a method she has developed recently for calibrating these measurements without the need for expensive reagents. The discussion then turns more broadly to questions in genomics that can potentially be addressed using proteomic measurements. Links: Talus Bioscience Matrix-Matched Calibration Curves for Assessing Analytical Figures of Merit in Quantitative Proteomics (Lindsay K. Pino, Brian C. Searle, Han-Yin Yang, Andrew N. Hoofnagle, William S. Noble, and Michael J. MacCoss) If you enjoyed this episode, please consider supporting the podcast on Patreon.

#58 B cell maturation and class switching with Hamish King

2021-03-31 01:29:11

In this episode, we learn about B cell maturation and class switching from Hamish King. Hamish recently published a paper on this subject in Science Immunology, where he and his coauthors analyzed gene expression and antibody repertoire data from human tonsils. In the episode Hamish talks about some of the interesting B cell states he uncovered and shares his thoughts on questions such as «When does a B cell decide to class-switch?» and «Why is the antibody isotype correlated with its affinity?» Links: Single-cell analysis of human B cell maturation predicts how antibody class switching shapes selection dynamics (Hamish W. King, Nara Orban, John C. Riches, Andrew J. Clear, Gary Warnes, Sarah A. Teichmann, Louisa K. James) (paywalled by Science Immunology) Antibody repertoire and gene expression dynamics of diverse human B cell states during affinity maturation (the preprint of the above Science Immunology paper) www.tonsilimmune.org: An immune cell atlas of the human tonsil and B cell maturation If you enjoyed this episode, please consider supporting the podcast on Patreon.

#57 Enhancers with Molly Gasperini

2021-03-10 46:57

In this episode, Jacob Schreiber interviews Molly Gasperini about enhancer elements. They begin their discussion by talking about Octant Bio, and then dive into the surprisingly difficult task of defining enhancers and determining the mechanisms that enable them to regulate gene expression. Links: Octant Bio Towards a comprehensive catalogue of validated and target-linked human enhancers (Molly Gasperini, Jacob M. Tome, and Jay Shendure) If you enjoyed this episode, please consider supporting the podcast on Patreon.

#56 Polygenic risk scores in admixed populations with Bárbara Bitarello

2021-02-17 01:30:12

Polygenic risk scores (PRS) rely on the genome-wide association studies (GWAS) to predict the phenotype based on the genotype. However, the prediction accuracy suffers when GWAS from one population are used to calculate PRS within a different population, which is a problem because the majority of the GWAS are done on cohorts of European ancestry. In this episode, Bárbara Bitarello helps us understand how PRS work and why they don’t transfer well across populations. Links: Polygenic Scores for Height in Admixed Populations (Bárbara D. Bitarello, Iain Mathieson) What is ancestry? (Iain Mathieson, Aylwyn Scally) If you enjoyed this episode, please consider supporting the podcast on Patreon.
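As a rough illustration of the basic calculation, a polygenic score is commonly computed as a weighted sum of allele dosages, with GWAS effect sizes as the weights. The sketch below assumes that simple additive form; the variant IDs, effect sizes, and function name are hypothetical, and it ignores the variant selection, linkage disequilibrium, and ancestry issues discussed in the episode.

```python
def polygenic_score(dosages, effect_sizes):
    # dosages: variant ID -> number of effect alleles carried (0, 1, or 2)
    # effect_sizes: variant ID -> per-allele effect estimated by a GWAS
    # Variants missing from the GWAS summary statistics are skipped.
    return sum(
        effect_sizes[variant] * dosage
        for variant, dosage in dosages.items()
        if variant in effect_sizes
    )


# Hypothetical effect sizes and one individual's genotype.
effect_sizes = {"rs1": 0.5, "rs2": -0.2, "rs3": 0.1}
dosages = {"rs1": 2, "rs2": 1, "rs3": 0, "rs4": 1}

print(polygenic_score(dosages, effect_sizes))
```

Because the weights come from the GWAS cohort, any mismatch in allele frequencies or linkage patterns between that cohort and the target population carries straight into the score.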

#55 Phylogenetics and the likelihood gradient with Xiang Ji

2021-01-13 57:02

In this episode, we chat about phylogenetics with Xiang Ji. We start with a general introduction to the field and then go deeper into the likelihood-based methods (maximum likelihood and Bayesian inference). In particular, we talk about the different ways to calculate the likelihood gradient, including a linear-time exact gradient algorithm recently published by Xiang and his colleagues. Links: Gradients Do Grow on Trees: A Linear-Time O(N)-Dimensional Gradient for Statistical Phylogenetics (Xiang Ji, Zhenyu Zhang, Andrew Holbrook, Akihiko Nishimura, Guy Baele, Andrew Rambaut, Philippe Lemey, Marc A Suchard) BEAGLE: the package that implements the gradient algorithm BEAST: the program that implements the Hamiltonian Monte Carlo sampler and the molecular clock models If you enjoyed this episode, please consider supporting the podcast on Patreon.

#54 Seeding methods for read alignment with Markus Schmidt

2020-12-16 01:00:46

In this episode, Markus Schmidt explains how seeding in read alignment works. We define and compare k-mers, minimizers, MEMs, SMEMs, and maximal spanning seeds. Markus also presents his recent work on computing variable-sized seeds (MEMs, SMEMs, and maximal spanning seeds) from fixed-sized seeds (k-mers and minimizers) and his Modular Aligner. Links: A performant bridge between fixed-size and variable-size seeding (Arne Kutzner, Pok-Son Kim, Markus Schmidt) MA the Modular Aligner Calibrating Seed-Based Heuristics to Map Short Reads With Sesame (Guillaume J. Filion, Ruggero Cortini, Eduard Zorita) — another interesting recent work on seeding methods (though we didn’t get to discuss it in this episode) If you enjoyed this episode, please consider supporting the podcast on Patreon.
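For readers new to fixed-size seeds, the following Python sketch shows k-mers and (w, k)-minimizers. It is an illustration only, not code from the Modular Aligner: the function names are hypothetical, and it assumes plain lexicographic ordering with leftmost tie-breaking, whereas practical tools typically order k-mers by a hash.

```python
def kmers(seq, k):
    # Every substring of length k, paired with its start position.
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def minimizers(seq, k, w):
    # For each window of w consecutive k-mers, keep the smallest k-mer
    # (lexicographic order; the leftmost position wins ties).
    all_kmers = kmers(seq, k)
    chosen = set()
    for start in range(len(all_kmers) - w + 1):
        window = all_kmers[start:start + w]
        pos, kmer = min(window, key=lambda pk: (pk[1], pk[0]))
        chosen.add((pos, kmer))
    return sorted(chosen)


seq = "ACGTACGT"
print(kmers(seq, 3))
# [(0, 'ACG'), (1, 'CGT'), (2, 'GTA'), (3, 'TAC'), (4, 'ACG'), (5, 'CGT')]
print(minimizers(seq, 3, 3))
# [(0, 'ACG'), (1, 'CGT'), (4, 'ACG')]
```

Minimizers keep only a subset of the k-mers while still guaranteeing at least one seed per window, which is what makes them a compact starting point for building variable-size seeds.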

#53 Real-time quantitative proteomics with Devin Schweppe

2020-11-18 01:03:13

In this episode, Jacob Schreiber interviews Devin Schweppe about the analysis of mass spectrometry data in the field of proteomics. They begin by delving into the different types of mass spectrometry methods, including MS1, MS2, and MS3, and the reasons for using each. They then discuss a recent paper from Devin, Full-Featured, Real-Time Database Searching Platform Enables Fast and Accurate Multiplexed Quantitative Proteomics, that involved building a real-time system for quantifying proteomic samples from MS3, and the types of analyses that this system allows one to do. Links: Full-Featured, Real-Time Database Searching Platform Enables Fast and Accurate Multiplexed Quantitative Proteomics (Devin K. Schweppe, Jimmy K. Eng, Qing Yu, Derek Bailey, Ramin Rad, Jose Navarrete-Perea, Edward L. Huttlin, Brian K. Erickson, Joao A. Paulo, and Steven P. Gygi) Benchmarking the Orbitrap Tribrid Eclipse for Next Generation Multiplexed Proteomics (Qing Yu, Joao A Paulo, Jose Navarrete-Perea, Graeme C McAlister, Jesse D Canterbury, Derek J Bailey, Aaron M Robitaille, Romain Huguet, Vlad Zabrouskov, Steven P Gygi, Devin K Schweppe) Improved Monoisotopic Mass Estimation for Deeper Proteome Coverage (Ramin Rad, Jiaming Li, Julian Mintseris, Jeremy O’Connell, Steven P. Gygi, and Devin K. Schweppe) Schweppe Lab Website (Hiring!) If you enjoyed this episode, please consider supporting the podcast on Patreon.

#52 How 23andMe finds identical-by-descent segments with William Freyman

2020-10-27 42:40

In this episode, Will Freyman talks about identity-by-descent (IBD): how it’s used at 23andMe, and how the templated positional Burrows-Wheeler transform can find IBD segments in the presence of genotyping and phasing errors. Links: Fast and robust identity-by-descent inference with the templated positional Burrows-Wheeler transform (William A. Freyman, Kimberly F. McManus, Suyash S. Shringarpure, Ethan M. Jewett, Katarzyna Bryc, the 23andMe Research Team, Adam Auton) 23andMe research If you enjoyed this episode, please consider supporting the podcast on Patreon.

#51 Basset and Basenji with David Kelley

2020-10-07 01:13:58

In this episode, Jacob Schreiber interviews David Kelley about machine learning models that can yield insight into the consequences of mutations on the genome. They begin their discussion by talking about Calico Labs, and then delve into a series of papers that David has written about using models, named Basset and Basenji, that connect genome sequence to functional activity and so can be used to quantify the effect of any mutation. Links: Calico Labs Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks (David R. Kelley, Jasper Snoek, and John Rinn) Sequential regulatory activity prediction across chromosomes with convolutional neural networks (David R. Kelley, Yakir A. Reshef, Maxwell Bileschi, David Belanger, Cory Y. McLean, and Jasper Snoek) Cross-species regulatory sequence activity prediction (David R. Kelley) Basenji GitHub Repo If you enjoyed this episode, please consider supporting the podcast on Patreon.
