First Author Explained | The Story Behind Our Nature Communications Paper

On July 16, 2024, with the height of summer approaching and autumn not far behind, the article on our Python-based framework for comprehensive transcriptomics analysis, "OmicVerse: a framework for bridging and deepening insights across bulk and single-cell sequencing", was officially published in Nature Communications. This is the first milestone achievement of our group, and also my first first-author article in a Nature-series journal. Those who are interested in our work can check the [original link] at the end of the article.

I released the first version of OmicVerse on the preprint server bioRxiv in July last year, and wrote some general introductions to it on public accounts such as "Bioinformatics Skill Tree". Here, I would like to write about the story behind the development of OmicVerse and the vision of the framework.

Overview of OmicVerse components

Framework Origins

I first came into contact with bioinformatics around my sophomore year. Single-cell sequencing had just emerged, bioinformatics was still a blue ocean, and at that time a research group would treat you as a treasure if you knew how to analyze an RNA-seq dataset. I count as half formally trained: thanks to my double degree in computer science, I did not face much of a barrier when I started bioinformatics. Calling packages and running analyses all came in one smooth motion.

But I soon ran into my first problem: I was unfamiliar with the R language, so I could not fix dependencies by hand when they broke, and R's syntax is not a friendly coding style for programmers who come from C++. I also tried tidyverse, but finally gave up. After reading about the statistical models used in transcriptome analysis, an idea started to form: "Why don't I reproduce the algorithms for transcriptome analysis in Python myself?"

Paul Pavlidis, UBC

No sooner said than done. I quickly applied statistical methods such as the t-test and fold change to RNA-seq data, and used matplotlib (Python's plotting package) for the visualization. Then, for convenience, I taught myself how to build Python packages and wrapped all of these functions into a single package called Pyomic. By this time I was in my junior year of college.

Meanwhile, while working on transcriptome analysis, I wrote my first algorithm, ERgene, a fast algorithm for calculating endogenous reference genes. It was first used to solve the problem of quantifying reference genes in the proteome, where the common reference genes are sometimes not detected by mass spectrometry. Thanks to the encouragement of my teacher, Hongwu Du, I packaged and tested the algorithm and published it in Scientific Reports. I am still fond of this algorithm to this day, because for the benchmark I ran Excel 2003 in a Windows 7 virtual machine on my computer just to run some of the old algorithms.
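As a rough illustration of the kind of statistics mentioned above (a t-test and a fold change computed per gene), here is a minimal sketch. It is not Pyomic's actual code: the function names, the pseudocount, the choice of Welch's variant of the t-test, and the expression values are all assumptions made for illustration, and it stops at the t statistic rather than computing a p-value.

```python
import math
from statistics import mean, variance


def log2_fold_change(treated, control, pseudocount=1.0):
    """log2 ratio of mean expression; the pseudocount (an assumption) avoids log(0)."""
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


def welch_t_statistic(treated, control):
    """Welch's t statistic for two groups with possibly unequal variances."""
    standard_error = math.sqrt(
        variance(treated) / len(treated) + variance(control) / len(control)
    )
    return (mean(treated) - mean(control)) / standard_error


# Hypothetical expression values: two genes, three replicates per group.
expression = {
    "geneA": {"treated": [29.0, 31.0, 33.0], "control": [6.0, 7.0, 8.0]},
    "geneB": {"treated": [10.0, 12.0, 14.0], "control": [11.0, 12.0, 13.0]},
}

# Compute both statistics for every gene.
results = {
    gene: (
        log2_fold_change(groups["treated"], groups["control"]),
        welch_t_statistic(groups["treated"], groups["control"]),
    )
    for gene, groups in expression.items()
}

for gene, (lfc, t_stat) in results.items():
    print(f"{gene}: log2FC={lfc:.2f}, t={t_stat:.2f}")
```

In this toy data, geneA shows a large fold change and a large t statistic, while geneB shows neither. A real analysis would also need normalization, p-values, and multiple-testing correction.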

My grades were decent and I secured guaranteed graduate admission at our school, so I went around interviewing and talking to teachers. I remember two things:

The first happened during an interview at Peking University. At the time I simply thought the MIT license was a good one, so Pyomic was released under MIT as well. During the interview, Mr. Gao Ge pointed this out directly and asked me twice whether I had written all the algorithms in Pyomic myself. I had included an external dynamicTree module, which is the core algorithm of WGCNA, but I felt that the whole analysis had been implemented by me. So I was taken to task and asked why I had relicensed other people's GPL code as MIT. Since then, I have had a much clearer understanding of copyright.
The second happened during an interview at Sun Yat-sen University, when a PI asked me, "Do you think Pyomic has any meaning? Isn't it just reproducing other people's algorithms?" Today I have a good answer to this question: we have laid down a standard for transcriptomics analysis in Python and unified all kinds of interfaces, which is a very worthwhile thing to do.

Formalization of OmicVerse

The name OmicVerse was decided in April 2023, when I was in my first year of graduate study. At that time I was working on single-cell transcriptome data, and two problems kept annoying me: conflicting dependencies during package installation, and wildly different inputs and outputs across packages. Both seriously slowed down my analysis. I wondered whether I could standardize these algorithms so that they would install without errors and accept input in a single format. So I started studying the inputs and outputs behind each algorithm. As I wrote, I wrapped a large number of algorithms, and since it was convenient, I put them all into the Pyomic package.

Inspired by scvi-tools, a deep learning modeling framework for single-cell sequencing, I wanted to build a more comprehensive Python-based framework for single-cell analysis, and I felt that the name Pyomic no longer reflected what I was trying to do. After looking up possible English names several times without success, I went for a walk one night, looked at the stars in the sky, and suddenly thought, "Why don't I just use the ending 'Verse'?" And so OmicVerse was born. Within a month, all the algorithms were wrapped and the manuscript was written in the form of a Brief Communication.

The OmicVerse logo

But I had not submitted it yet, and I wanted to ask around before doing so. So I wrote to the editors of Nature Biotechnology and Nature Methods to ask about their interest in the article, which is called a pre-submission inquiry. The Methods editor was happy to let us submit, but the Biotechnology editor all but wrote the word "Rubbish" in the reply. I was determined to find out what I was missing, and after repeated inquiries she told me that the work had no biological insight. Sometimes inspiration strikes in a single moment: since I had wrapped both bulk RNA-seq and single-cell RNA-seq, what I was missing was a bridge. I was missing a bridging algorithm! But when I checked, Bulk2single had already been done, and single2bulk even more so. Interestingly, I had previously developed an algorithm for trajectory extrapolation that performed poorly on some datasets, so I thought: I can use bulk data to interpolate trajectories! That was the prototype of BulkTrajBlend. The series of improvements that followed is a story for another time.

So I was elated and submitted the manuscript through the Nature Methods submission system. After a month of waiting, I was rejected and advised to transfer to Nature Communications (NC). There was nothing I could say. NC it was; after all, I did not yet have a paper from my PhD. Then in March this year, I saw that Methods had published tidyomics. I have never understood in what way OmicVerse falls short of it.

Comparison of different methods

OmicVerse Grows

Some time earlier, apparently while asking a question, I had added the author of the public account Python Biology People as a contact. I had posted the preprint before OmicVerse was published, and I wanted to promote it to see how users would respond. So I asked Python Biology People to write a post with the "Tsinghua University" tag, but it seemed to reach a very small audience.

I thought, "This shouldn't be happening. The framework works so well for me!" So I asked Mr. Jianming to post an article on "Skill Tree". To my surprise, Mr. Jianming liked the framework a great deal and gave me enormous encouragement. He promoted it with keywords such as "Python", "homegrown", and "comprehensive", and later set up a community as well. From then on OmicVerse began to be known, but without a published article most people probably took a wait-and-see attitude. So much so that the first citation did not arrive until April 2024.

I also believe that a good algorithm, let alone a framework, cannot go without a comprehensive tutorial. So I wrote my user documentation in very, very fine detail, a habit I picked up after suffering through algorithms whose documentation explained almost nothing. The more detail you write, the more convenient it is for others, and the more willing people are to use your work.

I think one of the wisest things I have ever done was to put together the Chinese edition of the single-cell best-practices tutorial. The tutorial is derived from the Fabian Theis group's single-cell best practices, but I added a lot of my own understanding and new algorithms on top of the original. The new algorithms use OmicVerse as a unified interface, and I also developed a large number of attractive visualization functions, so the number of users has slowly grown. I think what nourishes the open-source spirit is positive feedback from users. It is a labor of love: when everyone is happy, I have the motivation to iterate on the next version. After a year, OmicVerse is now at version 1.6.4.

Postscript

It took exactly one year from the time I submitted the manuscript to the time it was officially accepted. In fact, if one of the reviewers had not gone out of contact, it might have been accepted over the winter break. That, of course, was beyond anyone's control, so I can only say that being published in the summer was an accident of timing. More algorithms and frameworks, such as OmicVerse2 and OmicFate, will follow in the future.

Special thanks to Ms. Xiong Yuanyan, whom I met during the summer camp interview at Sun Yat-sen University. You are very welcome to apply to her group: she is a very, very open person, and the atmosphere in the group is quite good. I consider her my first real mentor in bioinformatics. I would also like to thank Mr. Jianming, whom you may already know. If he had not given me such a platform and publicized the framework for me, I think OmicVerse would still have very few users. Interestingly, Mr. Jianming and I were not on familiar terms at first. I have forgotten the exact reason, but it seems I once criticized the teacher over something and disliked him very much.

Finally, you may wonder what kind of person my supervisor is, to allow me almost complete freedom and independence to explore the subjects I want to pursue, without any restrictions, and to offer guidance when appropriate. My supervisor, Prof. Du Hongwu, is just such an interesting person. In our research group your interests are fully respected: as long as you want to do something and you are motivated, it is feasible.

Summer vacation is here. Have a nice vacation.

Author

Zehua Zeng, Ph.D. in Chemistry from University of Science and Technology Beijing (USTB). His main research interests include the development of single-cell multi-omics algorithms and the regulation of gene dynamics in cancer and developmental biology. During his PhD, as first author, he developed the transcriptomics ecosystem framework OmicVerse, the cell fate factor inference algorithm OmicFate, and the single-cell trajectory recovery algorithm BulkTrajBlend. He has published several research papers as first or corresponding (including co-) author in journals such as Nature Communications and Advanced Functional Materials, with a total impact factor of 100+.