{"id":840,"date":"2024-07-04T01:39:57","date_gmt":"2024-07-04T01:39:57","guid":{"rendered":"https:\/\/alex-jimenez.com\/?post_type=rara-portfolio&#038;p=840"},"modified":"2024-08-04T17:49:49","modified_gmt":"2024-08-04T17:49:49","slug":"transcriptomics-cho-alignment-pipeline","status":"publish","type":"rara-portfolio","link":"https:\/\/alex-jimenez.com\/?rara-portfolio=transcriptomics-cho-alignment-pipeline","title":{"rendered":"Transcriptomics CHO Alignment Pipeline"},"content":{"rendered":"\n<h3 class=\"wp-block-heading\">Summary<\/h3>\n\n\n\n<p>Using FastQC, HISAT2, featureCounts, samtools, and Trimmomatic, a pipeline was built to align RNAseq data to the Chinese hamster ovary (CHO) genome. The annotation files were taken from <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/datasets\/genome\/GCF_000223135.1\/\" data-type=\"link\" data-id=\"https:\/\/www.ncbi.nlm.nih.gov\/datasets\/genome\/GCF_000223135.1\/\">NCBI<\/a>. The workflow up to the featureCounts step shows an older annotation file. Note that the workflow is identical for the updated annotation file, but the updated annotation file yields more alignments in the final output <a href=\"https:\/\/hbctraining.github.io\/Intro-to-rnaseq-hpc-O2\/lessons\/05_counting_reads.html\" data-type=\"link\" data-id=\"https:\/\/hbctraining.github.io\/Intro-to-rnaseq-hpc-O2\/lessons\/05_counting_reads.html\">counts matrix.<\/a> <\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Background<\/h3>\n\n\n\n<p>Raw files for the transcriptomic data were taken from NCBI&#8217;s <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/gds\/?term=Chinese+hamster+ovary\" data-type=\"link\" data-id=\"https:\/\/www.ncbi.nlm.nih.gov\/gds\/?term=Chinese+hamster+ovary\">GEO datasets<\/a>. 
The exact steps for aligning the raw RNAseq output are listed below:<\/p>\n\n\n\n<ol>\n<li>Download an <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/sra\" data-type=\"link\" data-id=\"https:\/\/www.ncbi.nlm.nih.gov\/sra\">SRA database file<\/a> (or receive the data from an <a href=\"https:\/\/www.illumina.com\" data-type=\"link\" data-id=\"https:\/\/www.illumina.com\">Illumina<\/a> sequencer)<\/li>\n\n\n\n<li>Use the <a href=\"https:\/\/github.com\/ncbi\/sra-tools\" data-type=\"link\" data-id=\"https:\/\/github.com\/ncbi\/sra-tools\">SRA toolkit<\/a> to convert the SRA database file to a FASTQ file<\/li>\n\n\n\n<li>Trim the FASTQ file using <a href=\"http:\/\/www.usadellab.org\/cms\/?page=trimmomatic\" data-type=\"link\" data-id=\"http:\/\/www.usadellab.org\/cms\/?page=trimmomatic\">Trimmomatic<\/a> and run a quality check using <a href=\"https:\/\/www.bioinformatics.babraham.ac.uk\/projects\/fastqc\/\" data-type=\"link\" data-id=\"https:\/\/www.bioinformatics.babraham.ac.uk\/projects\/fastqc\/\">FastQC<\/a><\/li>\n\n\n\n<li>Use <a href=\"http:\/\/www.htslib.org\" data-type=\"link\" data-id=\"http:\/\/www.htslib.org\">samtools<\/a> to convert the SAM file produced by the alignment to a BAM file<\/li>\n\n\n\n<li>Sort the BAM file using samtools<\/li>\n\n\n\n<li>Download the <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/datasets\/genome\/GCF_003668045.3\/\" data-type=\"link\" data-id=\"https:\/\/www.ncbi.nlm.nih.gov\/datasets\/genome\/GCF_003668045.3\/\">CHO genome<\/a> sequences and annotation files<\/li>\n\n\n\n<li>Create genome indexes using <a href=\"http:\/\/daehwankimlab.github.io\/hisat2\/\" data-type=\"link\" data-id=\"http:\/\/daehwankimlab.github.io\/hisat2\/\">HISAT2<\/a><\/li>\n\n\n\n<li>Get the counts matrix using <a href=\"https:\/\/subread.sourceforge.net\/featureCounts.html\" data-type=\"link\" data-id=\"https:\/\/subread.sourceforge.net\/featureCounts.html\">featureCounts<\/a>, the genome annotation, and the BAM file. 
<\/li>\n\n\n\n<li>Parse out the counts from the TSV file output<\/li>\n<\/ol>\n\n\n\n<p>The workflow below runs from building the genome indexes through the final tab-separated values (TSV) output. The SRA toolkit steps are not included for two reasons: the files were converted before this portfolio was made, and Illumina devices often output FASTQ files directly from sequencing runs.\u00a0<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Building Genome Indexes from the Cricetulus griseus Genome<\/h4>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"395\" src=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/HISAT2-Build-1024x395.png\" alt=\"\" class=\"wp-image-841\" srcset=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/HISAT2-Build-1024x395.png 1024w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/HISAT2-Build-300x116.png 300w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/HISAT2-Build-768x296.png 768w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/HISAT2-Build-1536x592.png 1536w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/HISAT2-Build-156x60.png 156w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/HISAT2-Build.png 1680w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 1<\/strong>: Building Genome Indexes Using HISAT2<\/figcaption><\/figure><\/div>\n\n\n<p>The final output of this command is eight genome index files. 
These genome indexes are required for HISAT2 to align the FASTQ files.\u00a0<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img decoding=\"async\" width=\"1024\" height=\"69\" src=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Genome-Indexes-1024x69.png\" alt=\"\" class=\"wp-image-842\" srcset=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Genome-Indexes-1024x69.png 1024w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Genome-Indexes-300x20.png 300w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Genome-Indexes-768x52.png 768w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Genome-Indexes-600x40.png 600w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Genome-Indexes.png 1458w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 2<\/strong>: Genome Indexes Output Format<\/figcaption><\/figure><\/div>\n\n\n<h4 class=\"wp-block-heading\">HISAT2 for Aligning FASTQ Files to the Cricetulus griseus Genome<\/h4>\n\n\n\n<p>It&#8217;s crucial to understand the protocol under which the reads were generated. 
Since this data was not generated by me, I&#8217;ll refer to the official sequence read archive for my\u00a0<a href=\"https:\/\/www.ncbi.nlm.nih.gov\/sra\/?term=SRR15221365\" target=\"_blank\" rel=\"noreferrer noopener\">data<\/a>.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" width=\"902\" height=\"254\" src=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Protocol-of-SRA-file.png\" alt=\"\" class=\"wp-image-844\" srcset=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Protocol-of-SRA-file.png 902w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Protocol-of-SRA-file-300x84.png 300w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Protocol-of-SRA-file-768x216.png 768w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Protocol-of-SRA-file-213x60.png 213w\" sizes=\"(max-width: 902px) 100vw, 902px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 3<\/strong>: RNA-Seq Data Summary<\/figcaption><\/figure><\/div>\n\n\n<p>Given the data is paired and cDNA, the following command is appropriate for aligning the reads. 
Note that if you do run a command that doesn&#8217;t take into account the paired nature of this data, you&#8217;ll get an error or close to 0% alignment.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"35\" src=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Screenshot-2024-07-02-at-5.30.44-PM-1024x35.png\" alt=\"\" class=\"wp-image-845\" style=\"width:848px;height:auto\" srcset=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Screenshot-2024-07-02-at-5.30.44-PM-1024x35.png 1024w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Screenshot-2024-07-02-at-5.30.44-PM-300x10.png 300w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Screenshot-2024-07-02-at-5.30.44-PM-768x27.png 768w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Screenshot-2024-07-02-at-5.30.44-PM-1536x53.png 1536w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Screenshot-2024-07-02-at-5.30.44-PM-600x21.png 600w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Screenshot-2024-07-02-at-5.30.44-PM.png 1908w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 4<\/strong>: HISAT2 Genome Alignment Command<\/figcaption><\/figure><\/div>\n\n\n<p>Here the output file was designated as a SAM file. Given the large size of these files, a BAM file, the compressed equivalent of a SAM, is often a better choice, especially in a cloud computing environment to minimize storage costs. Storage costs can be significant for a company constantly running its sequencer. 
The summary report below shows that the alignment was successful.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"386\" src=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Alignment-Summary-1024x386.png\" alt=\"\" class=\"wp-image-846\" style=\"width:814px;height:auto\" srcset=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Alignment-Summary-1024x386.png 1024w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Alignment-Summary-300x113.png 300w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Alignment-Summary-768x289.png 768w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Alignment-Summary-159x60.png 159w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Alignment-Summary.png 1226w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 5<\/strong>: FASTQ-Genome Alignment Summary<\/figcaption><\/figure><\/div>\n\n\n<h4 class=\"wp-block-heading\">Sorting the SAM\/BAM files<\/h4>\n\n\n\n<p>Sorting the SAM\/BAM file mainly improves the efficiency and speed of downstream tasks such as featureCounts and variant calling.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"42\" src=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Sort-SAM-files-1024x42.png\" alt=\"\" class=\"wp-image-847\" style=\"width:829px;height:auto\" srcset=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Sort-SAM-files-1024x42.png 1024w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Sort-SAM-files-300x12.png 300w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Sort-SAM-files-768x31.png 768w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Sort-SAM-files-600x24.png 600w, 
https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/Sort-SAM-files.png 1476w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 6:<\/strong> Executing File Sorting With Samtools<\/figcaption><\/figure><\/div>\n\n\n<h4 class=\"wp-block-heading\">featureCounts for Creating the Counts Matrix<\/h4>\n\n\n\n<p>The following is a featureCounts summary generated with this <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/datasets\/genome\/GCF_003668045.3\/\" data-type=\"link\" data-id=\"https:\/\/www.ncbi.nlm.nih.gov\/datasets\/genome\/GCF_003668045.3\/\">genome assembly submission<\/a> from NCBI. <\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"517\" src=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/featureCounts-Summary-1024x517.png\" alt=\"\" class=\"wp-image-848\" style=\"width:840px;height:auto\" srcset=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/featureCounts-Summary-1024x517.png 1024w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/featureCounts-Summary-300x151.png 300w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/featureCounts-Summary-768x388.png 768w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/featureCounts-Summary-119x60.png 119w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/07\/featureCounts-Summary.png 1292w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 7<\/strong>: featureCounts Summary of Gene Alignments<\/figcaption><\/figure><\/div>\n\n\n<h4 class=\"wp-block-heading\">High-Level Programming Language (Python) for Further Analysis<\/h4>\n\n\n\n<p>Once the final counts matrix has been created, a higher-level programming language such as R or Python can be used for further analysis. 
R is a popular programming language among biologists, but I prefer Python for its broad package ecosystem across domains. Python lags slightly behind R in the biotechnology world, but not by much thanks to its vast community. Some popular packages for analyzing RNAseq data include Scanpy, GSEApy, and general data science packages such as scikit-learn.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Future Improvements<\/h4>\n\n\n\n<p>If the protocol for read generation does not change, e.g. in an in-house workflow for a pharmaceutical company, then the workflow can be automated using\u00a0<a href=\"https:\/\/www.ninjaone.com\/blog\/what-is-bash-scripting\/\" target=\"_blank\" rel=\"noreferrer noopener\">bash scripting<\/a>\u00a0and a higher-level programming language like Python. If Python is the language of choice, one can leverage the\u00a0<a href=\"https:\/\/docs.python.org\/3\/library\/subprocess.html\" target=\"_blank\" rel=\"noreferrer noopener\">subprocess module<\/a>\u00a0to invoke the bash scripts directly. Using AWS, this workflow could be further automated with the following services so that any data generated by a sequencer could be processed seamlessly.<\/p>\n\n\n\n<ol>\n<li>AWS S3 Storage Bucket for Raw Data and Event Notifications for Data Updates<\/li>\n\n\n\n<li>AWS ECS to Create a Docker App to Execute Bash Script with Python Front End for AWS RDS connection.<\/li>\n\n\n\n<li>AWS Relational Database Service (RDS) for final storage of output data. <\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>Summary Using fastqc, HISAT2, featureCounts, samtools, and Trimmomatic, an alignment pipeline was made for RNAseq data to the Chinese hamster ovary (CHO) genome. The annotation files were taken from NCBI. The workflow until the featureCounts step shows an older annotation file. 
Note that the workflow is identical for the updated annotation file, but the updated &hellip; <\/p>\n","protected":false},"author":1,"featured_media":849,"comment_status":"open","ping_status":"closed","template":"","rara_portfolio_categories":[10,12],"_links":{"self":[{"href":"https:\/\/alex-jimenez.com\/index.php?rest_route=\/wp\/v2\/rara-portfolio\/840"}],"collection":[{"href":"https:\/\/alex-jimenez.com\/index.php?rest_route=\/wp\/v2\/rara-portfolio"}],"about":[{"href":"https:\/\/alex-jimenez.com\/index.php?rest_route=\/wp\/v2\/types\/rara-portfolio"}],"author":[{"embeddable":true,"href":"https:\/\/alex-jimenez.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/alex-jimenez.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=840"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/alex-jimenez.com\/index.php?rest_route=\/wp\/v2\/media\/849"}],"wp:attachment":[{"href":"https:\/\/alex-jimenez.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=840"}],"wp:term":[{"taxonomy":"rara_portfolio_categories","embeddable":true,"href":"https:\/\/alex-jimenez.com\/index.php?rest_route=%2Fwp%2Fv2%2Frara_portfolio_categories&post=840"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}