Primary and Secondary Analyses
These steps will take the user through comparison of differential expression methods, determining differential expression, and determining the appropriate number of clusters of temporally differentially expressed genes.
Primary Analysis
Produce Clustermap
To produce a clustermap of the data the user must follow these commands:
Parameters:
Rscript app/scripts/clustermap_script.r
Error: Pass in 7 arguments: Rscript clustermap.r
1)/FULL/PATH/TO/RESULTS_DIR/
2)/FULL/PATH/TO/HEATMAP_INPUT_FILE
3)DESIRED_OUTPUT_NAME (e.g. insulin_test)
4)CLOSE_TIMEPOINT (1: close timepoint, 0: distant timepoint)
5)USER_CHOOSE_CLUST_NUMBER (if 0 TIMEOR finds optimal, else type desired number of clusters)
6)DIST_METHOD (e.g. euclidean)
7)CLUSTER_METHOD (e.g. ward.D2)
Command:
./run_HTSeq.sh ~/Desktop/test_timeor_load/timeor/results/load/alignment/bowtie2/
Outputs:
A htseq_counts
text file containing gene\tcount
for all genes in the organism genome is saved in each ~/Desktop/test_timeor_load/timeor/results/load/sample/replicate/
folder.
Long output
RESULTS_DIR: /Users/ashleymaeconard/Desktop/test_timeor_load/timeor/results/load/count_matrix/htseq/ Number of input files: 2 Processing .fastq files with HTSeq on 1 processors. Subdir: /Users/ashleymaeconard/Desktop/test_timeor_load/timeor/results/load/alignment/bowtie2//baseline_baseline/SRR8843738 Creating read counts from aligned reads. fastq /Users/ashleymaeconard/Desktop/test_timeor_load/timeor/results/load/alignment/bowtie2//baseline_baseline/SRR8843738/out.sorted.bam replicateFolder: SRR8843738 sampleFolder: baseline_baseline Adding out.sorted.bam to run_HTSeq.txt script Subdir: /Users/ashleymaeconard/Desktop/test_timeor_load/timeor/results/load/alignment/bowtie2//insulin_40/SRR8843750 Creating read counts from aligned reads. fastq /Users/ashleymaeconard/Desktop/test_timeor_load/timeor/results/load/alignment/bowtie2//insulin_40/SRR8843750/out.sorted.bam replicateFolder: SRR8843750 sampleFolder: insulin_40 Adding out.sorted.bam to run_HTSeq.txt script 100000 GFF lines processed. 200000 GFF lines processed. 300000 GFF lines processed. 400000 GFF lines processed. 500000 GFF lines processed. ... 7000000 SAM alignment records processed. 7100000 SAM alignment records processed. 7200000 SAM alignment records processed. 7300000 SAM alignment records processed. 7400000 SAM alignment records processed. 7500000 SAM alignment records processed. 7600000 SAM alignment records processed. 7700000 SAM alignment records processed. 7800000 SAM alignment records processed. 7865176 SAM alignments processed.
Next towards number 2, we merge all replicate outputs from HTSeq into one gene by replicate transcript count matrix.
Parameters:
Usage: python htseq_merge.py
1) /FULL/PATH/TO/htseq/ (with subfolders e.g. ../htseq/SAMPLE/REP/htseq_counts stop at htseq/)
Command:
python htseq_merge.py ~/Desktop/test_timeor_load/timeor/results/load/count_matrix/htseq/
Outputs:
The output matrix merged_htseq.csv
is placed in the output folder provided.
Long output
Done
Secondary Analysis
These steps will take the user through the four cateogories of secondary analysis (see TIMEOR publication for details). Under Enrichment first is Gene Ontology (GO) analysis. With your folders of clusters ./timeor_example/test_results/timeor/results/primary/insulin_stim_results/clusters/
Enrichment: GO and Pathway Analysis
Parameters:
Rscript clusterProfiler_pathview.r
Error: Type: Rscript clusterProfiler.r
1) /PATH/TO/INPUT_OUTPUT_DIR/ (e.g. /timeor/results/primary/)
2) OVERLAP_NOT_EXPERIMENTS (set to 1 to run overlap comparison, 0 otherwise)
3) SEPARATE_TIMEPOINTS (set to 1 to run GO for each timepoint separately, 0 otherwise)
4) ORGANISM (dme, hsa, mmu)
5) ADJ_PVAL (recommend 0.05)
Command:
Rscript clusterProfiler_pathview.r ./timeor_example/test_timeor/timeor/results/primary/ 0 0 dme 0.05
Outputs:
Three folders BP/
, CC/
, and MF/
will appear in each cluster/
subfolder (e.g. cluster/1/
), as well as Pathview output files dme<PATHWAYNUMBER>_pathview.png
(highlighting the perturbed genes) and dme<PATHWAYNUMBER>.png
(no highlighting in pathway). Inside each folder such as BP/
will be GO analysis results (both spreadsheets and PDF images).
Long output
Passed in: /Users/ashleymaeconard/Desktop/test_results/ 0 0 dme 0.05 Registered S3 method overwritten by 'openssl': method from print.bytes Rcpp Loading required package: usethis Skipping install of 'bitr' from a github remote, the SHA1 (246358c3) has not changed since last install. Use `force = TRUE` to force installation Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.1 (2019-07-05) Installing package(s) 'clusterProfiler' trying URL 'https://bioconductor.org/packages/3.10/bioc/src/contrib/clusterProfiler_3.14.3.tar.gz' Content type 'application/x-gzip' length 2634460 bytes (2.5 MB) ================================================== downloaded 2.5 MB * installing *source* package ‘clusterProfiler’ ... ** using staged installation ** R ** data ** inst ** byte-compile and prepare package for lazy loading ** help *** installing help indices ** building package indices ** installing vignettes ** testing if installed package can be loaded from temporary location ** testing if installed package can be loaded from final location ** testing if installed package keeps a record of temporary installation path * DONE (clusterProfiler) The downloaded source packages are in ‘/private/var/folders/yj/0tv0rdpj1wl59w2sp889m4640000gn/T/RtmpwcisqP/downloaded_packages’ Old packages: 'chron', 'dplyr', 'RcppArmadillo', 'shinyjqui', 'callr', 'DT', 'FactoMineR', 'fpc', 'fs', 'git2r', 'hexbin', 'httpuv', 'igraph', 'jsonlite', 'mime', 'mongolite', 'openssl', 'processx', 'RCurl', 'RJSONIO', 'rlang', 'shiny', 'stringi', 'tidyr', 'tidyselect', 'TSP', 'usethis', 'XML', 'yaml', 'zip' Registered S3 method overwritten by 'enrichplot': method from fortify.enrichResult DOSE clusterProfiler v3.14.3 For help: https://guangchuangyu.github.io/software/clusterProfiler If you use clusterProfiler in published research, please cite: Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287. Loading required package: AnnotationDbi Loading required package: stats4 Loading required package: BiocGenerics Loading required package: parallel Attaching package: ‘BiocGenerics’ The following objects are masked from ‘package:parallel’: clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB The following objects are masked from ‘package:stats’: IQR, mad, sd, var, xtabs The following objects are masked from ‘package:base’: anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which, which.max, which.min Loading required package: Biobase Welcome to Bioconductor Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'. Loading required package: IRanges Loading required package: S4Vectors Attaching package: ‘S4Vectors’ The following object is masked from ‘package:base’: expand.grid Warning messages: 1: package ‘AnnotationDbi’ was built under R version 3.6.2 2: package ‘BiocGenerics’ was built under R version 3.6.2 3: package ‘Biobase’ was built under R version 3.6.2 4: package ‘IRanges’ was built under R version 3.6.2 5: package ‘S4Vectors’ was built under R version 3.6.2 Loading required package: org.Hs.eg.db ############################################################################## Pathview is an open source software package distributed under GNU General Public License version 3 (GPLv3). Details of GPLv3 is available at http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to formally cite the original Pathview paper (not just mention it) in publications or products. For details, do citation("pathview") within R. The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG license agreement (details at http://www.kegg.jp/kegg/legal.html). ############################################################################## ******************************************************** Note: As of version 1.0.0, cowplot does not change the default ggplot2 theme anymore. To recover the previous behavior, execute: theme_set(theme_cowplot()) ******************************************************** c("Lobe", "mu2", "ncd", "RnrS", "Aldh", "IKKbeta", "CG3164", "Gmap", "cindr", "CG4239", "RhoGAP15B", "Hrs", "LRR", "CG8777", "Etf-QO", "CG8613", "CG3386", "CG4080", "CG5059", "CG10365", "CG32772", "Cap-G", "whd", "mv", "drongo") 'select()' returned 1:1 mapping between keys and columns [1] "Processing 1 for insulin_stim_results" [1] "Determining BP enrichment for insulin_stim_results" [1] "Plotting BP enrichment for insulin_stim_results" [1] "BP is empty" [1] "Processing 1 for insulin_stim_results" [1] "Determining MF enrichment for insulin_stim_results" [1] "Plotting MF enrichment for insulin_stim_results" [1] "MF is empty" [1] "Processing 1 for insulin_stim_results" [1] "Determining CC enrichment for insulin_stim_results" [1] "Plotting CC enrichment for insulin_stim_results" [1] "CC is empty" c("Men", "mod", "Fib", "r-l", "Prat", "Dbp45A", "ppan", "peng", "RpI1", "Rtc1", "Nnp-1", "l(2)09851", "Nmd3", "CG11417", "nop5", "l(1)G0004", "CG5033", "NHP2", "CG15766", "CG1785", "CG2691", "CG9281", "CG8939", "Aatf", "CG13096", "CG13097", "Csl4", "CG6565", "CG12050", "CG9253", "Mys45A", "CG12909", "CG12325", "CG8545", "CG9143", "CG11180", "CG4554", "CG11583", "CG15019", "CG10576", "CG8368", "CG5114", "CG12301", "CG7728", "CG5589", "eRF1", "CG10565", "CG12975", "Tsr1", "Nopp140", "srl", "CG10286", "CG1234", "CG9799", "CG6231", "Surf6", "Nop56", "CG12499", "CG5728", "CG7006", "CG11089", "RIOK2", "CG1542", "CG1607", "CG11076", "bor", "CG13773", "CG30349", "CG32344", "CG32409", "CG32732", "Uhg2", "kra", "Nop60B", "GLS", "bel", "nclb", "pit", "CR45214") 'select()' returned 1:1 mapping between keys and columns [1] "Processing 6 for insulin_stim_results" [1] "Determining BP enrichment for insulin_stim_results" [1] "enr_term ribosome biogenesis" [1] "Assessing pathway enrichment." [1] "pathwayID 03008" Info: Getting gene ID data from KEGG... Info: Done with data retrieval! Info: Downloading xml files for dme03008, 1/1 pathways.. Info: Downloading png files for dme03008, 1/1 pathways.. Info: Working in directory /Users/ashleymaeconard/Documents/LL_labs/timeor-shiny-app-v1/scripts/secondary Info: Writing image file dme03008.6_pathview.png [1] "Pathway found TRUE" [1] "Plotting BP enrichment for insulin_stim_results" /Users/ashleymaeconard/Desktop/test_results//insulin_stim_results/clusters//6//BP subdirectory exists.Loading required package: topGO Loading required package: graph Loading required package: GO.db Loading required package: SparseM Attaching package: ‘SparseM’ The following object is masked from ‘package:base’: backsolve groupGOTerms: GOBPTerm, GOMFTerm, GOCCTerm environments built. Attaching package: ‘topGO’ The following object is masked from ‘package:IRanges’: members groupGOTerms: GOBPTerm, GOMFTerm, GOCCTerm environments built. Building most specific GOs ..... ( 682 GO terms found. ) Build GO DAG topology .......... ( 682 GO terms and 1420 relations. ) Annotating nodes ............... ( 11228 genes annotated to the GO terms. ) Loading required package: Rgraphviz Loading required package: grid Attaching package: ‘grid’ The following object is masked from ‘package:topGO’: depth Attaching package: ‘Rgraphviz’ The following objects are masked from ‘package:IRanges’: from, to The following objects are masked from ‘package:S4Vectors’: from, to $dag A graphNEL graph with directed edges Number of Nodes = 35 Number of Edges = 54 $complete.dag [1] "A graph with 35 nodes." [1] "Processing 6 for insulin_stim_results" [1] "Determining MF enrichment for insulin_stim_results" [1] "Pathway already enrichment found." [1] "Plotting MF enrichment for insulin_stim_results" /Users/ashleymaeconard/Desktop/test_results//insulin_stim_results/clusters//6//MF subdirectory exists. groupGOTerms: GOBPTerm, GOMFTerm, GOCCTerm environments built. Building most specific GOs ..... ( 150 GO terms found. ) Build GO DAG topology .......... ( 150 GO terms and 186 relations. ) Annotating nodes ............... ( 10922 genes annotated to the GO terms. ) $dag A graphNEL graph with directed edges Number of Nodes = 23 Number of Edges = 27 $complete.dag [1] "A graph with 23 nodes." [1] "Processing 6 for insulin_stim_results" [1] "Determining CC enrichment for insulin_stim_results" [1] "Pathway already enrichment found." [1] "Plotting CC enrichment for insulin_stim_results" /Users/ashleymaeconard/Desktop/test_results//insulin_stim_results/clusters//6//CC subdirectory exists. groupGOTerms: GOBPTerm, GOMFTerm, GOCCTerm environments built. Building most specific GOs ..... ( 97 GO terms found. ) Build GO DAG topology .......... ( 97 GO terms and 158 relations. ) Annotating nodes ............... ( 10701 genes annotated to the GO terms. ) $dag A graphNEL graph with directed edges Number of Nodes = 29 Number of Edges = 44 $complete.dag [1] "A graph with 29 nodes." Warning messages: 1: In bitr(geneID = xs, fromType = "SYMBOL", toType = c("ENSEMBL", : 8% of input gene IDs are fail to map... 2: In bitr(geneID = xs, fromType = "SYMBOL", toType = c("ENSEMBL", : 10.13% of input gene IDs are fail to map...
Enrichment: Network Identification
Parameters:
Type: Rscript stringdb.r
1) /PATH/TO/INPUT_OUTPUT_DIR/ (e.g. ./timeor_example/test_timeor/timeor/results/primary/)
2) ORGANISM_NCBI_ID (# NCBI Taxonomy: 7227 (Drosophila melanogaster), 9606 (Homo sapiens), 10090 (Mus musculus)
Command:
Rscript stringdb.r ./timeor_example/test_timeor/timeor/results/primary/ 7227
Outputs:
A network stringdb_network.pdf
and info table stringdb_info_table.csv
are output in each cluster
subfolder (e.g. cluster/1/
).
Long output
Passed in: /Users/ashleymaeconard/Desktop/test_results/ 7227 Attaching package: ‘dplyr’ The following objects are masked from ‘package:stats’: filter, lag The following objects are masked from ‘package:base’: intersect, setdiff, setequal, union trying URL 'http://string.uzh.ch/permanent/string/10/proteins/7227__proteins.tsv.gz' Content type 'application/x-gzip' length 648541 bytes (633 KB) ================================================== downloaded 633 KB [1] "Processing /Users/ashleymaeconard/Desktop/test_results//insulin_stim_results/clusters//1//insulin_stim_cluster_zs_geneList_1.csv" gene change 1 Lobe 1.7735859 2 mu2 1.2568997 3 ncd 1.9445905 4 RnrS 1.4904910 5 Aldh 0.9184477 6 IKKbeta 1.6135135 c("Lobe", "mu2", "ncd", "RnrS", "Aldh", "IKKbeta", "CG3164", "Gmap", "cindr", "CG4239", "RhoGAP15B", "Hrs", "LRR", "CG8777", "Etf-QO", "CG8613", "CG3386", "CG4080", "CG5059", "CG10365", "CG32772", "Cap-G", "whd", "mv", "drongo") trying URL 'http://string.uzh.ch/permanent/string/10/protein_aliases/7227__protein_aliases_tf.tsv.gz' Content type 'application/x-gzip' length 4062637 bytes (3.9 MB) ================================================== downloaded 3.9 MB Warning: we couldn't map to STRING 4% of your identifierstrying URL 'http://string.uzh.ch/permanent/string/10/protein_links/7227__protein_links.tsv.gz' Content type 'application/x-gzip' length 18188738 bytes (17.3 MB) ================================================== downloaded 17.3 MB [1] "Saved STRINGdb network and info_tables to /Users/ashleymaeconard/Desktop/test_results//insulin_stim_results/clusters//1/" There were 41 warnings (use warnings() to see them)
Use TF information: de novo Motif Finding
Under Use TF information for de novo motif finding with MEME, we first need to reformat the genes.gtf
file holding gene information.
Parameters:
Type: python reformat_genes_gtf.py
1) /PATH/TO/genes.gtf (e.g. ./timeor-shiny-app-v1/genomes_info/dm6/genes.gtf)
2) /PATH/TO/OUTPUT_DIR/
Command:
python reformat_genes_gtf.py ./timeor-shiny-app-v1/genomes_info/dm6/genes.gtf ./timeor_example/
reformatted_genes_gtf.csv
is placed in the output folder specified with all necessary columns for DNA sequence extraction for MEME.
Long output
("Example of 'allInfo' column in df_genes_gtf['allInfo'] that I will parse into these columns: \n\n", ['gene_name', 'gene_biotype', 'gene_id', 'transcript_name', 'transcript_id', 'tss_id'], '\n\n', 'exon_id "FBtr0114187-E1"; exon_number "1"; exon_version "1"; gene_biotype "rRNA"; gene_id "FBgn0085737"; gene_name "CR40502"; gene_source "FlyBase"; gene_version "1"; transcript_biotype "rRNA"; transcript_id "FBtr0114187"; transcript_name "CR40502-RA"; transcript_source "FlyBase"; transcript_version "1"; tss_id "TSS6218"') ('Number of unique genes based on FlyBase ID from ', '/Users/ashleymaeconard/Documents/LL_labs/timeor-shiny-app-v1/genomes_info/dm6/genes.gtf', ': ', 17558)
Next we prepare input files for MEME.
Parameters:
Type: python meme_prep.py
1) /PATH/TO/reformatted_genes_gtf.csv (e.g. ./timeor_example/reformatted_genes_gtf.csv)
2) /PATH/TO/INPUT_OUTPUT_DIR/ (e.g. ./timeor_example/test_timeor/timeor/results/primary/)
3) /PATH/TO/CHROM_FAs (e.g. ./timeor-shiny-app-v1/genomes_info/dm6/dm6.fa)
4) TSS_only (set to 1 to run +-1kb from transcription start site, 0 otherwise)
5) GENOME (dm6, hsa, or mmu)
6) CASE_CONTROL_RESULT_LIST (e.g. insulin_stim_m,insulin_stim_f)
Command:
python meme_prep.py ./timeor_example/reformatted_genes_gtf.csv ./timeor_example/test_timeor/timeor/results/primary/ ./timeor-shiny-app-v1/genomes_info/dm6/dm6.fa 1 dm6 insulin_stim
Outputs:
The output insulin_stim_cluster_zs_geneList_<CLUSTERNUMBER>_DNAseqs.txt
and insulin_stim_cluster_zs_geneList_<CLUSTERNUMBER>.bed
are placed in each cluster/
subfolder to be accessed by MEME.
And lastly we run MEME with meme $geneList -dna -mod anr -oc $OUTPUT_DIR/
and search for any number of repeat occurrences of a motif (as defined by -mod anr). The original objective function (i.e. classic) is used where motifs are scored using an E-value approximation of the motif information.
Parameters:
Usage: ./run_MEME.sh
1) /PATH/TO/INPUT_DIR/ (e.g. ./test_timeor/timeor/results/primary/ output is placed in ./timeor_example/test_timeor/timeor/results/primary/CONDITION_results/clusters/NUM/MEME)
Command:
./run_meme.sh ./timeor_example/test_timeor/timeor/results/primary/
Outputs:
For each cluster in each MEME subfolder (e.g. cluster/1/MEME/
) MEME outputs a meme.txt
if no enriched de novo motif was identified. Otherwise MEME also outputs both a logo1.png
and logo_rc1.png
and associated eps
files. MEME also outputs a meme.xml
and meme.html
, highlighting the locations of the enriched motif in each relevant sequence.
Long output
insulin_stim /Users/ashleymaeconard/Desktop/test_results//insulin_stim_results/clusters/1/MEME/insulin_stim_cluster_zs_geneList_1_DNAseqs.txt rm: /Users/ashleymaeconard/Desktop/test_results//insulin_stim_results/clusters/1/MEME/meme.txt: No such file or directory /Users/ashleymaeconard/Desktop/test_results//insulin_stim_results/clusters/1/MEME/meme.txt 76093 The output directory '/Users/ashleymaeconard/Desktop/test_results//insulin_stim_results/clusters/1/MEME/' already exists. Its contents will be overwritten. Initializing the motif probability tables for 2 to 50 sites... nsites = 50 Done initializing. SEEDS: highwater mark: seq 21 pos 2247 seqs= 22, min=2021, max= 2960, total= 51653 motif=1 SEED WIDTHS: 8 11 15 21 29 41 50 em: w= 50, psites= 50, iter= 40
Use TF information: Average Profiles Across Gene Clusters (from ChIP-seq)
Parameters:
Usage: ./create_avg_profs.sh
1) /PATH/TO/ENCFF.bigWig
2) TF_NAME
3) /PATH/TO/RESULTS/PRIMARY/ (e.g. /timeor/results/primary/)
4) /PATH/TO/OUTPUT_DIR/
Command:
./create_avg_prof.sh ~/Desktop/test_create_avg_profs/ENCFF829HXS.bigWig MYC ./timeor_example/test_timeor/timeor/results/primary/ ./timeor_example/test_timeor/timeor/results/primary/
Outputs:
Both avg_profile.<TF_NAME>.png
and heatmap_genes.cluster.<TF_NAME>.png
are output in the last argument directory.
Long output
Saved heatmap.clusters.MYC.png and average avg_profile.MYC.png in /Users/ashleymaeconard/Desktop/test_create_avg_profs/