mirna-quant-predict
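This workflow describes how to quantify miRNAs from small RNA-seq FASTQ files and then run miRNA target prediction with RNAhybrid and miRanda.
1 miRNA quantification
miRNA quantification here uses miRDeep2 (rajewsky-lab/mirdeep2), developed by Sebastian Mackowiak and Marc Friedländer. The tool aligns and parses small RNA-seq reads in FASTA format (or sequences converted from FASTQ), identifies mature miRNAs against a known miRNA reference such as miRBase, and computes the expression level of each miRNA.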
Marc R. Friedländer, Sebastian D. Mackowiak, Na Li, Wei Chen, Nikolaus Rajewsky, miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades, Nucleic Acids Research, Volume 40, Issue 1, 1 January 2012, Pages 37–52, https://doi.org/10.1093/nar/gkr688
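miRDeep2 can also predict potential novel miRNAs. Its core workflow covers read quality processing, mapping to a reference genome, recognition of precursor miRNA hairpin structures, and expression estimation. The final output includes read counts, normalized expression, and related quality metrics for mature miRNAs, which feed into downstream differential expression analysis and functional annotation.
My fork only adds a Docker environment and a workflow suited to human samples; see benson1231/mirdeep2 for the detailed analysis steps.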
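To make the idea of normalized expression concrete, the sketch below scales raw read counts to reads per million (RPM). It is a generic illustration and does not reproduce miRDeep2's own normalization. The function name `reads_per_million` and the counts are hypothetical.

```python
def reads_per_million(counts):
    """Scale raw read counts so that the values sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("total read count is zero; nothing to normalize")
    return {name: count / total * 1_000_000 for name, count in counts.items()}


# Hypothetical raw counts for three mature miRNAs in one sample.
raw_counts = {
    "hsa-miR-21-5p": 6000,
    "hsa-miR-146a-5p": 3000,
    "hsa-miR-155-5p": 1000,
}

rpm = reads_per_million(raw_counts)
for name, value in rpm.items():
    print(f"{name}\t{value:.1f}")
```

Dividing by the sample total makes values comparable across libraries of different sequencing depth, which is usually the first step before any differential expression comparison.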
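2 miRNA target prediction tools
Research on miRNA target prediction falls broadly into three main approaches:
- Algorithms based on sequence complementarity and binding energy, for example RNAhybrid and miRanda
- Predictions based on evolutionary conservation or machine learning models, for example TargetScan and miRDB
- Databases of experimentally validated interactions, for example miRTarBase and starBase
Results are usually cross-checked against databases or other prediction tools to reduce false positives.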
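The cross-checking step can be sketched as a set intersection over (miRNA, gene) pairs. This is a minimal sketch that assumes each tool's output has already been reduced to such pairs. The function `consensus_pairs` and all pairs below are hypothetical placeholders, not real predictions.

```python
def consensus_pairs(*prediction_sets):
    """Return the (miRNA, gene) pairs present in every prediction set."""
    if not prediction_sets:
        return set()
    common = set(prediction_sets[0])
    for pairs in prediction_sets[1:]:
        common &= set(pairs)
    return common


# Hypothetical pairs standing in for parsed tool output and a validated database.
rnahybrid_hits = {
    ("hsa-miR-21-5p", "PTEN"),
    ("hsa-miR-21-5p", "TP53"),
    ("hsa-miR-155-5p", "BRCA1"),
}
miranda_hits = {
    ("hsa-miR-21-5p", "PTEN"),
    ("hsa-miR-155-5p", "BRCA1"),
    ("hsa-miR-146a-5p", "TP53"),
}
validated = {("hsa-miR-21-5p", "PTEN")}

shared = consensus_pairs(rnahybrid_hits, miranda_hits)  # predicted by both tools
supported = shared & validated                          # also experimentally validated
print(sorted(shared))
print(sorted(supported))
```

Requiring agreement across tools trades sensitivity for specificity: pairs reported by only one algorithm are dropped, so the intersection is best treated as a shortlist rather than a complete answer.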
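This section first shows how to run miRNA–mRNA pairing analysis with the RNAhybrid and miRanda algorithms to predict potential target genes.
2.1 Preprocessing
2.1.1 Data format
Two FASTA files are required, mirna.fa and target.fa:
- mirna.fa: identified miRNA sequences (query), for example mature miRNAs from small RNA-seq or miRBase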
>hsa-miR-21-5p
UAGCUUAUCAGACUGAUGUUGA
>hsa-miR-146a-5p
UGAGAACUGAAUUCCAUGGGUU
>hsa-miR-155-5p
UUAAUGCUAAUCGUGAUAGGGGU
- target.fa: the target sequences to test for binding (target), typically 3′UTR sequences, mRNA sequences, or candidate gene regions
>TP53
AGTCTGACTGACTGACTGACTGACTGACTGATCGATCGATCGATCGATCGACTGACTGACTGACTGACTGA
>BRCA1
TGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGA
>PTEN
CGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
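Before running a predictor it can help to confirm that both files parse cleanly. The sketch below reads FASTA text into a dictionary and reports characters outside an expected alphabet. The helper names `read_fasta` and `invalid_characters` are hypothetical, and the accepted alphabet (A, C, G, U, T, N) is an assumption for illustration.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into a dict of {record name: sequence}."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("empty FASTA header line")
            name = header[0]
            if name in records:
                raise ValueError(f"duplicate record name: {name}")
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first header line")
        else:
            records[name] += line.upper()
    return records


def invalid_characters(sequence, alphabet="ACGUTN"):
    """Return the sorted characters of `sequence` that are not in `alphabet`."""
    return sorted(set(sequence) - set(alphabet))


# Two records copied from the mirna.fa example above.
mirna_text = """>hsa-miR-21-5p
UAGCUUAUCAGACUGAUGUUGA
>hsa-miR-146a-5p
UGAGAACUGAAUUCCAUGGGUU
"""

mirnas = read_fasta(io.StringIO(mirna_text))
for name, seq in mirnas.items():
    print(name, len(seq), invalid_characters(seq))
```

For real files, `read_fasta(open("data/mirna.fa"))` would work the same way, since the function only iterates over lines.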
2.1.2 Setting up the analysis environment
Create the environment with mamba; for mamba usage, see the mamba introduction.
mamba env create -f https://raw.githubusercontent.com/benson1231/tools/main/envs/mirna-predict.yaml
mamba activate mirna-predict
2.2 Using the RNAhybrid algorithm
Krüger, Jan, and Marc Rehmsmeier. “RNAhybrid: microRNA target prediction easy, fast and flexible.” Nucleic acids research vol. 34,Web Server issue (2006): W451-4. doi:10.1093/nar/gkl243
RNAhybrid -s 3utr_human -t data/target.fa -q data/mirna.fa > results/hybrid.out
target: TP53
length: 71
miRNA : hsa-miR-21-5p
length: 22
mfe: -15.2 kcal/mol
p-value: 0.993854
position 1
target 5' C A 3'
AGUCUGA UGA CUG
UCAGACU AUU GAU
miRNA 3' AGUUGUAG C 5'
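The text report above can be turned into records for filtering. The sketch below assumes the field labels appear exactly as in the excerpt (`target:`, `miRNA :`, `mfe:`, `p-value:`, `position`). The parser `parse_rnahybrid` is a hypothetical helper, not part of RNAhybrid, and the 0.05 cut-off is only an illustrative choice.

```python
import re

TARGET_PATTERN = re.compile(r"^target\s*:\s*(\S+)")
FIELD_PATTERNS = {
    "mirna": re.compile(r"^miRNA\s*:\s*(\S+)"),
    "mfe": re.compile(r"^mfe\s*:\s*(\S+)"),
    "p_value": re.compile(r"^p-value\s*:\s*(\S+)"),
    "position": re.compile(r"^position\s+(\d+)"),
}
CONVERTERS = {"mfe": float, "p_value": float, "position": int}


def parse_rnahybrid(text):
    """Collect one dict per hit; a 'target:' line starts a new hit."""
    hits = []
    for raw in text.splitlines():
        line = raw.strip()
        match = TARGET_PATTERN.match(line)
        if match:
            hits.append({"target": match.group(1)})
            continue
        if not hits:
            continue  # ignore anything before the first hit
        for field, pattern in FIELD_PATTERNS.items():
            match = pattern.match(line)
            if match:
                convert = CONVERTERS.get(field, str)
                hits[-1][field] = convert(match.group(1))
                break
    return hits


# Excerpt of the report shown above; alignment rows carry no colon and are skipped.
report = """target: TP53
length: 71
miRNA : hsa-miR-21-5p
length: 22

mfe: -15.2 kcal/mol
p-value: 0.993854

position  1
target 5' C A 3'
miRNA  3' AGUUGUAG C 5'
"""

hits = parse_rnahybrid(report)
significant = [h for h in hits if h.get("p_value", 1.0) <= 0.05]
print(hits)
print(significant)
```

With a p-value of 0.993854, the example hit does not pass the cut-off, which is expected for the artificial repeat sequences used in target.fa.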
Run RNAhybrid -h to see the usage information:
Usage: RNAhybrid [options] [target sequence] [query sequence].
options:
-b <number of hits per target>
-c compact output
-d <xi>,<theta>
-f helix constraint
-h help
-m <max targetlength>
-n <max query length>
-u <max internal loop size (per side)>
-v <max bulge loop size>
-e <energy cut-off>
-p <p-value cut-off>
-s (3utr_fly|3utr_worm|3utr_human)
-g (ps|png|jpg|all)
-t <target file>
-q <query file>
Either a target file has to be given (FASTA format)
or one target sequence directly.
Either a query file has to be given (FASTA format)
or one query sequence directly.
The helix constraint format is "from,to", eg. -f 2,7 forces
structures to have a helix from position 2 to 7 with respect to the query.
<xi> and <theta> are the position and shape parameters, respectively,
of the extreme value distribution assumed for p-value calculation.
If omitted, they are estimated from the maximal duplex energy of the query.
In that case, a data set name has to be given with the -s flag.
PS graphical output not supported.
2.3 Using the miRanda algorithm
Enright, Anton J et al. “MicroRNA targets in Drosophila.” Genome biology vol. 5,1 (2003): R1. doi:10.1186/gb-2003-5-1-r1
miranda data/mirna.fa data/target.fa -out results/miranda.out
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
miranda v3.3a microRNA Target Scanning Algorithm
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
(c) 2003 Memorial Sloan-Kettering Cancer Center, New York
Authors: Anton Enright, Bino John, Chris Sander and Debora Marks
(mirnatargets (at) cbio.mskcc.org - reaches all authors)
Software written by: Anton Enright
Distributed for anyone to use under the GNU Public License (GPL),
See the files 'COPYING' and 'LICENSE' for details
If you use this software please cite:
Enright AJ, John B, Gaul U, Tuschl T, Sander C and Marks DS;
(2003) Genome Biology; 5(1):R1.
miranda comes with ABSOLUTELY NO WARRANTY;
This is free software, and you are welcome to redistribute it
under certain conditions; type `miranda --license' for details.
Current Settings:
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Query Filename: data/mirna.fa
Reference Filename: data/target.fa
Gap Open Penalty: -9.000000
Gap Extend Penalty: -4.000000
Score Threshold: 140.000000
Energy Threshold: 1.000000 kcal/mol
Scaling Parameter: 4.000000
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Read Sequence:hsa-miR-21-5p (22 nt)
Read Sequence:TP53 (71 nt)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Performing Scan: hsa-miR-21-5p vs TP53
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Score for this Scan:
No Hits Found above Threshold
Complete
Read Sequence:BRCA1 (63 nt)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Performing Scan: hsa-miR-21-5p vs BRCA1
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Score for this Scan:
No Hits Found above Threshold
Complete
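When many miRNA and target pairs are scanned, it helps to see quickly which scans reported nothing. The sketch below relies only on two lines visible in the output above, `Performing Scan: X vs Y` and `No Hits Found above Threshold`. The function `summarize_scans` is a hypothetical helper, and the format of lines that do report hits is deliberately not parsed here.

```python
import re

SCAN_PATTERN = re.compile(r"^Performing Scan:\s*(\S+)\s+vs\s+(\S+)")


def summarize_scans(text):
    """Map each (miRNA, target) scan to True if it reported no hits."""
    summary = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = SCAN_PATTERN.match(line)
        if match:
            current = (match.group(1), match.group(2))
            summary[current] = False  # until a "No Hits" line says otherwise
        elif line.startswith("No Hits Found above Threshold") and current is not None:
            summary[current] = True
    return summary


# Excerpt of the miranda output shown above.
sample_output = """\
Performing Scan: hsa-miR-21-5p vs TP53
Score for this Scan:
No Hits Found above Threshold
Complete
Performing Scan: hsa-miR-21-5p vs BRCA1
Score for this Scan:
No Hits Found above Threshold
Complete
"""

summary = summarize_scans(sample_output)
for (mirna, target), no_hits in summary.items():
    print(mirna, target, "no hits" if no_hits else "check report")
```

A value of False only means that no "No Hits" line followed the scan header; the report itself should be inspected for the actual hit details.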
Run miranda --help to see the help text:
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
miranda v3.3a microRNA Target Scanning Algorithm
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
(c) 2003 Memorial Sloan-Kettering Cancer Center, New York
Authors: Anton Enright, Bino John, Chris Sander and Debora Marks
(mirnatargets (at) cbio.mskcc.org - reaches all authors)
Software written by: Anton Enright
Distributed for anyone to use under the GNU Public License (GPL),
See the files 'COPYING' and 'LICENSE' for details
If you use this software please cite:
Enright AJ, John B, Gaul U, Tuschl T, Sander C and Marks DS;
(2003) Genome Biology; 5(1):R1.
miranda comes with ABSOLUTELY NO WARRANTY;
This is free software, and you are welcome to redistribute it
under certain conditions; type `miranda --license' for details.
miRanda is an miRNA target scanner which aims to predict mRNA
targets for microRNAs using dynamic-programming alignment and
thermodynamics.
Usage: miranda file1 file2 [options..]
Where:
'file1' is a FASTA file with a microRNA query
'file2' is a FASTA file containing the sequence(s)
to be scanned.
OPTIONS
--help -h Display this message
--version -v Display version information
--license Display license information
Core algorithm parameters:
-sc S Set score threshold to S [DEFAULT: 140.0]
-en -E Set energy threshold to -E kcal/mol [DEFAULT: 1.0]
-scale Z Set scaling parameter to Z [DEFAULT: 4.0]
-strict Demand strict 5' seed pairing [DEFAULT: off]
Alignment parameters:
-go -X Set gap-open penalty to -X [DEFAULT: -4.0]
-ge -Y Set gap-extend penalty to -Y [DEFAULT: -9.0]
General Options:
-out file Output results to file [DEFAULT: off]
-quiet Output fewer event notifications [DEFAULT: off]
-trim T Trim reference sequences to T nt [DEFAULT: off]
-noenergy Do not perform thermodynamics [DEFAULT: off]
-restrict file Restricts scans to those between
specific miRNAs and UTRs
provided in a pairwise
tab-separated file [DEFAULT: off]
Other Options:
-keyval Key value pairs ?? [DEFAULT:]
This software will be further developed under the open source model,
coordinated by Anton Enright and Chris Sander (miranda (at) cbio.mskcc.org).
Please send bug reports to: miranda (at) cbio.mskcc.org.
3 miRNA target prediction platforms
3.1 TargetScan
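TargetScan is one of the most widely used miRNA target prediction tools. It was first proposed by Benjamin P. Lewis et al. in Cell (2003). Its core idea is that a miRNA recognizes and regulates its target through Watson-Crick pairing between the seed region (nucleotides 2–8) and the mRNA 3′UTR. The algorithm combines seed matches, RNA structural stability (free energy), the number of binding sites, and cross-species conservation to predict biologically meaningful miRNA–target interactions. Later versions added metrics such as context score, site accessibility, and UTR features, and integrated multi-species genome data to improve prediction accuracy and biological relevance. TargetScan places particular weight on evolutionary conservation in target recognition, which makes it a valuable reference for functional studies and mechanistic work.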
Lewis, B. P., Shih, I. H., Jones-Rhoades, M. W., Bartel, D. P., & Burge, C. B. (2003). Prediction of mammalian microRNA targets. Cell, 115(7), 787–798. https://doi.org/10.1016/s0092-8674(03)01018-3
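The seed-pairing idea can be illustrated with a plain string search: take miRNA nucleotides 2–8, reverse-complement them, and look for that site in the 3′UTR. This is a minimal sketch of Watson-Crick seed matching only. It is not TargetScan's scoring, and the function names and the UTR sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(mirna):
    """Return the target-side RNA sequence (5'->3') that pairs with miRNA nucleotides 2-8."""
    seed = mirna.upper().replace("T", "U")[1:8]  # 1-based positions 2..8
    return seed.translate(COMPLEMENT)[::-1]      # reverse complement


def find_seed_matches(mirna, utr):
    """Return 1-based start positions of perfect seed sites in `utr`."""
    utr = utr.upper().replace("T", "U")  # accept DNA or RNA input
    site = seed_site(mirna)
    positions = []
    start = utr.find(site)
    while start != -1:
        positions.append(start + 1)
        start = utr.find(site, start + 1)
    return positions


mir21 = "UAGCUUAUCAGACUGAUGUUGA"             # hsa-miR-21-5p, from mirna.fa above
hypothetical_utr = "GGCATAAGCTACCGATAAGCTTT"  # made-up sequence containing two sites

print(seed_site(mir21))
print(find_seed_matches(mir21, hypothetical_utr))
```

A perfect seed match is only a starting point: the conservation and context features described above are what separate likely functional sites from chance matches.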
3.2 miRWalk
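miRWalk provides comprehensive miRNA–gene interaction predictions and integrated data covering the full length of the mRNA (3′UTR, 5′UTR, and coding sequence). The platform integrates multiple prediction algorithms and published research data, and offers disease association and functional analysis information, which helps in exploring miRNA regulatory networks and biological function.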
Dweep, H., Sticht, C., Pandey, P., & Gretz, N. (2011). miRWalk–database: prediction of possible miRNA binding sites by “walking” the genes of three genomes. Journal of biomedical informatics, 44(5), 839–847. https://doi.org/10.1016/j.jbi.2011.05.002
3.3 miRDB
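miRDB is a miRNA target prediction database built on a machine learning-based algorithm, with a model trained on large amounts of experimental data to predict miRNA–target interactions. Each prediction carries a score that can be used to select high-confidence candidate genes, and the database is often used in miRNA functional studies and candidate target screening.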
Yuhao Chen, Xiaowei Wang, miRDB: an online database for prediction of functional microRNA targets, Nucleic Acids Research, Volume 48, Issue D1, 08 January 2020, Pages D127–D131, https://doi.org/10.1093/nar/gkz757
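Score-based screening can be sketched as a threshold filter followed by a sort. The record layout, the function `filter_by_score`, the scores, and the cut-off of 80 are all assumptions for illustration. The actual export format and recommended threshold should be taken from the miRDB site.

```python
def filter_by_score(predictions, min_score=80.0):
    """Keep predictions scoring at least `min_score`, highest score first."""
    kept = [p for p in predictions if p["score"] >= min_score]
    return sorted(kept, key=lambda p: p["score"], reverse=True)


# Hypothetical records; the scores are invented for this example.
predictions = [
    {"mirna": "hsa-miR-21-5p", "gene": "GENE_A", "score": 92.0},
    {"mirna": "hsa-miR-21-5p", "gene": "GENE_B", "score": 61.5},
    {"mirna": "hsa-miR-21-5p", "gene": "GENE_C", "score": 85.3},
]

candidates = filter_by_score(predictions)
for record in candidates:
    print(record["mirna"], record["gene"], record["score"])
```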
4 miRNA databases
4.1 miRTarBase
miRTarBase is a database dedicated to experimentally validated microRNA–target interactions (MTIs).
Shidong Cui, Sicong Yu, Hsi-Yuan Huang, Yang-Chi-Dung Lin, Yixian Huang, Bojian Zhang, Jihan Xiao, Huali Zuo, Jiayi Wang, Zhuoran Li, Guanghao Li, Jiajun Ma, Baiming Chen, Haoxuan Zhang, Jiehui Fu, Liang Wang, Hsien-Da Huang, miRTarBase 2025: updates to the collection of experimentally validated microRNA–target interactions, Nucleic Acids Research, Volume 53, Issue D1, 6 January 2025, Pages D147–D156, https://doi.org/10.1093/nar/gkae1072
5 Overview of analysis tools
5.1 Tools4miRs
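Tools4miRs is a very thorough catalogue of miRNA analysis methods. It gathers nearly every tool that might be needed when working with miRNAs and compares their features in detail, so it is well worth consulting.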
Anna Lukasik, Maciej Wójcikowski, Piotr Zielenkiewicz, Tools4miRs – one place to gather all the tools for miRNA analysis, Bioinformatics, Volume 32, Issue 17, September 2016, Pages 2722–2724, https://doi.org/10.1093/bioinformatics/btw189