
The Parameters of the R Package TCGAbiolinks

Published: 2022-03-21 10:08:58  Source: 億速云  Views: 710  Author: iii  Category: Development

This article walks through the parameters of the R package TCGAbiolinks. The topic is unfamiliar to many users, so the steps and options are covered in detail for reference.

1. Installing the package:

# Use the Tsinghua (TUNA) mirror for CRAN
local({
  r <- getOption("repos")
  r["CRAN"] <- "http://mirrors.tuna.tsinghua.edu.cn/CRAN/"
  options(repos = r)
})

# Install BiocManager if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
# Use the Tsinghua (TUNA) mirror for Bioconductor
options(BioC_mirror = "https://mirrors.tuna.tsinghua.edu.cn/bioconductor")
BiocManager::install("TCGAbiolinks")
library(TCGAbiolinks)

2. Downloading data with TCGAbiolinks

Downloading data takes three steps, each handled by one TCGAbiolinks function:

1) Query the data: GDCquery()

2) Download the data: GDCdownload()

3) Prepare and save the data: GDCprepare()
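
The three steps chain together directly. The following is a minimal sketch, not the package's prescribed workflow: `fetch_tcga` is a hypothetical helper name introduced here, the example query values are the ones used in the simple example in this article, and actually calling it assumes TCGAbiolinks is installed and the GDC servers are reachable.

```r
# Sketch of the query -> download -> prepare workflow.
# `fetch_tcga` is a hypothetical wrapper; nothing touches the network
# until the function is called.
fetch_tcga <- function(project, data.category, data.type,
                       directory = "GDCdata") {
  # Step 1: describe which files are wanted
  query <- TCGAbiolinks::GDCquery(project = project,
                                  data.category = data.category,
                                  data.type = data.type)
  # Step 2: download the files into `directory`
  TCGAbiolinks::GDCdownload(query, directory = directory)
  # Step 3: read the downloaded files into an R object
  TCGAbiolinks::GDCprepare(query, directory = directory)
}

# Example call (not run here):
# seg <- fetch_tcga("TCGA-ACC", "Copy Number Variation", "Copy Number Segment")
```

Passing the same `directory` to both the download and prepare steps matters, because the prepare step looks for the files where the download step put them.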

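Of these three steps, this article concentrates on GDCquery(). It has the most parameters (12), and each parameter accepts many possible values. The other two functions are comparatively simple to use. The usage and the parameter descriptions follow: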

GDCquery(project, data.category, data.type, workflow.type,
  legacy = FALSE, access, platform, file.type, barcode, data.format,
  experimental.strategy, sample.type)

A simple example:

query <- GDCquery(project = "TCGA-ACC",
                  data.category = "Copy Number Variation",
                  data.type = "Copy Number Segment")

GDCquery parameter reference:

1.project

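getGDCprojects()$project_id returns the current list of project IDs available from the GDC, including one for each TCGA cancer type. For the cancer name that corresponds to each project ID, see: https://www.億速云.com/article/1061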

> getGDCprojects()$project_id
 [1] "TCGA-MESO"             "TCGA-READ"             "TCGA-SARC"            
 [4] "TCGA-ACC"              "TCGA-LGG"              "TCGA-THCA"            
 [7] "TARGET-CCSK"           "TARGET-NBL"            "BEATAML1.0-CRENOLANIB"
[10] "TARGET-AML"            "TCGA-SKCM"             "TCGA-CHOL"            
[13] "TCGA-KIRC"             "TCGA-BRCA"             "VAREPOP-APOLLO"       
[16] "HCMI-CMDC"             "ORGANOID-PANCREATIC"   "TCGA-GBM"             
[19] "TCGA-OV"               "FM-AD"                 "TCGA-UCEC"            
[22] "TARGET-ALL-P3"         "CGCI-BLGSP"            "TARGET-ALL-P2"        
[25] "TCGA-LAML"             "TCGA-DLBC"             "TCGA-KICH"            
[28] "TCGA-THYM"             "TCGA-UVM"              "TCGA-PRAD"            
[31] "TCGA-LUSC"             "TCGA-TGCT"             "CPTAC-3"              
[34] "BEATAML1.0-COHORT"     "TCGA-STAD"             "TCGA-LIHC"            
[37] "TCGA-COAD"             "TARGET-OS"             "TARGET-RT"            
[40] "CTSP-DLBCL1"           "TCGA-HNSC"             "TCGA-ESCA"            
[43] "TCGA-CESC"             "TCGA-PCPG"             "TCGA-KIRP"            
[46] "TCGA-UCS"              "TCGA-PAAD"             "TCGA-LUAD"            
[49] "TARGET-WT"             "MMRF-COMMPASS"         "TCGA-BLCA"            
[52] "NCICCR-DLBCL"          "TARGET-ALL-P1"

2.data.category

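TCGAbiolinks:::getProjectSummary(project) shows which data categories a project contains. For "TCGA-ACC", for example, there are 7 data categories; case_count is the number of patients and file_count is the number of files in each category. To download expression profiles, set data.category = "Transcriptome Profiling":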

> TCGAbiolinks:::getProjectSummary("TCGA-ACC")
$data_categories
  case_count file_count               data_category
1         80        397     Transcriptome Profiling
2         92        361       Copy Number Variation
3         92        744 Simple Nucleotide Variation
4         80         80             DNA Methylation
5         92        105                    Clinical
6         92        352            Sequencing Reads
7         92        517                 Biospecimen
$case_count
[1] 92
$file_count
[1] 2556
$file_size
[1] 3.920606e+12
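
As a quick sanity check on how to read this summary, the sketch below rebuilds the data_categories table by hand from the printed output (in a live session it would come from `TCGAbiolinks:::getProjectSummary("TCGA-ACC")$data_categories`). The variable names are illustrative only.

```r
# data_categories for TCGA-ACC, copied from the output printed above
cats <- data.frame(
  case_count = c(80, 92, 92, 80, 92, 92, 92),
  file_count = c(397, 361, 744, 80, 105, 352, 517),
  data_category = c("Transcriptome Profiling", "Copy Number Variation",
                    "Simple Nucleotide Variation", "DNA Methylation",
                    "Clinical", "Sequencing Reads", "Biospecimen"),
  stringsAsFactors = FALSE
)

# The per-category file counts add up to the project-level $file_count
total_files <- sum(cats$file_count)

# Categories that do not cover all 92 cases in the project
partial <- cats$data_category[cats$case_count < 92]

# $file_size is in bytes; convert to terabytes (10^12 bytes)
size_tb <- 3.920606e+12 / 1e12

print(total_files)
print(partial)
print(size_tb)
```

The check shows that only 80 of the 92 patients have expression and methylation data, which is worth knowing before planning an analysis that combines categories.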

3.data.type

This parameter depends on the previous one: each data.category has its own set of valid data.type values.

For expression data, common settings are:
 # RNA-seq transcriptome expression data
 data.type = "Gene Expression Quantification"
 # miRNA expression data
 data.type = "miRNA Expression Quantification"
 # Copy Number Variation data
 data.type = "Copy Number Segment"

4.workflow.type

This parameter depends on the previous two: each combination of data.category and data.type has its own set of valid workflow.type values.

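5.legacy

TCGA data can be downloaded from two sources, the GDC Legacy Archive and the GDC Data Portal, and this parameter selects between them. The official explanation of the difference between Legacy and Harmonized data is quoted below. In short, Legacy data uses hg19 and hg18 as reference genomes (old data) and is no longer updated, while Harmonized data uses hg38 as the reference genome (new data). Harmonized is now the usual choice.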

Different sources: Legacy vs Harmonized
There are two available sources to download GDC data using TCGAbiolinks:
  • GDC Legacy Archive: provides access to an unmodified copy of data that was previously stored in CGHub and in the TCGA Data Portal hosted by the TCGA Data Coordinating Center (DCC), which uses GRCh37 (hg19) and GRCh36 (hg18) as references.

  • GDC harmonized database: the data available was harmonized against GRCh38 (hg38) using GDC Bioinformatics Pipelines, which provide methods for the standardization of biospecimen and clinical data.

legacy can be set to TRUE or FALSE:
Harmonized data options (legacy = FALSE)
Legacy archive data options (legacy = TRUE)

The two sources (Legacy and Harmonized) do not hold the same data, so this choice also changes which values are valid for the earlier parameters such as data.category and data.type.

6.access

Filter by access type. Possible values: controlled, open. This filters on whether the data is openly accessible. Controlled-access data is usually not needed, so this is typically set to access = "open".

7.platform

Filters by the platform that produced the data, such as a particular microarray or methylation platform. It is usually left unset unless data from a specific platform is needed:

Example values:

  CGH- 1x1M_G4447A                     IlluminaGA_RNASeqV2
  AgilentG4502A_07                     IlluminaGA_mRNA_DGE
  Human1MDuo                           HumanMethylation450
  HG-CGH-415K_G4124A                   IlluminaGA_miRNASeq
  HumanHap550                          IlluminaHiSeq_miRNASeq
  ABI                                  H-miRNA_8x15K
  HG-CGH-244A                          SOLiD_DNASeq
  IlluminaDNAMethylation_OMA003_CPI    IlluminaGA_DNASeq_automated
  IlluminaDNAMethylation_OMA002_CPI    HG-U133_Plus_2
  HuEx- 1_0-st-v2                      Mixed_DNASeq
  H-miRNA_8x15Kv2                      IlluminaGA_DNASeq_curated
  MDA_RPPA_Core                        IlluminaHiSeq_TotalRNASeqV2
  HT_HG-U133A                          IlluminaHiSeq_DNASeq_automated
  diagnostic_images                    microsat_i
  IlluminaHiSeq_RNASeq                 SOLiD_DNASeq_curated
  IlluminaHiSeq_DNASeqC                Mixed_DNASeq_curated
  IlluminaGA_RNASeq                    IlluminaGA_DNASeq_Cont_automated
  IlluminaGA_DNASeq                    IlluminaHiSeq_WGBS
  pathology_reports                    IlluminaHiSeq_DNASeq_Cont_automated
  Genome_Wide_SNP_6                    bio
  tissue_images                        Mixed_DNASeq_automated
  HumanMethylation27                   Mixed_DNASeq_Cont_curated
  IlluminaHiSeq_RNASeqV2               Mixed_DNASeq_Cont

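8.file.type

This parameter usually does not need to be set. It is used in the legacy queries (legacy = TRUE) shown in the examples later in this article.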

9.barcode

A list of barcodes to filter the files to download. Use it to restrict the query to specific samples, for example:

barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01")
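
A TCGA barcode encodes what the sample is, which helps when building a barcode filter. The sketch below splits full aliquot barcodes into the fields of the standard TCGA barcode layout (project, tissue source site, participant, sample type and vial, portion and analyte, plate, center). `parse_tcga_barcode` is a hypothetical helper written for this illustration, not a TCGAbiolinks function, and it assumes complete 7-field TCGA barcodes; shorter barcodes such as the TARGET ones used elsewhere in this article are rejected.

```r
# Split full TCGA aliquot barcodes into their fields.
# Hypothetical helper; assumes 7 dash-separated fields per barcode.
parse_tcga_barcode <- function(barcode) {
  parts <- strsplit(barcode, "-", fixed = TRUE)
  stopifnot(all(lengths(parts) == 7))
  m <- do.call(rbind, parts)          # one row per barcode, 7 columns
  data.frame(
    barcode     = barcode,
    project     = m[, 1],
    tss         = m[, 2],             # tissue source site
    participant = m[, 3],
    sample      = substr(m[, 4], 1, 2),  # sample type code
    vial        = substr(m[, 4], 3, 3),
    portion     = substr(m[, 5], 1, 2),
    analyte     = substr(m[, 5], 3, 3),  # e.g. R = RNA, D = DNA
    plate       = m[, 6],
    center      = m[, 7],
    stringsAsFactors = FALSE
  )
}

barcodes <- c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01")
info <- parse_tcga_barcode(barcodes)

# Patient-level identifiers are the first 12 characters
patients <- substr(barcodes, 1, 12)
print(info)
print(patients)
```

Both example barcodes carry sample code 02 and analyte R, that is, RNA from recurrent tumor samples.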

10.data.format

Filters by file format. Possible values: ("VCF", "TXT", "BAM", "SVS", "BCR XML", "BCR SSF XML", "TSV", "BCR Auxiliary XML", "BCR OMF XML", "BCR Biotab", "MAF", "BCR PPS XML", "XLSX"). It usually does not need to be set; the default is fine.

11.experimental.strategy

Filters the data by the experimental method that produced it:

Harmonized: WXS, RNA-Seq, miRNA-Seq, Genotyping Array.

Legacy: WXS, RNA-Seq, miRNA-Seq, Genotyping Array, DNA-Seq, Methylation array, Protein expression array, CGH array, VALIDATION, Gene expression array, WGS, MSI-Mono-Dinucleotide Assay, miRNA expression array, Mixed strategies, AMPLICON, Exon array, Total RNA-Seq, Capillary sequencing, Bisulfite-Seq

12.sample.type

Filters by sample type, for example primary tumor tissue, recurrent tumor, and so on.
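
The sample type is also readable from a TCGA barcode: characters 14 and 15 hold a two-digit sample type code. The sketch below maps a few of these codes to their names and filters barcodes locally. `sample_type_codes` and `sample_type_of` are hypothetical names for this illustration; the table covers only a handful of codes, and the labels are from the TCGA code table, so the exact strings that GDCquery accepts for sample.type may be spelled differently.

```r
# A few TCGA sample type codes (characters 14-15 of the barcode).
# Partial table for illustration only.
sample_type_codes <- c(
  "01" = "Primary Solid Tumor",
  "02" = "Recurrent Solid Tumor",
  "06" = "Metastatic",
  "10" = "Blood Derived Normal",
  "11" = "Solid Tissue Normal"
)

# Look up the sample type name for each barcode (NA for unlisted codes)
sample_type_of <- function(barcode) {
  unname(sample_type_codes[substr(barcode, 14, 15)])
}

barcodes <- c("TCGA-14-0736-02A-01R-2005-01",
              "TCGA-OR-A5LR-01A-11D-A29H-01")
types <- sample_type_of(barcodes)

# Keep only primary tumor samples; which() drops NA matches safely
primary <- barcodes[which(types == "Primary Solid Tumor")]
print(types)
print(primary)
```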

With all the parameters covered, here are some usage examples:

query <- GDCquery(project = "TCGA-ACC",
                  data.category = "Copy Number Variation",
                  data.type = "Copy Number Segment")
## Not run: 
query <- GDCquery(project = "TARGET-AML",
                  data.category = "Transcriptome Profiling",
                  data.type = "miRNA Expression Quantification",
                  workflow.type = "BCGSC miRNA Profiling",
                  barcode = c("TARGET-20-PARUDL-03A-01R","TARGET-20-PASRRB-03A-01R"))
query <- GDCquery(project = "TARGET-AML",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts",
                  barcode = c("TARGET-20-PADZCG-04A-01R","TARGET-20-PARJCR-09A-01R"))
query <- GDCquery(project = "TCGA-ACC",
                  data.category =  "Copy Number Variation",
                  data.type = "Masked Copy Number Segment",
                  sample.type = c("Primary solid Tumor"))
query.met <- GDCquery(project = c("TCGA-GBM","TCGA-LGG"),
                      legacy = TRUE,
                      data.category = "DNA methylation",
                      platform = "Illumina Human Methylation 450")
query <- GDCquery(project = "TCGA-ACC",
                  data.category =  "Copy number variation",
                  legacy = TRUE,
                  file.type = "hg19.seg",
                  barcode = c("TCGA-OR-A5LR-01A-11D-A29H-01"))

Downloading data: GDCdownload()

Once the GDCquery() call has finished, GDCdownload() downloads the data. If a large download is interrupted partway, run GDCdownload() again to continue, and repeat until everything has been downloaded. For example:

query <- GDCquery(project = "TCGA-GBM",
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq",
                  file.type = "normalized_results",
                  experimental.strategy = "RNA-Seq",
                  barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"),
                  legacy = TRUE)
GDCdownload(query, method = "client", files.per.chunk = 10, directory = "D:/data")

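The main parameters to set are:

  1. method: if set to "client", the directory containing the gdc-client program must be added to the PATH environment variable; see "Downloading TCGA data with gdc-client";

  2. query: the result returned by GDCquery();

  3. files.per.chunk = 10: the number of files downloaded at a time; use a smaller value on a slow connection;

  4. directory = "D:/data": the directory where the data is stored.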
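
Re-running an interrupted download by hand can be automated. The following is a minimal sketch of a retry loop, under the assumption that a failed download surfaces as an R error; `retry_download`, `max_tries`, and `wait_seconds` are hypothetical names introduced here and are not part of TCGAbiolinks.

```r
# Call `download_fn` until it succeeds or `max_tries` is reached.
# Returns (invisibly) the number of the attempt that succeeded.
retry_download <- function(download_fn, max_tries = 5, wait_seconds = 0) {
  for (attempt in seq_len(max_tries)) {
    ok <- tryCatch({
      download_fn()
      TRUE
    }, error = function(e) {
      message("Attempt ", attempt, " failed: ", conditionMessage(e))
      FALSE
    })
    if (ok) return(invisible(attempt))
    Sys.sleep(wait_seconds)   # pause before the next attempt
  }
  stop("Download did not complete after ", max_tries, " attempts")
}

# Intended use (not run here; needs TCGAbiolinks, a `query`, and network access):
# retry_download(function() GDCdownload(query, directory = "D:/data"),
#                max_tries = 10, wait_seconds = 30)
```

Wrapping the call in a function keeps the retry logic independent of the download arguments, so the same helper works for any query.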

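Preparing data: GDCprepare()

GDCprepare() reads the downloaded files and assembles the gene expression data automatically:

data <- GDCprepare(query = query,
                   save = TRUE,
                   directory = "D:/data",         # must match the directory passed to GDCdownload() so GDCprepare() can find the files
                   save.filename = "GBM.RData")   # save the result so it can be loaded directly next time

With data in hand, downstream analysis and data mining can begin.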

That covers the parameters of the R package TCGAbiolinks. Thanks for reading!
