基因家族docker鏡像怎么使用

發(fā)布時間：2022-03-21 10:06:36 來源：億速云閱讀：154 作者：iii 欄目：開發(fā)技術(shù)
這篇文章主要講解了“基因家族docker鏡像怎么使用”，文中的講解內(nèi)容簡單清晰，易于學(xué)習(xí)與理解，下面請大家跟著小編的思路慢慢深入，一起來研究和學(xué)習(xí)“基因家族docker鏡像怎么使用”吧！
基因家族docker 鏡像使用方法
#################################################
#進(jìn)入docker虛擬機，注意自己安裝的docker版本
#################################################
#下載基因家族分析鏡像
#docker pull 億速云/gene-family:v1.0
#docker desktop 進(jìn)入虛擬機命令
#docker run -m 3G --cpus 2 --rm -v D:/gene-family:/work -it 億速云/gene-family:v1.0
#docker toolbox 進(jìn)入虛擬機命令
#docker run -m 3G --cpus 2 --rm -v /d/gene-family:/work -it 億速云/gene-family:v1.0
#############################################################################
#數(shù)據(jù)準(zhǔn)備
########################################################################
#下載擬南芥基因組信息
#wget ftp://ftp.ensemblgenomes.org/pub/plants/release-41/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
#wget ftp://ftp.ensemblgenomes.org/pub/plants/release-41/fasta/arabidopsis_thaliana/cds/Arabidopsis_thaliana.TAIR10.cds.all.fa.gz
#wget ftp://ftp.ensemblgenomes.org/pub/plants/release-41/fasta/arabidopsis_thaliana/pep/Arabidopsis_thaliana.TAIR10.pep.all.fa.gz
#wget ftp://ftp.ensemblgenomes.org/pub/plants/release-41/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.41.gff3.gz
#
#解壓壓縮文件
#gunzip *gz
#觀察數(shù)據(jù)，ID保持一致,也就是gff中第9列，ID標(biāo)簽和parent標(biāo)簽與蛋白序列和cds序列里面的ID是否一致；
#處理GFF 文件里面ID中一些不必要的信息，gene:  transcript: 刪除；與蛋白質(zhì)中的ID保持一致：Arabidopsis_thaliana.TAIR10.pep.all.fa 
#sed 's#gene:##' Arabidopsis_thaliana.TAIR10.41.gff3|sed  's#transcript:##' |sed 's#CDS:##' >Arabidopsis_thaliana.TAIR10.41.gff31
#mv Arabidopsis_thaliana.TAIR10.41.gff31 Arabidopsis_thaliana.TAIR10.41.gff3
#獲取基因與mRNA的對應(yīng)關(guān)系，后面會用到；
perl script/mRNAid_to_geneid.pl Arabidopsis_thaliana.TAIR10.41.gff3 mRNA2geneID.txt
perl script/geneid_to_mRNAid.pl Arabidopsis_thaliana.TAIR10.41.gff3 geneid2mRNAid.txt
###################################################################################333
#第一次搜索結(jié)構(gòu)域
####################################################################################
hmmsearch --domtblout WRKY_hmm_out.txt --cut_tc WRKY.hmm Arabidopsis_thaliana.TAIR10.pep.all.fa
##################################################################
#第二次搜索結(jié)構(gòu)域  可選分析
#提取結(jié)構(gòu)域序列，最后的evalue值根據(jù)實際情況可調(diào),注意腳本提取的是第一個domain，如要提取其他domain，請修改腳本27行$a[9]==1為第一個，$a[9]==2為第二個，依次類推
perl script/domain_xulie.pl WRKY_hmm_out.txt Arabidopsis_thaliana.TAIR10.pep.all.fa WRKY_domain.fa 1.2e-28
###########以下部分為建立物種特異模型再次搜索，可根據(jù)自己基因家族情況選做這部分內(nèi)容#############################
#clusterW多序列比對快捷方法
echo -e '1\nWRKY_domain.fa\n2\n1\nWRKY_domain.aln\nWRKY_domain.dnd\nX\n\n\nX\n' |clustalw
#利用比對結(jié)果建立物種特異hmm模型
hmmbuild WRKY_domain_new.hmm WRKY_domain.aln
#新建物種特異hmm模型，再次搜索
hmmsearch --domtblout WRKY_domain_new_out.txt --cut_tc WRKY_domain_new.hmm Arabidopsis_thaliana.TAIR10.pep.all.fa
############################################################################################################
#篩選EValue  <0.001 hmm搜索結(jié)果，可以用excel手動篩選，篩選標(biāo)準(zhǔn)，
#如果只想用hmmer搜索一次，可將下面的文件：WRKY_domain_new_out.txt 替換成第一次hmmer搜索生成的文件：WRKY_hmm_out.txt
grep -v "^#" WRKY_domain_new_out.txt|awk '$7<0.001 {print}' >WRKY_domain_new_out_selected.txt

##############################################################################
#一個基因有多個轉(zhuǎn)錄本，去除重復(fù)
##############################################################################
#去除重復(fù)的hmmer搜索的轉(zhuǎn)錄本ID，多個轉(zhuǎn)錄本ID保留一個作為基因的代表，此步建議對腳本輸出的文件手動篩選，挑選ID：
perl script/select_redundant_mRNA.pl mRNA2geneID.txt WRKY_domain_new_out_selected.txt WRKY_remove_redundant_IDlist.txt
#請手動挑選完mRNA的ID放在第一列，也就是挑選一個轉(zhuǎn)錄本ID代表這個基因，存成新的文件WRKY_removed_redundant_IDlist.txt：
#利用腳本得到對應(yīng)基因的蛋白序列：
perl script/get_fa_by_id.pl WRKY_removed_redundant_IDlist.txt Arabidopsis_thaliana.TAIR10.pep.all.fa WRKY_pep_need_to_confirm.fa
##################################################
#結(jié)構(gòu)域在在線數(shù)據(jù)庫中確認(rèn)
#################################################
#將上面WRKY_pep_need_to_confirm.fa文件中的蛋白序列，再手動驗證一下，把不需要的ID刪除，最終確認(rèn)的ID存成新文件：WRKY_removed_redundant_and_confirmed_IDlist.txt
#手動確認(rèn)結(jié)構(gòu)域，CDD，SMART，PFAM
#確定分子量大?。篽ttp://web.expasy.org/protparam/
#perl script/stat_protein_fa.pl WRKY_pep_need_to_confirm.fa WRKY_pep_need_to_confirm.MW.txt
#三大數(shù)據(jù)庫網(wǎng)站，篩選之后去除一些不確定的基因ID，最終得到可靠的基因家族基因列表,存儲在文件：WRKY_removed_redundant_and_confirmed_IDlist.txt ; 
##################################################################
#篩選整理數(shù)據(jù)，用于后續(xù)分析；腳本提取hmm結(jié)果文件，重新篩選一下hmm的結(jié)果,提取結(jié)構(gòu)域序列，蛋白全長，cds全長，用于后續(xù)分析
#################################################################
#get_data_by_id.pl 會讀取第一個文件的第一列ID，然后到第二個文件中去篩選對應(yīng)ID（第二個文件第一列作為ID）的數(shù)據(jù)輸出到第三個文件中
perl script/get_data_by_id.pl WRKY_removed_redundant_and_confirmed_IDlist.txt WRKY_domain_new_out_selected.txt WRKY_domain_new_out_removed_redundant.txt
#截取得到序列上的保守結(jié)構(gòu)域序列，注意腳本提取的是第一個domain，如要提取其他domain，請修改腳本27行$a[9]==1為第一個，$a[9]==2為第二個，依次類推
perl script/domain_xulie.pl WRKY_domain_new_out_removed_redundant.txt Arabidopsis_thaliana.TAIR10.pep.all.fa WRKY_domain_confirmed.fa 0.1
#得到對應(yīng)基因的蛋白序列全長，腳本會讀取第一個文件的第一列的ID信息，根據(jù)ID提取相應(yīng)的序列：
perl script/get_fa_by_id.pl WRKY_removed_redundant_and_confirmed_IDlist.txt Arabidopsis_thaliana.TAIR10.pep.all.fa WRKY_pep_confirmed.fa
#得到對應(yīng)基因的cds序列，腳本會讀取第一個文件的第一列的ID信息，根據(jù)ID提取相應(yīng)的序列：
perl script/get_fa_by_id.pl WRKY_removed_redundant_and_confirmed_IDlist.txt Arabidopsis_thaliana.TAIR10.cds.all.fa WRKY_cds_confirmed.fa

########################進(jìn)化樹分析##########################################
#cd $workdir 回到工作路徑
mkdir gene_tree_analysis
cd gene_tree_analysis
cp ../WRKY_domain_confirmed.fa .
cp ../WRKY_pep_confirmed.fa .
cp ../WRKY_cds_confirmed.fa .
cp ../WRKY_domain_new_out_removed_redundant.txt .

#########################利用meme軟件做motif分析################################33
#回到工作路徑 my_gene_family
mkdir meme_motif_analysis
cd meme_motif_analysis
#搜索結(jié)構(gòu)域：
#-nmotifs 10  搜索motif的總個數(shù)
#-minw 6   motif的最短長度
#-maxw 50   motif的最大長度
meme ../WRKY_pep_confirmed.fa -protein -oc ./ -nostatus -time 18000 -maxsize 6000000 -mod anr -nmotifs 10 -minw 6 -maxw 100

##################################基因結(jié)構(gòu)分析structure####################
#回到工作路徑 my_gene_family
mkdir gene_structure_analysis
cd gene_structure_analysis
cp ../WRKY_removed_redundant_and_confirmed_IDlist.txt .


#獲得基因的在染色體上的外顯子，CDS，UTR位置信息，用于繪制基因結(jié)構(gòu)圖
#注意腳本讀取的是第一個（-in1）文件第一列信息，里面是轉(zhuǎn)錄本ID；然后把轉(zhuǎn)錄本對應(yīng)的位置cds utr等信息篩選出來
perl ../script/get_gene_exon_from_gff.pl -in1 WRKY_removed_redundant_and_confirmed_IDlist.txt -in2 ../Arabidopsis_thaliana.TAIR10.41.gff3 -out gene_exon_info.gff

################################基因定位到染色體###############################################
# 回到工作路徑  my_gene_family
mkdir map_to_chr
cd map_to_chr
cp ../WRKY_removed_redundant_and_confirmed_IDlist.txt .    #WRKY基因家族ID列表文件

#獲得基因的在染色體上的位置信息，用于繪制染色體位置圖,注意提取的是基因位置還是mRNA位置,以下代碼是提取的 mRNA位置
perl ../script/get_gene_weizhi.pl -in1 WRKY_removed_redundant_and_confirmed_IDlist.txt -in2 ../Arabidopsis_thaliana.TAIR10.41.gff3 -out mrna_location.txt
#獲得基因組染色體長度：
samtools faidx ../Arabidopsis_thaliana.TAIR10.dna.toplevel.fa
cp ../Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.fai .
#繪圖參考：http://www.億速云.com/article/397
#######################基因上游順勢作用原件分析#######################################
#回到工作路徑 my_gene_family
mkdir gene_promoter
cd gene_promoter
cp ../WRKY_removed_redundant_and_confirmed_IDlist.txt .
#得到基因在染色體上的位置，此腳本會把基因組所有的序列讀入內(nèi)存，如果基因組較大，可能因為內(nèi)存不足使腳本運行不成功，可以分染色體分開分析：
perl ../script/get_gene_weizhi.pl -in1 WRKY_removed_redundant_and_confirmed_IDlist.txt -in2 ../Arabidopsis_thaliana.TAIR10.41.gff3 -out mrna_location.txt
#根據(jù)位置信息提取，promoter序列 1500
perl ../script/get_promoter.pl ../Arabidopsis_thaliana.TAIR10.dna.toplevel.fa mrna_location.txt promoter.fa
#######################共線性分析，基因加倍與串聯(lián)重復(fù)分析MCScanX##################################
#回到工作路徑  my_gene_family
mkdir MCScanX
cd MCScanX
mkdir mcscan
#獲取基因?qū)?yīng)的蛋白序列信息，注意：基因的第一個轉(zhuǎn)錄本為代表序列；
#從geneid2mRNAid.txt文件中每個基因挑一個轉(zhuǎn)錄本ID存儲到allmRNAID.txt（一列）
#獲取基因組基因?qū)?yīng)轉(zhuǎn)錄本的位置信息
perl ../script/get_mRNA_position.pl allmRNAID.txt ../Arabidopsis_thaliana.TAIR10.41.gff3 AT.gff
#獲取對應(yīng)蛋白序列
perl ../script/get_fa_by_id.pl allmRNAID.txt ../Arabidopsis_thaliana.TAIR10.pep.all.fa pep.fa
#blast建庫
makeblastdb -in pep.fa  -dbtype prot -title pep.fa
#blastall  比對，所有基因?qū)λ谢?
blastall -i pep.fa -d pep.fa -e 1e-10  -p blastp  -b 5 -v 5 -m 8 -o mcscan/AT.blast
cp AT.gff mcscan/AT.gff

#運行MCSCANX  查找共線性
MCScanX mcscan/AT
#下載示例文件，自己分析時需要用自己研究物種的染色體文件，和前面鑒定基因家族基因列表文件
#wget http://chibba.pgml.uga.edu/mcscan2/examples/family.ctl
#wget http://chibba.pgml.uga.edu/mcscan2/examples/MADS_box_family.txt
sed -i 's#at##g' family.ctl
#基因家族共線性繪圖，注意MCSCAN繪圖要回到：/biosoft/MCScanX/MCScanX/downstream_analyses 才能繪圖
cd /biosoft/MCScanX/MCScanX/downstream_analyses 
java family_circle_plotter -g /work/my_gene_family/MCScanX/mcscan/AT.gff -s /work/my_gene_family/MCScanX/mcscan/AT.collinearity -c /work/my_gene_family/MCScanX/family.ctl -f /work/my_gene_family/MCScanX/MADS_box_family.txt -o /work/my_gene_family/MCScanX/WRKY.circle.PNG
#找到我們分析的基因家族的共線性分析關(guān)系（大片段復(fù)制現(xiàn)象）：
cd /work/my_gene_family/MCScanX
perl /biosoft/MCScanX/MCScanX/downstream_analyses/detect_collinearity_within_gene_families.pl -i MADS_box_family.txt -d mcscan/AT.collinearity -o WRKY.collinear.pairs
#從MCSCAN分析的結(jié)果文件：AT.tandem 提取串聯(lián)重復(fù)基因可以看，：http://www.億速云.com/article/399
perl ../script/get_tandem_gene.pl -id gene_list.txt -tandem mcscan/AT.tandem -od ./ -name WRKY
########基因加倍分析繪圖，circos###############
#cd $workdir 回到工作路徑
mkdir circos
cd circos
cp ../MCScanX/mcscan/AT.collinearity .   
cp ../MCScanX/WRKY.collinear.pairs .
#準(zhǔn)備circos繪圖數(shù)據(jù)文件，腳本從gff里面獲得位置信息并整理數(shù)據(jù)
perl ../script/colline_v3.pl -gff ../MCScanX/AT.gff -list WRKY.collinear.pairs -colline AT.collinearity -od data -name WRKY
#繪圖，主要準(zhǔn)備config3.txt配置文件，以及染色體長度文件等等。
cd data
circos -conf config3.txt -outputdir ./ -outputfile WRKY

##############################復(fù)制基因kaks分析###############################################################
#回到工作路徑  my_gene_family
mkdir gene_duplication_kaks
cd gene_duplication_kaks
cp ../WRKY_removed_redundant_and_confirmed_IDlist.txt .
cp ../WRKY_cds_confirmed.fa .
echo -e "AT1G66600.1\nAT1G66560.1" >dupid.txt
perl ../script/get_fa_by_id.pl dupid.txt WRKY_cds_confirmed.fa dup_gene_paired1.fa
echo -e "1\ndup_gene_paired1.fa\n2\n1\ndup_gene_paired1.aln\ndup_gene_paired1.dnd\nX\n\n\nX\n" |clustalw
#格式轉(zhuǎn)換axt  如果遇到報錯not equal，可參考：http://www.億速云.com/article/700
AXTConvertor dup_gene_paired1.aln dup_gene_paired1.axt
KaKs_Calculator  -i dup_gene_paired1.axt -o dup_gene_paired1.kaks.result
#分離時間計算：http://www.億速云.com/question/896


##############################不同物種之間的共線性分析#############################################
# 回到工作路徑  my_gene_family
mkdir mcscan_between_genome
cd mcscan_between_genome
#獲取基因?qū)?yīng)的cds序列信息，注意：基因的第一個轉(zhuǎn)錄本為代表序列；
#從geneid2mRNAid.txt文件中每個基因挑一個轉(zhuǎn)錄本ID存儲到allCDSID.txt（一列）
#得到基因組上所有基因的位置信息，bed文件；以及cds序列;
perl ../script/get_mRNA_bed.pl -in1 ../Arabidopsis_thaliana.TAIR10.41.gff3 -in2  allCDSID.txt -out ATH.bed
#獲取基因?qū)?yīng)的cds序列
perl ../script/get_fa_by_id.pl allCDSID.txt ../Arabidopsis_thaliana.TAIR10.cds.all.fa ATH.cds

#同樣的道理準(zhǔn)備，準(zhǔn)備白菜的基因組，bed文件和，cds文件；
#處理ID：
#sed 's#gene:##' Brassica_rapa.Brapa_1.0.41.chr.gff3|sed  's#transcript:##' |sed 's#CDS:##' >Brassica_rapa.Brapa_1.0.41.chr.gff31
#mv Brassica_rapa.Brapa_1.0.41.chr.gff31 Brassica_rapa.Brapa_1.0.41.chr.gff3
perl ../script/mRNAid_to_geneid.pl Brassica_rapa.Brapa_1.0.41.chr.gff3 BrmRNA2geneID.txt
perl ../script/geneid_to_mRNAid.pl Brassica_rapa.Brapa_1.0.41.chr.gff3 Brgeneid2mRNAid.txt

#獲取白菜基因?qū)?yīng)的cds序列信息，注意：基因的第一個轉(zhuǎn)錄本為代表序列；
#從Brgeneid2mRNAid.txt文件中每個基因挑一個轉(zhuǎn)錄本ID存儲到BrallCDSID.txt（一列）
#得到白菜基因組上所有基因的位置信息，bed文件；以及cds序列;
perl ../script/get_mRNA_bed.pl -in1 Brassica_rapa.Brapa_1.0.41.chr.gff3 -in2  BrallCDSID.txt -out rapa.bed
#獲取基因?qū)?yīng)的cds序列
perl ../script/get_fa_by_id.pl BrallCDSID.txt Brassica_rapa.Brapa_1.0.cds.all.fa rapa.cds

#注意：不知道為什么，這個python版本的mcscan如果ID后面有.1 運行會不出結(jié)果，
#所以我們把擬南芥和白菜的ID統(tǒng)一都去除一下.1，這部分根據(jù)實際情況選做
sed -i 's#\.1##' rapa.cds
sed -i 's#\.1##' ATH.cds
sed -i 's#\.1##' rapa.bed
sed -i 's#\.1##' ATH.bed
python -m jcvi.compara.catalog ortholog ATH rapa --cscore=0.7
#對共線性區(qū)域進(jìn)行過濾
python -m jcvi.compara.synteny screen --minsize=0 --minspan=30 --simple ATH.rapa.anchors   ATH.rapa.anchors.new
#繪制共線性圖片：準(zhǔn)備兩個配置文件為輸入文件：
python -m jcvi.graphics.karyotype  --format=pdf  --figsize=15x5 mcscan_seqid mcscan_layout

########################結(jié)合轉(zhuǎn)錄組分析基因家族成員表達(dá)量繪制熱圖########################################
#回到工作路徑  my_gene_family
mkdir rna_seq_heatmap
cd rna_seq_heatmap
cp ../WRKY_removed_redundant_and_confirmed_IDlist.txt .
sed -i 's/\.1//' WRKY_removed_redundant_and_confirmed_IDlist.txt
perl ../script/get_data_by_id.pl WRKY_removed_redundant_and_confirmed_IDlist.txt GSE121407_fpkm.txt gene_family_fpkm.txt
###########################################以下blast為可選分析內(nèi)容########################################################################
#blastp比對尋找基因家族成員，WRKY部分
#NCBI上搜索WRKY蛋白序列：搜索條件：WRKY[title] NOT putative[title] AND plants[filter]
#參考文獻(xiàn)：https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4955-8
#回到工作路徑  my_gene_family
mkdir blast
cd blast
#blast比對首先建庫  #蛋白質(zhì)序列
makeblastdb -in WRKY_NCBI_pep.fasta -dbtype prot -title WRKY_NCBI_pep.fasta   
#
#blastp比對
blastall -i ../Arabidopsis_thaliana.TAIR10.pep.all.fa -d WRKY_NCBI_pep.fasta -p blastp -e 1e-10 -b 1 -v 1 -m 8 -o ncbi_WRKY_blast.out 

#獲得比對上的候選基因，存儲在wrky.fa文件中
perl ../script/get_fa_by_id.pl ncbi_WRKY_blast.out ../Arabidopsis_thaliana.TAIR10.pep.all.fa wrky.fa
#然后將 wrky.fa 提交到 NCBI CDD  SMART 上確認(rèn)，把不存在結(jié)構(gòu)域的基因ID可以刪除
感謝各位的閱讀，以上就是“基因家族docker鏡像怎么使用”的內(nèi)容了，經(jīng)過本文的學(xué)習(xí)后，相信大家對基因家族docker鏡像怎么使用這一問題有了更深刻的體會，具體使用情況還需要大家實踐驗證。這里是億速云，小編將為大家推送更多相關(guān)知識點的文章，歡迎關(guān)注！
向AI問一下細(xì)節(jié)
基因家族docker鏡像怎么使用

猜你喜歡

最新資訊

相關(guān)推薦

相關(guān)標(biāo)簽