Common Problems When Using Spark and How to Solve Them

Published: 2021-09-13 17:01:31  Source: 億速云  Author: chen  Category: Programming Languages

This article covers common problems encountered when using Spark and how to solve them. The explanations are kept short and practical so that each fix can be tried directly.

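1. The command used most often to debug a finished Spark job is the one that downloads its run logs.
If the program failed, download the logs and look for the specific cause. The command is: yarn logs -applicationId app_id (where app_id is the application ID assigned by YARN)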

2. Nine common Spark performance problems and their solutions

Key points to watch when tuning a Spark program: the most important are data serialization and memory tuning.

Problem 1: Unsuitable number of reduce tasks

Solution: Adjust the default to fit the workload by setting spark.default.parallelism. As a rule of thumb, set the number of reduce tasks to 2 to 3 times the total number of cores. Too many creates lots of tiny tasks and adds task launch overhead; too few makes the tasks run slowly.
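As a rough illustration of the "2 to 3 tasks per core" rule of thumb, the sketch below derives a starting value for spark.default.parallelism from the cluster size. The helper name parallelism_conf and the example cluster (10 executors with 4 cores each) are assumptions made for this illustration, not values from the article.

```python
def parallelism_conf(executors, cores_per_executor, tasks_per_core=3):
    """Return a starting value for spark.default.parallelism.

    tasks_per_core follows the rule of thumb of 2 to 3 tasks per core;
    treat the result as a first guess to tune, not a fixed answer.
    """
    if executors < 1 or cores_per_executor < 1 or tasks_per_core < 1:
        raise ValueError("all arguments must be positive integers")
    total_cores = executors * cores_per_executor
    return {"spark.default.parallelism": str(total_cores * tasks_per_core)}


# Hypothetical cluster: 10 executors with 4 cores each.
conf = parallelism_conf(executors=10, cores_per_executor=4)
print(conf)  # {'spark.default.parallelism': '120'}
```

The resulting key and value can be passed to the job in whatever way the deployment already sets Spark properties.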

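Problem 2: Shuffle disk I/O takes a long time

Solution: Point spark.local.dir at multiple disks (a comma-separated list of directories), preferably fast ones, so that shuffle I/O is spread across more devices.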

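Problem 3: Large numbers of map and reduce tasks produce many small shuffle files

Solution: With the legacy hash-based shuffle, the number of shuffle files is map tasks * reduce tasks. Setting spark.shuffle.consolidateFiles to true merges the intermediate shuffle files, which brings the count down to roughly (concurrently running cores) * (reduce tasks). This setting only exists in old Spark 1.x releases; later versions use sort-based shuffle, which writes a single data file (plus an index) per map task.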

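Problem 4: Serialization is slow and produces large results

Solution: By default Spark uses the JDK's built-in ObjectOutputStream (Java serialization), which produces large output and costs a lot of CPU time. Switch serializers by setting spark.serializer to org.apache.spark.serializer.KryoSerializer. In addition, when a large read-only object is needed by many tasks, ship it as a broadcast variable rather than serializing it into every task.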
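One way to apply the serializer setting without touching application code is to pass it on the command line with spark-submit's --conf flag. The sketch below only builds and prints such a command; the helper to_submit_flags and the script name my_job.py are hypothetical names used for illustration.

```python
import shlex


def to_submit_flags(conf):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    flags = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


kryo_conf = {"spark.serializer": "org.apache.spark.serializer.KryoSerializer"}

# my_job.py is a placeholder for the real application script or jar.
command = ["spark-submit", *to_submit_flags(kryo_conf), "my_job.py"]
print(shlex.join(command))
```

Building the argument list rather than a single string keeps quoting safe if a property value ever contains spaces.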

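Problem 5: High per-record overhead

Solution: Replace map with mapPartitions. mapPartitions calls the function once per partition, whereas map calls it once for every record in the partition, so any per-call setup cost is paid once per partition instead of once per record.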
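The difference can be seen without a cluster. The following plain-Python sketch mimics the calling convention of map (one call per record) and mapPartitions (one call per partition, receiving an iterator) and counts how often an expensive resource is created. FakeConnection, the two functions, and the sample partitions are all made up for this illustration; no Spark API is called.

```python
class FakeConnection:
    """Stand-in for an expensive resource such as a database connection."""

    opened = 0  # counts how many connections were created

    def __init__(self):
        FakeConnection.opened += 1

    def write(self, record):
        return f"stored:{record}"


def per_record(record):
    # Shape of a function passed to map(): invoked once per record.
    conn = FakeConnection()
    return conn.write(record)


def per_partition(records):
    # Shape of a function passed to mapPartitions(): invoked once per
    # partition with an iterator, and returning an iterator.
    conn = FakeConnection()
    for record in records:
        yield conn.write(record)


partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

FakeConnection.opened = 0
out_map = [per_record(r) for part in partitions for r in part]
opened_by_map = FakeConnection.opened

FakeConnection.opened = 0
out_map_partitions = [r for part in partitions for r in per_partition(iter(part))]
opened_by_map_partitions = FakeConnection.opened

print(opened_by_map, opened_by_map_partitions)  # 9 3
```

Both variants produce the same output records; only the number of connection setups differs.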

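Problem 6: collect is slow when it returns a large result

Solution: In its source code, collect gathers all results into a single Array held in memory. Write the output directly to a distributed file system instead, then inspect the files there.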

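Problem 7: Skewed task execution times

Solution: If the cause is data skew, the partition key is usually poorly chosen; consider a different way of parallelizing the work and add an intermediate aggregation step. If the cause is worker skew, for example executors on certain workers running slowly, set spark.speculation=true so that persistently slow tasks are re-launched speculatively on other nodes.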
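One common way to add such an intermediate aggregation step is to "salt" the keys: append a random suffix so that a single hot key is spread over several groups, aggregate per salted key, then drop the salt and aggregate again. The sketch below is plain Python that only mimics the idea on a list; the function name, bucket count, and sample data are illustrative assumptions, and in a real job the two stages would be two separate key-based aggregations.

```python
import random
from collections import Counter


def two_stage_count(keys, salt_buckets=4, seed=0):
    """Count keys in two stages so that no single group holds a whole hot key."""
    rng = random.Random(seed)
    # Stage 1: aggregate on (key, salt); a hot key is split across salt buckets.
    partial = Counter((key, rng.randrange(salt_buckets)) for key in keys)
    # Stage 2: drop the salt and merge the partial counts per original key.
    final = Counter()
    for (key, _salt), count in partial.items():
        final[key] += count
    return partial, final


# Hypothetical skewed input: one key dominates.
keys = ["hot"] * 1000 + ["a", "b", "c"]
partial, final = two_stage_count(keys)

print(final["hot"])           # 1000
print(max(partial.values()))  # size of the largest stage-1 group, below 1000
```

The final counts match a direct count, while the largest group handled in the first stage is only a fraction of the hot key's records.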

Problem 8: Many empty or tiny tasks after a multi-step chain of RDD operations

Solution: Use coalesce or repartition to reduce the number of partitions in the RDD.

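Problem 9: Low Spark Streaming throughput

Solution: Consider setting spark.streaming.concurrentJobs, which controls how many streaming jobs may run concurrently (the default is 1).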

3. Building the Spark source directly in IntelliJ IDEA, and fixing the problems that come up:

http://blog.csdn.net/tanglizhe1105/article/details/50530104

http://stackoverflow.com/questions/18920334/output-path-is-shared-between-the-same-module-error

Spark build goals: clean package -Dmaven.test.skip=true

VM options: -Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m

4. Build error after importing the Spark source code into IntelliJ:

not found: type SparkFlumeProtocol and EventBatch

http://stackoverflow.com/questions/33311794/import-spark-source-code-into-intellj-build-error-not-found-type-sparkflumepr


5. org.apache.spark.SparkException: Exception thrown in awaitResult

Set spark.sql.broadcastTimeout to a larger value (in seconds; the default is 300) to increase the broadcast timeout.

6. Building and installing Apache Zeppelin:

Apache Zeppelin installation grunt build error:

Solution: Go into the web module and run npm install.

http://stackoverflow.com/questions/33352309/apache-zeppelin-installation-grunt-build-error?rq=1

7. Problems encountered when compiling the Spark source: http://www.tuicool.com/articles/NBVvai

Out of memory: this error occurs because the build ran out of memory. Give Maven more memory when compiling.

[ERROR] PermGen space -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

8. Exception in thread "main" java.lang.UnsatisfiedLinkError: no jnind4j in java.library.path

Solution: If you are using 64-bit Java on Windows and still get the "no jnind4j in java.library.path" error, you may have incompatible DLLs on your PATH. To tell DL4J to ignore them, add the following as a VM parameter (Run -> Edit Configurations -> VM Options in IntelliJ): -Djava.library.path=""

9. Fixing errors when running the Spark 2.0 source locally:

In the relevant pom, change the scope of the dependency jars from provided to compile.

Before running a class, remove the Make step from the run configuration, and add -Dspark.master=local to the VM options.

Running the Spark example code on Windows 7 fails with:

java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file […]:/SourceCode/spark-2.0.0/spark-warehouse

To fix it, edit the WAREHOUSE_PATH variable in the SQLConf class and change the file: prefix to file:/ or file:///

createWithDefault("file:/${system:user.dir}/spark-warehouse")

To run in local mode: -Dspark.master=local

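10. Fixing the Task not serializable exception

Method 1: When writing all the data in an RDD to a database over JDBC, using map may create a connection for every element, which is very expensive. With mapPartitions, a connection only needs to be created once per partition, inside the function, so it never has to be serialized from the driver. mapPartitions returns an Iterator.

Method 2: Mark fields that cannot be serialized with @transient, so that they are skipped when the enclosing object is serialized for network transfer.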
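@transient is a Scala/Java mechanism. In PySpark, functions and the objects they capture are shipped with pickle-based serialization, so a comparable approach there is to leave the unserializable field out of the pickled state and rebuild it after unpickling. The sketch below shows this with the standard pickle module only; the Scorer and TransientScorer classes are hypothetical, and a threading.Lock stands in for any field that cannot be serialized.

```python
import pickle
import threading


class Scorer:
    """Holds a lock, which pickle cannot serialize."""

    def __init__(self, weight):
        self.weight = weight
        self._lock = threading.Lock()

    def score(self, x):
        with self._lock:
            return x * self.weight


class TransientScorer(Scorer):
    """Same class, but the lock is treated like a transient field."""

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("_lock", None)  # skip the unserializable field
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._lock = threading.Lock()  # rebuild it on the receiving side


try:
    pickle.dumps(Scorer(2))
    plain_is_picklable = True
except TypeError:
    plain_is_picklable = False

restored = pickle.loads(pickle.dumps(TransientScorer(2)))
print(plain_is_picklable, restored.score(21))  # False 42
```

The serializable fields travel with the object, and the excluded field is recreated where the object is used.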

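11. Consider a Scala function func whose parameter is a String: calling func("11") works, but calling func(11) or func(1.1) fails with error: type mismatch. This is easy to fix.

Overload func for each parameter type. This is the traditional Java approach and is not hard, but it requires defining several functions.

Use a supertype such as AnyVal or Any as the parameter type. This is more cumbersome, because the function body must then convert the type for each specific case before processing the value.

Both of the approaches above follow traditional Java thinking. They solve the problem but are not concise. Scala, with its rich syntactic sugar, offers implicit conversions specifically for this kind of type conversion.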

12. org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle

Solution: This usually happens in jobs with heavy shuffle activity: tasks keep failing and being re-executed in a loop until the application fails. Increasing executor memory generally fixes it. Increase the number of CPU cores per executor at the same time so that task parallelism does not drop.

13. Spark ML Pipeline GBT/RF fails at prediction time with java.util.NoSuchElementException: key not found: 8.0

Cause: The column names passed to the GBT/RF model through setFeaturesCol and setLabelCol are inconsistent.

Solution: Save only the trained algorithm model, not the PipelineModel.

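14. Deleting a file with a garbled name on Linux. Step 1: run ls -lai to find the file's inode number (the -i flag adds the inode column). Step 2: run find . -inum <inode number> -exec rm -rf {} \;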
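The same inode-based approach can be scripted when the file name is awkward to type in a shell. The sketch below is a minimal version using only the Python standard library; the helper name remove_by_inode and the demo file names are assumptions for illustration, and the demo runs inside a temporary directory so nothing real is deleted.

```python
import os
import tempfile


def remove_by_inode(directory, inode):
    """Delete regular files directly inside `directory` with the given inode."""
    removed = []
    for entry in os.scandir(directory):
        if entry.is_file(follow_symlinks=False) and entry.inode() == inode:
            os.remove(entry.path)
            removed.append(entry.name)
    return removed


# Demo in a throwaway directory with two hypothetical files.
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "garbled-name.txt")
    open(target, "w").close()
    open(os.path.join(workdir, "keep.txt"), "w").close()

    inode = os.stat(target).st_ino  # the number ls -i would show
    removed = remove_by_inode(workdir, inode)
    remaining = sorted(os.listdir(workdir))

print(removed, remaining)  # ['garbled-name.txt'] ['keep.txt']
```

Unlike find, this helper does not recurse into subdirectories, which keeps the deletion scoped to one directory.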

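15. Caused by: java.lang.RuntimeException: Failed to commit task Caused by: org.apache.spark.executor.CommitDeniedException: attempt_201603251514_0218_m_000245_0: Not committed because the driver did not authorize commit

If you understand how Spark divides a job into stages, this problem is fairly simple. The tasks in one stage are too large, usually because the chain of transformations is too long, so the tasks the driver ships to the executors become very large. The fix is to split the stage: partway through the job, call cache followed by count to materialize some intermediate data and cut the overly long stage short. Note that this message can also appear when speculative execution or a task retry produces a second attempt for the same partition, because the driver authorizes only one attempt to commit.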

Thanks for reading. That covers the common problems when using Spark and how to solve them; how well each fix applies to a particular setup still needs to be verified in practice.
