How to Use GPU Acceleration in Spark 3.0

Published: 2021-12-17 10:45:19 | Source: 億速云 | Author: 柒染 | Column: Big Data

This article covers how to use GPU acceleration in Spark 3.0. Many readers may not be familiar with the topic, so the key points are summarized below.

Overview

RAPIDS Accelerator for Apache Spark uses GPUs to accelerate data processing, building on the RAPIDS libraries.

As data scientists move from traditional data analytics to AI applications to meet complex market demands, traditional CPU-based processing can no longer meet the requirements for speed and cost. Fast-growing AI analytics needs a new framework that processes data quickly and saves cost, and GPUs are the way to reach that goal.

RAPIDS Accelerator for Apache Spark combines the RAPIDS cuDF library with the Spark distributed computing framework. The RAPIDS Accelerator library also has a built-in accelerated shuffle based on UCX, which can be configured for GPU-to-GPU communication and RDMA capabilities.

Spark RAPIDS Download v0.4.1

RAPIDS Spark Package
cuDF 11.0 Package
cuDF 10.2 Package
cuDF 10.1 Package

RAPIDS Notebooks

cuML Notebooks
cuGraph Notebooks
CLX Notebooks
cuSpatial Notebooks
cuxfilter Notebooks
XGBoost Notebooks

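Introduction

These notebooks provide examples of how to use RAPIDS. They are designed to be self-contained with the runtime version of the RAPIDS Docker Container and the RAPIDS Nightly Docker Containers, and they can run on air-gapped systems. You can get a container quickly and then install and use it by following the RAPIDS.ai Getting Started page.

Usage

To get the latest notebook repo updates, run ./update.sh or use the command: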

git submodule update --init --remote --no-single-branch --depth 1

Download the CUDA Installer for Linux Ubuntu 20.04 x86_64

The base installation is as follows:

Base Installer

Installation instructions:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu2004-11-1-local_11.1.0-455.23.05-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-1-local_11.1.0-455.23.05-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-1-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda

The CUDA Toolkit contains open-source project software, which can be found here.
Checksums for the installer and patches can be found in Installer Checksums.
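
Checking a downloaded installer against its published checksum can be scripted. The following is a minimal sketch, not an official NVIDIA tool: the function names are hypothetical, and it assumes the published value is an MD5 hex digest (pass a different `algorithm` if the Installer Checksums page uses another one).

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected_hex, algorithm="md5"):
    """Compare a file's digest with a published value, ignoring case and whitespace."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

To use it, call `matches_published_checksum` with the path of the downloaded .deb file and the value copied from the Installer Checksums page; a `False` result means the download should be repeated.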

Performance & Cost Benefits

RAPIDS Accelerator for Apache Spark takes advantage of GPU performance while lowering cost:

[Figure: performance and cost comparison]

*ETL for FannieMae Mortgage Dataset (~200GB) as shown in our demo. Costs based on Cloud T4 GPU instance market price & V100 GPU price on Databricks Standard edition.

Easy to Use

Existing Apache Spark applications run without code changes. Launch Spark with the RAPIDS Accelerator for Apache Spark plugin jar, then turn on the configuration as follows:

spark.conf.set('spark.rapids.sql.enabled','true')

Physical plan with operators running on the GPU

A Unified AI Framework for ETL + ML/DL

A single pipeline runs from data preparation to model training.

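Getting Started with the RAPIDS Accelerator for Apache Spark

Apache Spark 3.0+ lets users provide a plugin that can replace the backend for SQL and DataFrame operations. No API changes are required: the plugin replaces SQL operations with GPU-accelerated versions. If an operation does not support GPU acceleration, it falls back to the Spark CPU version.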

Note: the plugin cannot accelerate operations that manipulate RDDs directly.
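
The per-operator fallback described above can be pictured with a toy model. This is purely illustrative and is not how the plugin is implemented: the operator names and the `GPU_SUPPORTED` set are hypothetical, and the real set of accelerated operations depends on the plugin version.

```python
# Hypothetical set of operators that have a GPU implementation. The real
# plugin decides this internally, and the actual list depends on its version.
GPU_SUPPORTED = {"Project", "Filter", "HashAggregate"}


def place_operators(plan, gpu_supported=GPU_SUPPORTED):
    """Tag each operator of a physical plan with the device it would run on."""
    return [(op, "GPU" if op in gpu_supported else "CPU") for op in plan]


if __name__ == "__main__":
    # "CustomPythonUDF" stands in for any operation without a GPU version.
    plan = ["Filter", "Project", "CustomPythonUDF", "HashAggregate"]
    for op, device in place_operators(plan):
        print(f"{op:16s} -> {device}")
```

The point of the sketch is that the decision is made operator by operator, so one unsupported operation does not move the whole query back to the CPU.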

The accelerator library also provides an implementation of Spark's shuffle that can use UCX to optimize GPU data transfers, keeping as much data on the GPU as possible and bypassing the CPU to do GPU-to-GPU transfers.

The GPU-accelerated processing plugin does not require the accelerated shuffle implementation. However, if accelerated SQL processing is not enabled, the shuffle implementation falls back to the default SortShuffleManager.

To enable GPU-accelerated processing, you need:

Apache Spark 3.0+
A spark cluster configured with GPUs that comply with the requirements for the version of cudf.

One GPU per executor.

The following jars:

A cudf jar that corresponds to the version of CUDA available on your cluster.
RAPIDS Spark accelerator plugin jar.

The config spark.plugins set to com.nvidia.spark.SQLPlugin.
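
The requirements above can be combined into a single launch command. The helper below is a minimal sketch that only assembles a spark-submit argument list; the function name and the jar and application file names are placeholders, while the config keys are the ones named in this article.

```python
import shlex


def rapids_submit_args(cudf_jar, plugin_jar, app, gpus_per_executor=1):
    """Build a spark-submit argument list that loads and enables the plugin."""
    conf = {
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
    }
    # --jars takes a comma-separated list of jar paths.
    args = ["spark-submit", "--jars", f"{cudf_jar},{plugin_jar}"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args + [app]


if __name__ == "__main__":
    # File names are placeholders, not real artifact names.
    print(shlex.join(rapids_submit_args("cudf.jar", "rapids-4-spark.jar", "etl_job.py")))
```

Building the command as a list rather than a string keeps quoting correct when paths contain spaces, and the list can be passed to `subprocess.run` unchanged.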

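Spark GPU Scheduling Overview

Apache Spark 3.0 now supports GPU scheduling, provided the cluster manager supports it. You can have Spark request GPUs and assign them to tasks. The exact configuration depends on the cluster manager. Here are some examples: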

Request your executor to have GPUs:

--conf spark.executor.resource.gpu.amount=1

Specify the number of GPUs per task:

--conf spark.task.resource.gpu.amount=1

Specify a GPU discovery script (required on YARN and K8S):

--conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh

See the documentation for your deployment to determine the exact method and its limitations.
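
A discovery script has to print a JSON object with a resource name and a list of addresses to standard output. The sketch below shows that format in Python rather than shell. It assumes `nvidia-smi` is on the PATH of the executor host, and the function names are hypothetical.

```python
import json
import subprocess


def gpu_resource_json(smi_output, name="gpu"):
    """Turn one-GPU-index-per-line text into the JSON Spark reads from stdout."""
    addresses = [line.strip() for line in smi_output.splitlines() if line.strip()]
    return json.dumps({"name": name, "addresses": addresses})


def query_gpu_indices():
    """Ask nvidia-smi for the GPU indices visible on this host (needs a GPU host)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Sample input standing in for nvidia-smi output on a two-GPU host.
    print(gpu_resource_json("0\n1\n"))
```

On a real executor host the script body would print `gpu_resource_json(query_gpu_indices())` instead of the sample input.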

Note that spark.task.resource.gpu.amount can be a decimal. If you want multiple tasks to run on an executor at the same time and be assigned to the same GPU, set it to a value less than 1, chosen to match the spark.executor.cores setting. For example, spark.executor.cores=2 allows 2 tasks to run on each executor; if you want both tasks to run on the same GPU, set spark.task.resource.gpu.amount=0.5.
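
The arithmetic behind this setting can be written down as a small helper. This is a sketch with a hypothetical function name. It assumes the number of concurrent tasks per executor is spark.executor.cores divided by spark.task.cpus, where spark.task.cpus defaults to 1 and is not discussed in this article.

```python
def gpu_amount_per_task(executor_cores, gpus_per_executor=1, task_cpus=1):
    """Return a spark.task.resource.gpu.amount that lets all concurrent tasks share the GPUs."""
    concurrent_tasks = executor_cores // task_cpus
    if concurrent_tasks < 1:
        raise ValueError("executor_cores must be at least task_cpus")
    return gpus_per_executor / concurrent_tasks


if __name__ == "__main__":
    # Two cores and one GPU per executor, as in the example above.
    print(f"--conf spark.task.resource.gpu.amount={gpu_amount_per_task(2)}")
    # prints: --conf spark.task.resource.gpu.amount=0.5
```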

