Carson Wang leads the high-performance data analytics R&D team at Intel, focusing on developing and optimizing open-source big data and distributed machine learning frameworks, and on building solutions that converge big data and AI. He currently leads open-source projects including RayDP (Spark on Ray) and OAP MLlib, a high-performance library of Spark machine learning algorithms. Previously, he led the development of the Spark SQL adaptive execution engine and HiBench, a big data benchmarking suite.
1. Building End-to-End Big Data Analytics and AI Applications with RayDP (Spark on Ray). Carson Wang, Intel
2. Agenda: Big Data & AI Background; RayDP Overview; RayDP Deep Dive; RayDP Examples
3. Big Data & AI Background
4. Big Data & AI: Massive data is critical for better AI, and distributed training will be the norm. There are many community efforts to integrate big data with AI, including Horovod on Spark, Petastorm, XGBoost on Spark, spark-tensorflow-connector, TensorFlowOnSpark, spark-tensorflow-distributor, BigDL, Analytics Zoo, and CaffeOnSpark.
5. Separate Spark and AI Clusters: data preprocessing runs on a Spark cluster, model training runs on a separate ML/DL cluster, and data is exchanged through shared storage. Challenges: data movement between the clusters, the overhead of managing two clusters, and a segmented application held together by glue code.
6. Running ML/DL Frameworks on Spark: both data preprocessing and model training run on a single Spark cluster. Challenges: the approach is specific to Spark and requires the ML/DL frameworks to be supported on Spark, and data exchange between frameworks relies on distributed file systems like HDFS or S3.
7. Running on Kubernetes: data preprocessing and model training run as separate workloads on a Kubernetes cluster. Challenges: the pipeline must be written as multiple programs and configuration files (versus a single Python program), and data exchange between frameworks relies on distributed file systems like HDFS or S3.
8. RayDP Overview
9. What is RayDP: RayDP provides simple APIs for running Spark on Ray and for integrating Spark with distributed ML/DL frameworks. Its main components, built on top of Ray and the Ray libraries, are Spark on Ray, the PyTorch/TensorFlow Estimator, and the Ray MLDataset converter.
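The Spark-on-Ray piece is a single call that starts a Spark cluster inside an existing Ray instance and hands back an ordinary SparkSession. A minimal sketch, assuming a local Ray instance; the application name and executor sizes are illustrative:

    import ray
    import raydp

    ray.init()  # start or connect to a local Ray instance

    # Start Spark inside Ray; each executor becomes a Ray Java actor.
    spark = raydp.init_spark(
        app_name="raydp_example",   # illustrative name
        num_executors=2,
        executor_cores=2,
        executor_memory="4GB")

    # From here on, spark is a regular PySpark SparkSession.
    df = spark.range(0, 1000)
    print(df.count())

    raydp.stop_spark()   # tear down the Spark executors (Ray actors)
    ray.shutdown()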
10. Build an End-to-End Pipeline Using RayDP and Ray: data preprocessing, model training/tuning, and model serving all live in a single integrated Python program and exchange data through the Ray object store, as the toy sketch below illustrates.
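A toy sketch of the in-memory hand-off between pipeline stages; it uses plain Ray tasks and the object store only to illustrate the idea (RayDP's actual Spark-to-PyTorch hand-off, the MLDataset converter, is covered in the deep dive):

    import ray

    ray.init()

    # "Preprocessing" output goes into the object store instead of HDFS/S3.
    preprocessed = ray.put([[0.1, 0.2], [0.3, 0.4]])

    @ray.remote
    def train(data):
        # Ray materializes the object on whichever node runs this task.
        return sum(sum(row) for row in data)

    # The "training" stage reads the data straight from the object store.
    print(ray.get(train.remote(preprocessed)))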
11. Scale From Laptop to Cloud/Kubernetes Seamlessly: your Python program, written with the Ray, RayDP, PySpark, TensorFlow, PyTorch, etc. APIs, can be developed locally and then scaled out to the cloud or Kubernetes with the Ray cluster launcher.
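In practice this usually means developing against a local ray.init() and later attaching the same script to a launched cluster; a sketch, where cluster.yaml is a hypothetical Ray cluster launcher config for your cloud or Kubernetes environment:

    # Provision the cluster and run the unchanged program on it:
    #   ray up cluster.yaml
    #   ray submit cluster.yaml pipeline.py
    import ray

    # Attach to the running cluster started by the launcher; during local
    # development, a plain ray.init() starts Ray on the laptop instead.
    ray.init(address="auto")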
12. Why RayDP. Increased productivity: simplify how end-to-end pipelines are built and managed, and write Spark, XGBoost, TensorFlow, PyTorch, and Horovod code in a single Python program. Better performance: in-memory data exchange and built-in Spark optimizations. Increased resource utilization: auto scaling at both the cluster level and the application level.
13. RayDP Deep Dive
15. Spark on Ray Architecture: the Spark driver talks to a Spark AppMaster that runs as a Ray Java actor and starts/stops the Spark executors; all Spark executors likewise run as Ray Java actors on the worker nodes. Spark and other Ray libraries exchange data through the per-node object store (Raylet), and Ray's GCS, web UI, debugging tools, and profiling tools remain available.
16. PyTorch/TensorFlow Estimator: create an estimator with parameters like the model, optimizer, loss function, etc., then fit it with Spark dataframes directly.

    from raydp.torch import TorchEstimator

    estimator = TorchEstimator(
        num_workers=2,
        model=your_model,
        optimizer=optimizer,
        loss=criterion,
        feature_columns=features,
        label_column="fare_amount",
        batch_size=64,
        num_epochs=30)

    estimator.fit_on_spark(train_df, test_df)
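The snippet above assumes the model, optimizer, loss, feature list, and Spark dataframes already exist; a sketch of those pieces, using a simple feed-forward regressor and illustrative column names (the fare_amount label comes from the slide, the feature names do not):

    import torch

    features = ["trip_distance", "passenger_count", "pickup_hour"]   # illustrative
    your_model = torch.nn.Sequential(
        torch.nn.Linear(len(features), 16),
        torch.nn.ReLU(),
        torch.nn.Linear(16, 1))
    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(your_model.parameters(), lr=1e-3)

    # train_df / test_df are the Spark dataframes produced by preprocessing, e.g.:
    # train_df, test_df = spark_df.randomSplit([0.8, 0.2])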
17. Ray MLDataset Converter: create an MLDataset from a Spark dataframe, in-memory objects, etc., transform it with user-defined functions, and convert it to a PyTorch/TensorFlow dataset.

    from raydp.spark import RayMLDataset
    from torch.utils.data import DataLoader

    spark_df = …
    torch_ds = (RayMLDataset
        .from_spark(spark_df, …)
        .transform(func)
        .to_torch(…))
    torch_dataloader = DataLoader(torch_ds.get_shard(shard_index), …)

Operations: 1. from_spark, 2. transform + to_torch. In the planning phase these calls only build an operation graph; in the execution phase the transformations are lazy and run as a pipeline. Under the hood, MLDataset shards are held as objects in the Ray object store, produced by Spark actors and consumed by PyTorch actors via the Ray scheduler.
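Once a shard is wrapped in a DataLoader, each training worker consumes it like any other PyTorch iterable; a generic sketch, assuming the model, optimizer, and loss from the estimator slide and that each batch yields a (features, label) pair:

    # One worker's training loop over its shard; purely illustrative.
    num_epochs = 30   # matches the estimator slide
    your_model.train()
    for epoch in range(num_epochs):
        for feature_batch, label_batch in torch_dataloader:
            optimizer.zero_grad()
            loss = criterion(your_model(feature_batch), label_batch)
            loss.backward()
            optimizer.step()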