Extending sparklyr to Compute Cost for K-means on YARN Cluster with Spark ML Library

Machine and statistical learning wizards are becoming more and more eager to perform analysis with the Spark library whenever it is possible. It's trendy, posh, spicy and gives the feeling of doing state-of-the-art machine learning and being up to date with the newest computational trends. It is even more powerful when the computations can be performed on an extraordinarily large cluster - let's say 100 machines on a YARN cluster makes you a real data cruncher! In this post I present the sparklyr package (by RStudio), the connector that will transform you from a regular R user into a data scientist who can invoke Scala code to run machine learning algorithms on a YARN cluster straight from RStudio! Moreover, I present how I have extended the interface to the K-means procedure, so that it is now also possible to compute the cost for that model, which can be helpful in determining the number of clusters in segmentation problems. Thinking about learning Scala? Leave it - use sparklyr!

If you don't know much about Spark yet, you can read my April post, where I explained how we could use the SparkR package that is distributed with Spark. Many things (and much code) might have changed since then, due to the rapid development driven by Spark's great popularity. Now we can use version 2.0.0 of Spark. If you are migrating from previous versions, I suggest you look at the migration guide.

sparklyr basics

This package is built on a lower-level package that makes it possible to run Spark applications locally or on a YARN cluster directly from R, translating R code into an invocation of spark-shell. Its biggest advantages are the interface for working with Spark DataFrames (which might be Hive tables) and the possibility to invoke algorithms from the Spark ML library.

Installation of sparklyr, then of Spark itself, and a simple application initiation are shown in this code

library(devtools)
install_github('rstudio/sparklyr')   # install sparklyr from GitHub
library(sparklyr)
spark_install(version = "2.0.0")     # install Spark 2.0.0

sc <- spark_connect(
  master = "yarn",
  config = list(
    default = list(
      spark.submit.deployMode = "client",
      spark.executor.instances = 20,
      spark.executor.memory = "2G",
      spark.executor.cores = 4,
      spark.driver.memory = "4G")))

One does not have to specify the config at all, but if that is desired, remember that you can also specify the parameters for the Spark application in configuration files, so that you can benefit from multiple profiles (development, production). In version 2.0.0 the master should be named yarn instead of yarn-client, with the deploy mode passed via the spark.submit.deployMode parameter, which differs from version 1.6.x. All available parameters can be found in the official Spark documentation.
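As a sketch of that approach (I'm assuming sparklyr's support for config.yml files via the config package here; the file contents and profile names below are illustrative, not taken from this post), the connection can pick its settings from the active profile:

# Illustrative config.yml (not from the post):
#
# default:
#   spark.submit.deployMode: "client"
#   spark.executor.instances: 4
#   spark.executor.memory: "1G"
# production:
#   spark.submit.deployMode: "client"
#   spark.executor.instances: 20
#   spark.executor.memory: "2G"

library(sparklyr)

Sys.setenv(R_CONFIG_ACTIVE = "production")   # choose which profile to use
sc <- spark_connect(master = "yarn", config = spark_config(file = "config.yml"))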

dplyr and DBI interface on Spark

When connecting to YARN, it is quite probable that you would like to use data tables that are stored in Hive. Remember that

Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.

where conf/ is the directory set as HADOOP_CONF_DIR. You can read more about using Hive tables from Spark in the Spark SQL documentation.

If everything is set up and the application runs properly, you can use the dplyr interface, which provides lazy evaluation for data manipulations. The data are stored in Hive, the Spark application runs on the YARN cluster, and the code is invoked from R in the simple language of data transformations (dplyr) - all thanks to the sparklyr team's great work! An easy example is below

library(dplyr)

# give the list of tables
src_tbls(sc)

# copy iris from R to Hive
iris_tbl <- copy_to(sc, iris, "iris")

# create a hook for data stored on Hive
data_tbl <- tbl(sc, "table_name")
data_tbl2 <- tbl(sc, sql("SELECT * from table_name"))

You can also perform any dplyr operation on datasets used by Spark

iris_tbl %>%
  select(Petal_Length, Petal_Width) %>%
  top_n(40, Petal_Width) %>%
  arrange(Petal_Length)

Note that the original dots in the iris column names have been translated to _.
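Because dplyr evaluates these pipelines lazily on Spark, nothing is materialized in R until you ask for it; collect() pulls the (preferably already aggregated) result back into a local data frame. A minimal sketch, reusing the iris_tbl created above:

library(dplyr)

# The grouping and aggregation run on Spark; only the small summary comes back to R.
iris_summary <- iris_tbl %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal_Length)) %>%
  collect()

iris_summary   # a regular local data frame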

This package also provides an interface to the functions defined in the DBI package

library(DBI)

dbListTables(sc)                      # list the tables Spark can see
dbGetQuery(sc, "use database_name")   # switch to a Hive database
data_tbl3 <- dbGetQuery(sc, "SELECT * from table_name")
dbListFields(sc, "table_name")        # dbListFields expects a table name

Running Spark ML Machine Learning K-means Algorithm from R

A basic example of how sparklyr invokes Scala code from Spark ML will be presented on the K-means algorithm. If you check the code of the sparklyr::ml_kmeans function, you will see that for an input tbl_spark object, named x, and a character vector containing the features' names (features)

envir <- new.env(parent = emptyenv())
df <- spark_dataframe(x)
sc <- spark_connection(df)
df <- ml_prepare_features(df, features)
tdf <- ml_prepare_dataframe(df, features, ml.options = ml.options, envir = envir)

sparklyr ensures that you have a proper connection to a Spark data frame and prepares the features in a convenient form and naming convention. At the end it builds a Spark DataFrame ready for the Spark ML routines.

This is done in a new environment, so that the arguments for the upcoming ML algorithm, and the model itself, are stored in their own environment. This is a safe and clean solution. You can construct a simple model by calling a Spark ML class like this

envir$model <- "org.apache.spark.ml.clustering.KMeans"
kmeans <- invoke_new(sc, envir$model)

which creates a new object of the KMeans class, on which we can invoke parameter setters to change the default parameters, like this

model <- kmeans %>%
  invoke("setK", centers) %>%
  invoke("setMaxIter", iter.max) %>%
  invoke("setTol", tolerance) %>%
  invoke("setFeaturesCol", envir$features)   # features were set in ml_prepare_dataframe

For an existing object of the KMeans class we can invoke its fit method, which is responsible for running the K-means clustering algorithm

fit <- model %>% invoke("fit", tdf)

which returns a new object on which we can compute, e.g., the centers of the resulting clustering

kmmCenters <- invoke(fit, "clusterCenters")

or the Within Set Sum of Squared Errors (called the cost), which is my small contribution

kmmCost <- invoke(fit, "computeCost", tdf)

This sometimes helps to decide on the number of clusters to use, and it is presented in the print method for the ml_model_kmeans object

iris_tbl %>%
  select(Petal_Width, Petal_Length) %>%
  ml_kmeans(centers = 3, compute.cost = TRUE) %>%
  print()

K-means clustering with 3 clusters

Cluster centers:
  Petal_Width Petal_Length
1    1.359259     4.292593
2    2.047826     5.626087
3    0.246000     1.462000

Within Set Sum of Squared Errors = 31.41289
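Since compute.cost exposes the Within Set Sum of Squared Errors, one natural use is to fit the model for several values of k and watch how the cost decreases (the classic elbow heuristic). Below is a minimal sketch; the loop, the object names, and the assumption that the fitted ml_model_kmeans object stores the value in a cost element are mine, not taken from sparklyr's documentation:

features_tbl <- iris_tbl %>%
  select(Petal_Width, Petal_Length)

# Fit K-means for a range of cluster counts and keep the cost of each fit.
ks <- 2:8
costs <- sapply(ks, function(k) {
  fit <- ml_kmeans(features_tbl, centers = k, compute.cost = TRUE)
  fit$cost   # assumed element name; inspect str(fit) in your sparklyr version
})

# A pronounced "elbow" in this curve suggests a sensible number of clusters.
plot(ks, costs, type = "b",
     xlab = "number of clusters",
     ylab = "Within Set Sum of Squared Errors")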

All of that can be better understood if we have a look at the underlying documentation (where methods and parameters have different names than those in Spark ML). This enabled me to provide a simple update for ml_kmeans(), so that we can specify the tol (tolerance) parameter in ml_kmeans() to control the tolerance of convergence.
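As a usage sketch of that update (I'm assuming the R-side argument is called tolerance, mirroring the setTol call shown earlier; check args(ml_kmeans) in your sparklyr version):

iris_tbl %>%
  select(Petal_Width, Petal_Length) %>%
  ml_kmeans(centers = 3,
            iter.max = 50,       # forwarded to setMaxIter
            tolerance = 1e-5,    # forwarded to setTol; argument name assumed
            compute.cost = TRUE)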

 



Reposted from: https://www.cnblogs.com/payton/p/5809771.html
