1. If you see "No module named pyspark", copy py4j and pyspark into Python37\Lib\site-packages.
Extract py4j-0.10.7-src.zip and pyspark.zip from
D:\bigdata\spark-2.3.2-bin-hadoop2.7\python\lib
and copy the extracted folders into C:\Program Files\Python37\Lib\site-packages.
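To confirm the copy worked, you can run a quick import check with the same Python 3.7 interpreter. This is a minimal sketch; the exact version string depends on the Spark build you installed.

# check_pyspark.py -- verify that pyspark and py4j can be imported
import pyspark
import py4j

print("pyspark version:", pyspark.__version__)   # should report 2.3.2 for this Spark build
print("py4j imported from:", py4j.__file__)      # should point into site-packages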
2. Start spark-shell
D:\bigdata\spark-2.3.2-bin-hadoop2.7\bin\spark-shell.cmd
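The same bin directory also contains pyspark.cmd, which opens an interactive Python shell with a SparkContext already available as sc. A quick smoke test you can type there (a minimal sketch; the printed values assume the sample list below):

# Inside the pyspark shell, sc is already defined.
nums = sc.parallelize([1, 2, 3, 4, 5])
print(nums.count())   # should print 5
print(nums.sum())     # should print 15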
3. Test script: spark.py
If spark.py does not run as-is, copy it into D:\bigdata\spark-2.3.2-bin-hadoop2.7\examples\src\main\python and try again.
# coding=utf-8
from pyspark import SparkConf, SparkContext

# Create SparkConf
conf = SparkConf().setAppName("WordCount").setMaster("local")
# Create SparkContext
sc = SparkContext(conf=conf)

# Simulate input data locally
datas = ["jzh, car", "jzh, house", "idodo, house"]

# Create RDD
rdd = sc.parallelize(datas)
print("Record count: " + str(rdd.count()))
# print(rdd.first())

# WordCount
# flatMap(lambda line: line.split(","))  -- split each line into words
# map(lambda word: (word, 1))            -- map each word to a (word, 1) tuple, e.g. (jzh, 1), (car, 1)
# reduceByKey(lambda a, b: a + b)        -- sum the values that share the same key
wordcount = rdd.flatMap(lambda line: line.split(",")) \
    .map(lambda word: (word.strip(), 1)) \
    .reduceByKey(lambda a, b: a + b)

# collect() brings the RDD back to the driver as a list
for wc in wordcount.collect():
    print(wc[0] + " appears " + str(wc[1]) + " time(s)")

sc.stop()
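To run the script you can use spark-submit from the Spark bin directory, for example:
D:\bigdata\spark-2.3.2-bin-hadoop2.7\bin\spark-submit.cmd spark.py
With the sample data above it should report a record count of 3 and then print, in some order: jzh appears 2 time(s), house appears 2 time(s), car appears 1 time(s), idodo appears 1 time(s).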