Spark is a platform for streaming and parallel computation that lets you process big data much faster. There are tons of resources you can read to learn more about Spark, so I will just dive into the installation and a simple PySpark script that counts and sorts the words in a book. Basically, the goal is to find the keywords, i.e. the most frequent words, in a book.
I want to use PySpark on my local machine running OSX. PySpark is the library that marries Python and Spark.
To install PySpark, you can just run 'pip install pyspark', but you have to install Java first. Go here to see the full details of the PySpark installation.
After the pip install, I ran into an error that said "No Java runtime present, requesting install." If you encounter the same error, you can refer to this Stack Overflow post. I basically added an export JAVA_HOME line in my Mac terminal, as shown below. It solved the error and I was able to run Spark on my computer.
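In short, the two terminal commands are the following (the JAVA_HOME path below is the one that worked on my OSX setup; yours may differ depending on how Java was installed):

pip install pyspark
export JAVA_HOME="/Library/Internet Plug-Ins/JavaAppletPlugin.plugin/Contents/Home"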
import re

from pyspark import SparkConf, SparkContext


def normalizeWords(text):
    # Lowercase the text and split on any run of non-word characters
    return re.compile(r'\W+', re.UNICODE).split(text.lower())


conf = SparkConf().setMaster("local").setAppName("WordCount")
sc = SparkContext(conf=conf)

# Load the book as an RDD of lines, then break each line into words
input = sc.textFile("book.txt")
words = input.flatMap(normalizeWords)

# Count occurrences of each word, then flip (word, count) to (count, word) so we can sort by count
wordCounts = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
wordCountsSorted = wordCounts.map(lambda x: (x[1], x[0])).sortByKey()

results = wordCountsSorted.collect()

for result in results:
    count = str(result[0])
    word = result[1].encode('ascii', 'ignore')
    if word:
        print(word.decode() + ":\t\t" + count)
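To try this yourself, save the script (for example as word-count.py, a name I am making up here) next to book.txt and run it with spark-submit word-count.py; since the script creates its own SparkContext, running it with plain python also works for a quick local test. Note that sortByKey() sorts in ascending order, so the most frequent words print last; pass ascending=False to sortByKey() if you want them first.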
The re module is Python's built-in regular-expression library for this kind of text processing; as alternatives, you could use spaCy or NLTK instead of (or together with) re.
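As a rough sketch of what that swap could look like, here is a hypothetical normalizeWordsNltk variant that uses NLTK's word_tokenize in place of the raw regex (it assumes you have run pip install nltk and downloaded the punkt tokenizer data):

import nltk
from nltk.tokenize import word_tokenize

# One-time setup, downloads the tokenizer models:
# nltk.download('punkt')

def normalizeWordsNltk(text):
    # word_tokenize handles punctuation and contractions more carefully than r'\W+';
    # keep only purely alphabetic tokens
    return [token for token in word_tokenize(text.lower()) if token.isalpha()]

You could then pass normalizeWordsNltk to flatMap in place of normalizeWords in the script above.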
If you want to learn more about PySpark, I recommend Frank Kane; he has an excellent online course on Spark.