Spark is a platform for processing big data in parallel, including streaming workloads, and it can be dramatically faster than doing the same work on a single machine. There are plenty of resources and reading out there if you want to learn more about Spark, so I will just dive into the installation and a simple PySpark program that counts and sorts the words in a book. Basically, a quick way to find the keywords or most frequent words in a book.
I want to use PySpark on my local machine running OS X. PySpark is the library that marries Python and Spark.
After pip-installing it, I ran into an error that said “No Java runtime present, requesting install.” If you encounter the same error, you can refer to this Stack Overflow post. I basically added export JAVA_HOME="/Library/Internet Plug-Ins/JavaAppletPlugin.plugin/Contents/Home" in my Mac terminal. That solved the error and I was able to run Spark on my computer.
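Once JAVA_HOME is set, a quick sanity check is to spin up a local SparkContext from Python. This is just a minimal sketch (the app name and the little range job are arbitrary), but it will fail right away if Java still can't be found:

from pyspark import SparkContext

# Creating the context launches the JVM, so a missing Java runtime
# would show up here.
sc = SparkContext("local", "SanityCheck")
print(sc.parallelize(range(10)).sum())  # prints 45 if everything is wired up
sc.stop()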
import re
from pyspark import SparkConf, SparkContext

def normalizeWords(text):
    # Lowercase the line and split it on any run of non-word characters.
    return re.compile(r'\W+', re.UNICODE).split(text.lower())

# Run Spark locally with a single-node "cluster".
conf = SparkConf().setMaster("local").setAppName("WordCount")
sc = SparkContext(conf = conf)

# Load the book and break each line into words.
input = sc.textFile("book.txt")
words = input.flatMap(normalizeWords)

# Count each word, then flip to (count, word) so sortByKey orders by frequency.
wordCounts = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
wordCountsSorted = wordCounts.map(lambda x: (x[1], x[0])).sortByKey()
results = wordCountsSorted.collect()

for result in results:
    count = str(result[0])
    word = result[1].encode('ascii', 'ignore')
    if (word):
        print(word.decode() + ":\t\t" + count)
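The swap from (word, count) to (count, word) before sortByKey is what makes the output come out ordered by frequency. If you'd rather keep the pairs as (word, count), RDD.sortBy lets you sort on the count directly; a small sketch, assuming the same wordCounts RDD as above:

# Sort by the count (second element of each pair), most frequent first.
topWords = wordCounts.sortBy(lambda pair: pair[1], ascending=False)
for word, count in topWords.take(20):
    print(word + ":\t\t" + str(count))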
The re module provides regular-expression support in Python; for heavier text mining you could use spaCy or NLTK instead of (or together with) re.
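To see what the regex-based tokenizer actually does before reaching for a heavier library, you can run normalizeWords on a line of text by itself (the sentence here is just made up for illustration):

import re

def normalizeWords(text):
    # Lowercase, then split on any run of non-word characters.
    return re.compile(r'\W+', re.UNICODE).split(text.lower())

print(normalizeWords("Self-Employment: Building an Internet Business"))
# ['self', 'employment', 'building', 'an', 'internet', 'business']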
If you want to learn more about PySpark, I recommend Frank Kane; he has an excellent online course on Spark.