Install Spark (or PySpark) on your computer

Spark is a platform for streaming and parallel computing on big data, which makes processing it much faster. There are plenty of resources for learning more about Spark, so I will dive straight into the installation and a simple PySpark script that counts and sorts the words in a book. Basically, it tells you the keywords, or most frequent words, of a book.

I want to use PySpark on my local OS X machine. PySpark is the library that marries Python and Spark.

To install PySpark, you can just run ‘pip install pyspark’, but you have to install Java first. Go here to see the full details of the PySpark installation.

After the pip install, I ran into an error that said “No Java runtime present, requesting install.” If you encounter the same error, you can refer to this Stack Overflow post. I basically ran export JAVA_HOME="/Library/Internet Plug-Ins/JavaAppletPlugin.plugin/Contents/Home" in my Mac terminal. It solved the error and I was able to run Spark on my computer.
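Before starting Spark, it can help to confirm that Java is actually visible from Python. The following is a minimal sketch and not part of the original setup: `find_java` is a hypothetical helper name, and it only inspects the `JAVA_HOME` variable and the `PATH`.

```python
import os
import shutil


def find_java(env=None):
    """Return a path to a java executable, or None if none is found.

    Checks $JAVA_HOME/bin/java first, then falls back to searching PATH.
    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # shutil.which searches the given PATH string for an executable
    return shutil.which("java", path=env.get("PATH", ""))


if __name__ == "__main__":
    java = find_java()
    print("Java found at: " + java if java else "No Java found; set JAVA_HOME first.")
```

If this prints that no Java was found, the `export JAVA_HOME=...` line above (or a fresh Java install) is the thing to check first.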

import re
from pyspark import SparkConf, SparkContext

def normalizeWords(text):
    # Lowercase the line and split it on runs of non-word characters
    return re.compile(r'\W+', re.UNICODE).split(text.lower())

conf = SparkConf().setMaster("local").setAppName("WordCount")
sc = SparkContext(conf=conf)

lines = sc.textFile("book.txt")
words = lines.flatMap(normalizeWords)

# Pair each word with 1, then sum the 1s per word
wordCounts = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
# Flip to (count, word) so that sortByKey() orders by frequency
wordCountsSorted = wordCounts.map(lambda x: (x[1], x[0])).sortByKey()
results = wordCountsSorted.collect()

for result in results:
    count = str(result[0])
    # Drop non-ASCII characters and skip words that end up empty
    word = result[1].encode('ascii', 'ignore')
    if word:
        print(word.decode() + ":\t\t" + count)

The re library is Python's regular expression module, which is handy for basic text mining. You could also use spaCy or NLTK instead of (or together with) re.

If you want to learn more about PySpark, I recommend Frank Kane; he has an excellent online course on Spark.
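To see what the Spark pipeline computes without starting Spark at all, the same count-and-sort logic can be sketched in plain Python. This is only an illustration, not a replacement for the Spark job: `count_words` is a hypothetical helper and the sample sentences are made up.

```python
import re
from collections import Counter


def normalize_words(text):
    """Lowercase the text and split it on runs of non-word characters."""
    return [w for w in re.split(r"\W+", text.lower()) if w]


def count_words(lines):
    """Return (word, count) pairs sorted by ascending count, like sortByKey()."""
    counts = Counter()
    for line in lines:
        counts.update(normalize_words(line))
    # Sort by count first, then alphabetically so that ties have a stable order
    return sorted(counts.items(), key=lambda pair: (pair[1], pair[0]))


sample = ["The cat saw the dog.", "The dog ran."]
for word, count in count_words(sample):
    print(word + ":\t\t" + str(count))
```

The most frequent word ends up last, which matches the ascending order that `sortByKey()` produces in the Spark version.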


Developing a time series stock API with Python, Bokeh, and Flask, and deploying it to Heroku

My final API looks like this:


You can search for a stock here on my API link:

If you’re interested in looking up more ticker symbols for company stocks, you can go here.

For example, to search for Barnes Group, enter its ticker symbol “B” instead of “Barnes”. The symbol has to be entered in upper case, as in the following table:


It’s not the most beautiful or amazing app, but hours of coding in Python made me appreciate how much work goes into tools like Ameritrade and how amazing they are. Making an online data visualization tool is not an easy job, especially when you want to render data from another site or database.

To be honest, I could have built a better-looking search tool more efficiently with R Shiny. However, this API is my milestone project with The Data Incubator (from even before the program started on Jun. 19, 2017), and we were only allowed to use Flask, Bokeh, and Jinja with Python, and to deploy the API to Heroku. So here we go: these are notes that may help you, or remind me later, when I need to develop another API using Python.

First, go to Quandl to register for an API key, since the API will render data from Quandl.

Second, know how to request data from Quandl. You can get the data in two ways: 1) use the Requests library or simplejson to request a JSON dataset from Quandl; 2) use the quandl Python library. I requested data using the quandl library because it’s much easier to use.

Third, develop a Flask app that plots the dataset for the user’s ticker input. See the following Flask framework:
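As a rough sketch of the first option, the request URL can be built with the standard library alone. Two things here are assumptions that the text does not spell out: the Quandl v3 REST layout (`/api/v3/datasets/<database>/<dataset>.json`) and the `WIKI` database code. Quandl's services have changed since 2017, so check the current documentation before relying on either.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Quandl's v3 REST layout, /api/v3/datasets/<database>/<dataset>.json
QUANDL_BASE = "https://www.quandl.com/api/v3/datasets"


def build_quandl_url(ticker, api_key, database="WIKI", start_date=None):
    """Build the JSON request URL for one ticker. "WIKI" is an assumed database code."""
    params = {"api_key": api_key}
    if start_date:
        params["start_date"] = start_date  # ISO date string, e.g. "2017-01-01"
    return "%s/%s/%s.json?%s" % (QUANDL_BASE, database, ticker.upper(), urlencode(params))


def fetch_prices(ticker, api_key):
    """Download and decode the JSON payload (needs network access and a valid key)."""
    with urlopen(build_quandl_url(ticker, api_key)) as response:
        return json.loads(response.read().decode("utf-8"))


print(build_quandl_url("b", "YOUR_API_KEY", start_date="2017-01-01"))
```

The second option is shorter: with the quandl library, setting `quandl.ApiConfig.api_key` and calling `quandl.get("WIKI/B")` returns a pandas DataFrame directly, which is why it is the easier route.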

from flask import Flask, render_template, request, redirect
import quandl as Qd
import pandas as pd
import numpy as np
import os
import time
from bokeh.io import curdoc
from bokeh.layouts import row, column, gridplot
from bokeh.models import ColumnDataSource
from bokeh.models.widgets import PreText, Select
from bokeh.plotting import figure, show, output_file
from bokeh.embed import components, file_html
from os.path import dirname, join

app = Flask(__name__)

### Load data from Quandl
# Define your dataframe (mydata) and your plotting helper (create_figure) here

@app.route("/plot", methods=['GET', 'POST'])
def plot():
    # Load the dataframe and plot it
    plot = create_figure(mydata, current_feature_name)
    # components() returns the <script> and <div> to embed in the template
    script, div = components(plot)
    return render_template('Plot.html', script=script, div=div)

@app.route('/', methods=['GET', 'POST'])
def main():
    return redirect('/plot')

if __name__ == "__main__":
    app.run(debug=True)
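Because ticker symbols have to be entered in upper case, the user's input is worth cleaning up before it is passed on to Quandl. Here is a small sketch of such a check; `normalize_ticker` is a hypothetical helper, and the pattern is only my assumption of what a ticker symbol looks like.

```python
import re

# Assumed shape of a ticker symbol: 1-5 capital letters, optionally followed by
# a dot or hyphen and one or two more letters (e.g. "B", "AAPL", "BRK.A").
TICKER_PATTERN = re.compile(r"[A-Z]{1,5}([.-][A-Z]{1,2})?")


def normalize_ticker(raw):
    """Strip whitespace, upper-case the input, and reject non-ticker text."""
    symbol = raw.strip().upper()
    if not TICKER_PATTERN.fullmatch(symbol):
        raise ValueError("Not a valid ticker symbol: %r" % raw)
    return symbol


print(normalize_ticker(" b "))   # B
print(normalize_ticker("aapl"))  # AAPL
```

Inside a Flask view, this could be applied to the value read from the submitted form, so that "b" and "B" both find Barnes Group while "Barnes Group" is rejected with a clear error.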

Fourth, make your Flask app work on your local computer; I mean it should look exactly like the API above before you deploy it to Heroku. My local API directory and files are organized in this way:

The main Python script renders data from Quandl, plots the data with Bokeh, and binds it to the Flask framework for deployment to Heroku.

Fifth, commit everything above to a Git repository and push it to Heroku, using the Git and Heroku command lines:

git init
git add .
git commit -m 'initial commit'
heroku login
heroku create <your app name>
git push heroku master

Last but not least, in case you want to edit your Python code or other files to update your Heroku API, you can do the following:

### Update the Heroku app from GitHub
heroku login
heroku git:clone -a <your app name>
cd <your app name>
# Make changes here, and then follow the next steps to push the changes to Heroku
git remote add <your git repository name> https://github.com/<your git username>/<your git repository name>
git fetch <your git repository name> master
git reset --hard <your git repository name>/master
git push heroku master --force

Other reads that might be helpful here:

  1. Bokeh and Flask API blog;
  2. How to deploy a Python API to Heroku.


Artificial intelligence for urban tree species identification

No matter which part of the world you live in, very diverse tree species are planted around the urban areas where we live. Urban trees have many functions: for example, they provide habitats for wildlife, clean the air and water, provide significant health and social benefits, and improve property values too. Imagine waking up on a beautiful morning to birds singing outside your apartment because many beautiful trees grow outside your place. How awesome is that!

However, tree planting, surveying, and species identification require an enormous amount of work that has literally taken generations of input and care. What if we could identify tree species from satellite imagery? How much faster and how much better could we identify tree species and record their geolocations as well?

A city has its own tree selection and planting plan, but homeowners have their own tree preferences, which makes the identification work a bit complicated.


(Photo from Google Earth Pro June 2010 in Chicago area)

It’s hard to tell how many tree species are planted in the image above. But we can (zoom in and) tell that these trees have slightly different crown shapes, colors, and textures. From here, I only need a valid dataset that tells me which tree I am looking at, namely the city's tree survey with the trees' geolocation records. With that, I will be able to teach a computer to pick out similar features for the species I’m interested in identifying.


These are Green Ash trees (I marked as green dots here).


These are Littleleaf Linden; they are marked as orange dots.

Let me run a Caffe deep learning model (a neural network, which is one kind of artificial intelligence model) for image classification on these two species, and see if the computer can separate them using my training and test datasets.

The great news is that the model can actually tell the difference between these two species. I ran the model for 300 epochs (passes over the training data), with the learning rate going from 0.01 to 0.001, on about 200 images of the two species. 75% of the images went to training the model and 25% to testing. The result is not bad: we get around 90% accuracy (orange line) and a loss of less than 0.1 on the training dataset.
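The 75/25 split can be sketched in a few lines of plain Python. This is an illustration under assumptions rather than the exact procedure I used: the file names are hypothetical, and I assume 100 image chips per species to get to roughly 200 images.

```python
import random


def train_test_split(items, train_fraction=0.75, seed=42):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names: 100 image chips per species, about 200 in total
images = ["green_ash_%03d.png" % i for i in range(100)]
images += ["littleleaf_linden_%03d.png" % i for i in range(100)]

train, test = train_test_split(images)
print(len(train), len(test))  # 150 50
```

Shuffling before cutting matters here: without it, the test set would contain only the second species.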


I threw a random test image at the model (a green ash screenshot in this case), and it returned its prediction.


I will be working on identifying another 20 tree species and their geolocations next time.

Let’s get some answers about which trees are planted in the Chicago area and how they relate to property values (an interesting question to ask), and also what ecological benefits and functions these trees provide (I'll leave this to urban ecologists, if my cloud computer can identify the species). Check out my future work ;-).


Can artificial intelligence help us identify wildfire damage from satellite imagery faster?

The following work was done by me and Dr. Shay Strong while I was a data engineer consultant under Dr. Strong's supervision at OmniEarth Inc. All IP rights to the work belong to OmniEarth. Dr. Strong is the Chief Data Scientist at OmniEarth Inc.

A wildfire that had been burning in the Great Smoky Mountains of Tennessee raced rapidly northward toward Gatlinburg and Pigeon Forge between late Nov. and Dec. 2nd, 2016. At least 2,000 buildings were damaged or destroyed across 14,000 acres of residential and recreational land, and the wildfire also claimed 14 lives and injured 134 people. It is considered the largest natural disaster in the history of Tennessee.

After obtaining 0.4 m resolution satellite imagery of the wildfire damage in Gatlinburg and Pigeon Forge from DigitalGlobe, OmniEarth Inc. created an artificial intelligence (AI) model that was able to locate and assess the property damage caused by the wildfire. We hope the model can also help firefighters and responders locate and evaluate damaged areas more rapidly in similar natural disasters in the future.

The fire damage area was identified by the model on top of the satellite images.



Fig 1. The final result of the fire damage range in TN from our AI model.

1. Artificial intelligence model behind the wildfire damage

With increasing cloud computing power and a better understanding of computer vision, more and more AI technology is helping us extract useful information from the huge volume of photos, maps, and other imagery we produce every day.

Before diving into the AI model behind the wildfire damage, note that in this case we only want to tell fire-damaged buildings from intact buildings. We have two options: (1) spend hours and hours browsing through the satellite images and manually separating the damaged and intact buildings, or (2) develop an AI model to automatically identify the damaged area with a tolerable error. With the first option, it would easily take a geospatial technician more than 20 hours to identify the damaged area across the 50,000 acres of satellite imagery. The second option is more viable and sustainable, in that the AI model could automatically identify the damaged areas and buildings over the same area in less than 1 hour. This is accomplished by image classification, specifically with convolutional neural networks (CNNs), because CNNs generally work better than other neural network architectures for detecting and recognizing objects in images.



Fig 2. Our AI model workflow.

Artificial neural networks are a family of machine learning models inspired by the biological neurons of the human brain. They were first conceived in the mid-20th century, but a major breakthrough came with Geoffrey Hinton’s work published in the mid-2000s. Our eyes work like a camera that sees the ‘picture’; our brain then processes it and constructs the objects we see from their shape, color, and texture. The information for “seeing” and “recognizing” passes through our biological neurons from our eyes to our brain. The AI model we created works in a similar way. The imagery is passed through the artificial neural network, and the objects the network has been taught are identified with a certain accuracy. In this case, we taught the network the difference between burnt and not-burnt structures in Gatlinburg and Pigeon Forge, TN.
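To make the neuron idea concrete, here is a toy sketch of a single artificial neuron: a weighted sum of its inputs squashed into a score between 0 and 1. It is not the model we built. The three features, the weights, and the bias are all made up for illustration; a real network has many such units and learns its weights from the training images.

```python
import math


def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs squashed to (0, 1) by a sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))


# Made-up features for one building chip, each scaled to 0..1:
# [darkness of the roof, fraction of ash-grey pixels, sharpness of roof edges]
burnt_chip = [0.9, 0.8, 0.1]
intact_chip = [0.2, 0.1, 0.9]

# Made-up weights; a real network learns these from the training images
weights = [2.0, 3.0, -4.0]
bias = -0.5

print("burnt score:  %.2f" % neuron(burnt_chip, weights, bias))
print("intact score: %.2f" % neuron(intact_chip, weights, bias))
```

With these numbers the burnt chip scores close to 1 and the intact chip close to 0, which is the kind of separation the training process tries to reach on real imagery.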

2. How did we build the AI model

We broke down the wildfire damage mapping process into four parts (Fig 2). First, we obtained the 0.4 m resolution satellite images from DigitalGlobe. We created a training and testing dataset of 300 small image chips (as shown in Fig 3, A and B) that contained both burnt and intact buildings; 2/3 of them were used to train the AI model (a CNN in this case), and 1/3 were used to test it. Ideally, more training data representing burnt and non-burnt structures would be used, so that the network learns all the variations and orientations of a burnt building. A sample set of 300 is on the statistically small side, but it is useful for testing capability and evaluating preliminary performance.

Fig 3 (A). A burnt building. Fig 3 (B). Intact buildings.
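Small image chips like these have to be cut out of a much larger scene. One simple way to do that is a regular grid, sketched below. This is an assumption about the mechanics rather than a description of our actual pipeline, and `chip_windows` and the image sizes are hypothetical.

```python
def chip_windows(width, height, chip_size, stride=None):
    """Yield (left, top, right, bottom) pixel boxes that tile an image.

    Chips that would run past the right or bottom edge are skipped, so every
    box is exactly chip_size x chip_size pixels.
    """
    stride = chip_size if stride is None else stride
    for top in range(0, height - chip_size + 1, stride):
        for left in range(0, width - chip_size + 1, stride):
            yield (left, top, left + chip_size, top + chip_size)


# Hypothetical numbers: a 1000 x 600 pixel scene cut into 200-pixel chips.
# At 0.4 m per pixel, each chip would cover 80 m x 80 m on the ground.
boxes = list(chip_windows(1000, 600, 200))
print(len(boxes))  # 15
print(boxes[0])    # (0, 0, 200, 200)
```

A stride smaller than the chip size would give overlapping chips, which is one common way to get more training samples out of the same imagery.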

Our AI model was a CNN built on Theano (with a GPU backend). Theano was created by the machine learning group at the University of Montreal, led by Yoshua Bengio, one of the pioneers of artificial neural networks. Theano is a Python library that lets you define and evaluate mathematical expressions with vectors and matrices. As a human, you can imagine that our daily decision-making is based on matrices of perceived information as well, e.g. when deciding which car you want to buy. The AI model helps us identify which image pixels and patterns are fundamentally different between burnt and intact buildings, similar to how people give a different weight or score to the brand, model, and color of the car they want to buy. Computers are great at matrix calculations, and Theano takes this to the next level because it runs multiple matrix calculations in parallel, which speeds up the whole computation tremendously. Theano has no particular neural network built in, so we used Keras on top of Theano. Keras allows us to build an AI model with a minimalist design for the training layers of a neural network and to run it more efficiently.

Our AI model was run on an AWS EC2 g2.2xlarge instance. We set the learning rate (lr) to 0.01. A smaller learning rate forces the network to learn more slowly, but it may also lead to better classification convergence, especially in cluttered scenes where a large amount of object confusion can occur. In the end, our AI model reached 97% accuracy with a loss of less than 0.3 over three runs completed within a minute, and it took less than 20 minutes to run on our 3.2 GB of satellite images.

The model result was exported and visualized using QGIS. QGIS is an open source geographic information system that allows you to create, edit, visualize, analyze, and publish geospatial information and maps. We also inspected the map by comparing our fire damage results to the briefing maps produced by Wildfire Today and the Incident Information System.
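The effect of the learning rate is easy to see on a toy problem. The sketch below minimises f(w) = w², whose minimum is at w = 0, with plain gradient descent. It is only an illustration of the trade-off, not our training code, and the three learning rates are picked to show the contrast.

```python
def gradient_descent(learning_rate, steps, start=1.0):
    """Minimise f(w) = w**2 by repeatedly stepping against the gradient 2*w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w


for lr in (1.1, 0.1, 0.01):
    print("lr=%g -> w after 100 steps: %.4g" % (lr, gradient_descent(lr, 100)))
```

With lr=1.1 every step overshoots the minimum and w grows without bound; with lr=0.1 it converges quickly; with lr=0.01 it still converges, only more slowly. On a real network the loss surface is far bumpier, which is where a smaller rate can pay off.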


Fig 4 (A). Burnt and intact buildings at the OmniEarth parcel level, laid out on top of the imagery.


Fig 4 (B). The burn impact (in red) on top of the Great Smoky Mountains from late Nov. to early Dec. 2016.

Satellite image classification is a challenging problem that lies at the crossroads of remote sensing, computer vision, and machine learning. Many currently available classification approaches are not suited to handling high-resolution imagery with its inherently high variability in geometry and collection times. OmniEarth is a startup company that is passionate about the business of science and about scaling quantifiable solutions to meet the world’s growing need for actionable information.

Contact OmniEarth for more information:

For more detailed information, please contact Dr. Zhuangfang Yi (Twitter: geonanayi) or Dr. Shay Strong (Twitter: shaybstrong).

Why is “#” important? Data mining and streaming from Twitter using Python

To be able to extract big data from Twitter, you have to register an app for the Twitter API.

I installed Python 3.5 and edited my Windows 8.1 environment variables under the advanced system settings. I then downloaded Tweepy (for extracting data from Twitter using Python), but it could not be installed from my Command Prompt. That reminded me that I had to be logged in as the administrator to be able to install Tweepy. Of course, right?! Sometimes you just lose the battle by doing something not very smart. I logged back in as the administrator and the problem was solved.

Marco Bonzanini has written a full seven-part blog series about data mining from Twitter, if you are ever interested in doing big data analysis.
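Once tweets have been collected, the “#” in the title comes into play: hashtags are the easiest handle for grouping tweets by topic. Here is a minimal sketch using only the standard library. The tweets are made up and stand in for text collected through the Twitter API, and `top_hashtags` is a hypothetical helper.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def top_hashtags(tweets, n=3):
    """Count hashtags case-insensitively and return the n most common."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts.most_common(n)


# Made-up tweets standing in for text collected through the Twitter API
tweets = [
    "Learning #Python for #DataScience",
    "Streaming tweets with #python and tweepy",
    "#BigData is everywhere",
]
print(top_hashtags(tweets, n=1))  # [('python', 2)]
```

Lowercasing the tags means “#Python” and “#python” are counted as the same topic.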