Label Maker: four commands that generate machine learning training datasets from satellite imagery

On January 30, 2018 / January 31, 2018, by Zhuangfang Yi. In AWS, Big Data, Data Science, Deep Learning, Imagery analysis, Machine Learning, Neural Network, Python, Satellite Imagery Processing, Spatial Analysis.

This post was written as the Chinese version of an English blog post. If you have not seen the English post yet, go here: https://developmentseed.org/blog/2018/01/11/label-maker/

Label Maker is a Python library to help in extracting insight from satellite imagery. Label Maker creates machine-learning-ready training data for most popular ML frameworks, including Keras, TensorFlow, and MXNet. It pulls data from OpenStreetMap and combines that with imagery sources like Mapbox or DigitalGlobe to create a single file for use in training machine learning algorithms.

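Introduction:

Label Maker is an open-source Python package we recently developed to help people get more out of satellite imagery. It generates training data that plugs into whichever machine learning (or deep learning) framework you prefer, such as the currently popular TensorFlow from Google or MXNet, and it works with Keras without any trouble. The package pulls data from OpenStreetMap and from Mapbox or DigitalGlobe and turns it into a training dataset. If you write code that connects it to other satellite imagery sources, we very much welcome changes and pull requests on our GitHub repo. And if you want to learn object detection or image classification, we have prepared several examples as well; please try them and leave us feedback.

Now for the main text!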

ob_tf_result_fig1

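Applications of machine learning and deep learning algorithms in computer vision are advancing by the day. Traditional satellite image interpretation is fast and convenient; you can do it with ERDAS, ArcGIS, and similar software. But these traditional methods have a limitation: once the imagery has a somewhat higher resolution and the images get larger, the desktop software and your desktop computer often cannot cope. How do we solve this quickly and effectively? That is the question I answer today: how to use modern GPUs and machine learning to process and interpret satellite imagery at scale.

First, a little background. Deep learning in computer vision today falls roughly into three categories: supervised learning, unsupervised learning, and reinforcement learning. Traditional satellite image interpretation also has supervised and unsupervised methods. Supervised learning can be understood this way: you tell the image interpretation software what rivers, oceans, and forests look like, and the software then computes and classifies according to the thresholds you gave it. In unsupervised learning you tell the software nothing, and it classifies the given imagery on its own. Rivers and oceans, for example, look different in the red, green, and blue bands, so the software can separate the two classes by their different band thresholds.

Deep learning can do both supervised and unsupervised learning. As I said, traditional software already exists, so why use deep learning? Just because everyone is into it lately? No, no, no...

Besides speeding up computation through heavy use of GPUs, deep learning on satellite imagery has another advantage: once a model is trained, its trained model weights can be reused again and again on areas it has not seen. The more rounds of training and the longer the training time, the better the model tends to perform. This link shows the road network we interpreted with machine learning. Road network interpretation is the hardest task in satellite image interpretation. I won't tell you why yet; take a guess, and I will reveal the answer next time. We have also built many related examples of deep learning on satellite imagery: finding buildings with TensorFlow object detection, a classification model built with MXNet and Amazon SageMaker, and another classification model built with Keras on Amazon cloud machines.

Enough preamble. At the pace deep learning is developing, building new algorithms is actually not that hard. The hard part is preparing training datasets that machine learning and deep learning can use.

Today I am proud to introduce our Python package, Label Maker. Label Maker is open source, so feel free to star and fork it on GitHub, and we encourage everyone to contribute. Label Maker fetches satellite imagery from Mapbox and vector data from OpenStreetMap (roads, buildings, forests, and so on), then packages them into training data. You can feed this data into your favorite supervised learning or machine learning framework. Label Maker needs only four commands to generate a training dataset.

After pip install label_maker, just run these four commands.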

label-maker download         # download OpenStreetMap QA Tiles
label-maker labels           # create your ground-truth labels
label-maker images           # download satellite imagery tiles
label-maker package          # package tiles and labels into data.npz

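Of course, I skipped two small steps:

To download satellite imagery from Mapbox you need a token for their imagery API, so go register a Mapbox account.

Then, before running the four commands above, create a configuration file like this: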

{
  "country": "vietnam",
  "bounding_box": [105.42,20.75,106.41,21.53],
  "zoom": 17,
  "classes": [
    { "name": "Buildings", "filter": ["has", "building"] }
  ],
  "imagery": "http://a.tiles.mapbox.com/v4/mapbox.satellite/{z}/{x}/{y}.jpg?access_token=ACCESS_TOKEN",
  "background_ratio": 1,
  "ml_type": "classification"
}

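Our Python package reads the parameters in the configuration file to generate the training dataset you need. Remember to replace ACCESS_TOKEN in the configuration file with the token generated in your Mapbox account.

Once the four commands have run successfully, you have data.npz and can run your favorite machine learning algorithm, for example: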
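
To get a feel for how much data a configuration like this asks for, the bounding box and zoom level can be turned into a tile count. The sketch below is not part of Label Maker. It assumes the standard Web Mercator XYZ tiling suggested by the {z}/{x}/{y} pattern in the imagery URL, and it assumes the bounding box is ordered [west, south, east, north]; the helper names lonlat_to_tile and count_tiles are made up for illustration.

```python
import math

def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) index of the XYZ tile containing a lon/lat point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the east/south edge stay inside the grid
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

def count_tiles(bounding_box, zoom):
    """Count tiles touched by [west, south, east, north] at a zoom level."""
    west, south, east, north = bounding_box
    x_min, y_min = lonlat_to_tile(west, north, zoom)  # tile y grows southward
    x_max, y_max = lonlat_to_tile(east, south, zoom)
    return (x_max - x_min + 1) * (y_max - y_min + 1)

# Values copied from the configuration file above
print(count_tiles([105.42, 20.75, 106.41, 21.53], 17))
```

A large count is a hint to shrink the bounding box or lower the zoom before downloading imagery.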

import numpy as np
from keras.models import Sequential

# the data, shuffled and split between train and test sets
npz = np.load('data.npz')
x_train = npz['x_train']
y_train = npz['y_train']
x_test = npz['x_test']
y_test = npz['y_test']

# define your model here, example usage in Keras
model = Sequential()
# ...
model.compile(...)

# train
model.fit(x_train, y_train, batch_size=16, epochs=50)
model.evaluate(x_test, y_test, batch_size=16)

For more details, don't forget to visit our GitHub, and please don't be stingy with your likes 👍 and stars ✨.

Working with geospatial data on AWS Ubuntu

On August 31, 2017 / December 8, 2018, by Zhuangfang Yi. In AWS, Big Data, Data mining, Data Science, Geo-Cases, Geocoding, Python, Satellite Imagery Processing, Spatial Analysis.

I’ve stumbled on different sorts of problems while working with geospatial data on the cloud machine. AWS EC2 and Ubuntu sometimes require different setups. This is a quick note for installing GDAL on Ubuntu and how to transfer data from your local machine to your cloud machine without using S3.

To install GDAL


sudo -i
sudo add-apt-repository -y ppa:ubuntugis/ubuntugis-unstable
sudo apt update
sudo apt upgrade # if you already have gdal 1.11 installed
sudo apt install gdal-bin python-gdal python3-gdal # if you don't have gdal 1.11 already installed
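
After the install, it can be handy to confirm from Python that the GDAL command-line tools are actually on the PATH. This is only a small sketch using the standard library; it relies on the gdalinfo tool that ships with the gdal-bin package, and the function name gdal_version is a made-up helper.

```python
import shutil
import subprocess

def gdal_version():
    """Return the output of `gdalinfo --version`, or None if GDAL is missing."""
    if shutil.which("gdalinfo") is None:
        return None
    result = subprocess.run(
        ["gdalinfo", "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or None

version = gdal_version()
print(version if version else "gdalinfo not found on PATH")
```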

To transfer data (SFTP) from your local machine to AWS EC2, you could use FileZilla.

Another option is using S3 with Cyberduck

To set up the environment, please refer to this post and this video.

If you are interested in learning more about the tools, we have:

Geolambda, which provides a few Docker containers you can run for geospatial analysis on the cloud;
If you are interested in applying machine learning to satellite imagery, we have a few tools: 1) Label Maker for training dataset generation; 2) looking-glass for building footprint segmentation; and 3) Pixel-Decoder for road network detection and segmentation.

How to use the online map tool for investing in sustainable rubber cultivation in tropical Asia 如何利用在线地图工具投资热带亚洲可持续天然橡胶种植

On August 18, 2017 / September 29, 2017, by Zhuangfang Yi. In Big Data, Data mining, Data Science, Data visualization, Ecological Economics, Environmental studies, Geo-Cases, Geocoding, Great Mekong region, Green rubber, Imagery analysis, Interactive Map, Natural rubber value chain, R, Spatial Analysis, Thailand, Value chain of tropical crops and products, Web GIS.

Please go ahead and play with the full-screen map here.

This map application was developed to support the Guidelines for Sustainable Development of Natural Rubber, an effort led by the China Chamber of Commerce of Metals, Minerals & Chemicals Importers & Exporters with support from the World Agroforestry Centre (ICRAF), East and Central Asia Office. Asia produces more than 90% of the world's natural rubber, primarily in monoculture for the highest yield in limited growing areas. Rubber is largely harvested by smallholders in remote, undeveloped areas with limited access to markets, which imposes substantial labor and opportunity costs. Typically, rubber plantations are introduced in high-productivity areas, then pushed onto marginal lands by industrial crops and other uses, where they become only marginally profitable for various reasons.

请在这里播放全屏地图。

这个应用地图集的开发是为了支持由中国五矿化工进出口商会和世界农用林业中心等部门联合编制的《可持续天然橡胶指南》。亚洲天然橡胶的产量占全球的90%，且主要是在有限的种植地区内，通过单一的种植，达到最高的产量。橡胶主要是由小农户在偏远的、欠发达的、市场有限的地区通过利用大量的劳动力和机会成本获得的。一般来说，橡胶只应该种植在高产量的地区，但已经被工业化的发展推到了在边缘土地上种植，并因种种原因已经边缘到无利可图。

Rubberplantation

Fig. 1. Rubber plantations in tropical Asia. They bring good fortune to millions of smallholder rubber farmers, but they also cause ecological and environmental damage.

图1：亚洲热带橡胶种植园。它给数以万计的小橡胶农民带来收入，但它也造成了负面的生态和环境的破坏。

The online map tool was developed for smallholder rubber farmers, foreign and domestic natural rubber investors, and governments at different levels.

The online map tool is entitled “Sustainable and Responsible Rubber Cultivation and Investment in Asia” and includes two main sections: “Rubber Profits and Biodiversity Conservation” and “Risks, SocioEconomic Factors, and Historical Rubber Price”.

该在线地图工具开发是为了小胶农、国内外天然橡胶投资者以及政府层面的政府使用。

这个标题为“亚洲可持续和负责任的天然橡胶种植和投资”的在线地图工具，包括两个主要部分：“橡胶利润和生物多样性保护”和“风险、社会经济因素和历史橡胶价格”。

The main user interface is shown in Fig. 2. There are 4 theme graphs and maps.

主用户界面看起来像图表（见图2）。有4个主题图和地图。

p1_section intro

Fig. 2. The main user interface of the online map tool.

图2：在线地图工具的主要用户界面。包括上图可见的“简介”，“第一部分”，“第二部分”，和“社交媒体分享”。

. Section 1 第一部分内容

This graph shows the relationship between “Minimum Profitable Rubber (USD/kg)” (the x-axis of the graph) and “Biodiversity (total species number)” in the 2736 counties that planted natural rubber trees in eight countries in tropical Asia. There are 4312 counties in total, and this map tool presents only the counties where natural rubber is cultivated.

这张图显示了亚洲热带地区八个国家种植天然橡胶树的2736个县的最低橡胶成本（美元/千克）（图的X轴）和生物多样性（总种数）之间的关系。共有4312个县，在这个地图工具中，我们只提供了有天然橡胶种植的2736县相关的内容。

p1_section intro_high

Fig. 3. How to read and use the data from the first graph. Each dot/circle represents a county; its color and size indicate the area planted with natural rubber. When you move your mouse over a dot, you will see, for example, “(2.34, 552) 400000 ha @ Xishuangbanna, China”: 2.34 is the minimum profitable rubber price (USD/kg); 552 is the total number of wildlife species, including amphibians, reptiles, mammals, and birds; “400000 ha” is the total area of natural rubber plantation mapped from satellite images between 2010 and 2013; and “@ Xishuangbanna, China” is the location of the county.

图3：如何阅读和使用第一个图中的数据。每个圆点/圆代表一个县，其颜色和大小表示天然橡胶种植面积。当你移动你的鼠标时，比如你会看到“（2.34，552）400000公顷的“西双版纳、中国”，2.34是最低盈利（成本）橡胶价格（美元/公斤），552是总的野生物种，包括两栖动物、爬行动物、哺乳动物和鸟类。“400000公顷”是2010～2013年间卫星影像种植天然橡胶种植园的总面积。“西双版纳、中国”是本县的地理位置。

Don’t be shy; please go ahead and play with the full-screen map here. The minimum profitable rubber price is the market price of national-standard dry rubber at which you start to make a profit. For example, if the market price of natural rubber is 2.0 USD/kg in the county where your plantation is located but your minimum profitable rubber price is 2.5 USD/kg, you will lose money just by producing rubber. However, if your minimum profitable rubber price is 1.5 USD/kg, you will still make about 0.5 USD/kg profit from your plantation.

请不要拘谨，可以在这里浏览全屏地图。最低橡胶成本换算成国家标准的干橡胶产品的市场价格，这将有助于你理解您所属橡胶园的盈利起始点。例如，如果你所在的橡胶种植区的天然橡胶市场价格是2美元/公斤，但你的最低成本橡胶价格是2.5美元/公斤，意味着你生产橡胶产品就会亏本。然而，如果你的最低成本的橡胶价格是1.5美元/公斤意味着你的种植园仍然会赚约0.5美元/公斤的利润。
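
The arithmetic in the example above can be written out as a tiny calculation. This is only an illustrative sketch: it treats the minimum profitable price as a break-even price per kilogram, and the function name profit_per_kg is a made-up helper, not something from the map tool.

```python
def profit_per_kg(market_price, min_profitable_price):
    """Profit (positive) or loss (negative) in USD per kg of dry rubber."""
    return market_price - min_profitable_price

market_price = 2.0  # USD/kg, from the example above
for min_price in (2.5, 1.5):
    margin = profit_per_kg(market_price, min_price)
    outcome = "profit" if margin > 0 else "loss"
    print(f"minimum profitable price {min_price:.2f} USD/kg: "
          f"{outcome} of {abs(margin):.2f} USD/kg")
```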

A county with a lower minimum profitable price for natural rubber will generally make a better profit in the global natural rubber market. However, as the scientists behind this research, we hope that before you rush to invest and plant rubber in a certain county, you also think about other risks, e.g. biodiversity loss and topographic, tropical storm, frost, and drought risks. They are shown later in this demonstration.

那些天然橡胶经营平均成本最低的县，在全球天然橡胶市场上将获得较好的橡胶利润。然而，作为这项研究背后的科学家，我们希望，当你在某个县匆忙投资成本较低的县市种植橡胶时，也要考虑其他风险，例如生物多样性丧失、地形、热带风暴、霜冻以及干旱风险。这些将被显示在这个演示之后。

p2_section intro_high.gif

Fig. 4. The first map is the “Rubber Cultivation Area”, which shows each county that has rubber trees, colored from yellow (low) to red (high). The second map is the “Minimum Profitable Rubber Price” (USD/kg); again, the higher the minimum profitable price, the less profit farmers and investors receive. The third map is “Biodiversity (Amphibians, Reptiles, Mammals, and Birds)”, with data aggregated from the IUCN Red List and BirdLife International.

图4：第一张地图是“橡胶种植区”，它显示了每个县的橡胶树种植数量从低到高的颜色，即从黄色到红色。第二张图“最低成本”（美元/千克），橡胶的平均成本越高，橡胶园的经营者就会获得更少的利润。第三地图是“生物多样性（两栖动物、爬行动物、哺乳动物和鸟类）”，数据来自世界自然保护联盟红色名录IUCN-Redlist和国际鸟盟聚集BirdLife International。

. Section 2 第二部分

We also show the different types of risk that investors and smallholder farmers face when they invest in and plant rubber trees. A rubber tree does not produce latex before it is 7 years old, and tree owners generally won’t make any profit until the tree is around 10 years old. In this section, we present “Topographic Risk”, “Tropical Storm”, “Drought Risk”, and “Frost Risk”.

我们还展示了投资者和小农投资种植橡胶树时会面临的不同风险类型。橡胶树种植前7年在橡胶树不生产任何胶乳的情况下是没有任何盈利的，甚至橡胶园的经营者一般在橡胶树种下10年之前都不会获利。该部分中，我们提出了“地形风险”、“热带风暴”、“干旱风险”和“霜冻风险”。

p3_section intro_high.gif

Fig. 5. Section 2, “Risks, SocioEconomic Factors and Historical Rubber Price”, has seven theme maps and interactive graphs: “Topographic Risk”, “Tropical Storm”, “Drought Risk”, “Frost Risk”, “Average Natural Rubber Yield (kg/ha.year)”, “Minimum Wage for the 8 Countries (USD/day)”, and “10 years Rubber price”.

图5：第2节“风险、社会经济因素和橡胶价格历史”有七种不同的主题地图和互动图表。它们是“地形风险”、“热带风暴”、“干旱风险”、“霜冻风险”、“平均天然橡胶产量（千克/公顷）”、“8个国家的最低工资（美元/天）”和“10年橡胶价格”。

If you are interested in how the risk theme maps were produced, Dr. Antje Ahrends and her coauthors have a peer-reviewed article published in Global Environmental Change in 2015. The “Average Natural Rubber Yield (kg/ha.year)” and “Minimum Wage for the 8 Countries (USD/day)” datasets were obtained from the International Labour Organization (ILO, 2014) and FAO. “10 years Rubber price” was scraped from IndexMundi Natural Rubber Price.

这个互动地图集中展示的所有内容都是有科学依据的。如果你想知道风险专题地图是如何编制的，Antje Ahrends博士和其他合作者有一篇同行评审的论文，发表在2015年的国际期刊《全球环境变化》。“平均天然橡胶产量（公斤/公顷/年）”和“8国家最低工资（元/天）”的数据来自国际劳工组织（ILO，2014年）和联合国粮农组织。“10年橡胶价格”来自于天然橡胶的价格indexmudi。

Dr. Chuck Cannon and I are wrapping up a peer-reviewed journal article that explains the data collection, analysis, and policy recommendations based on the results, and we will share the link once it is available. Dr. Xu Jianchu and Su Yufang provided guidance that shaped the development of the online map tool. We could not have gathered the datasets, or gained insight into how to cultivate, manage, and invest in natural rubber responsibly, without the other scientists and researchers who have studied and contributed to the field for years. We thank the Wildlife Conservation Society, many other NGOs, and the national rubber research departments in Thailand and Cambodia for their support during our field investigations in 2015 and 2016.

Chuck Cannon博士和我正在撰写一篇同行评议的科研期刊文章，用来解释该地图集生成的数据收集、分析等等，还包括了政策建议。文章一旦发表，我们会和您分享文章的链接。许建初博士和苏宇芳博士为在线地图集的开发提供了非常宝贵的意见和建议。我们无法收集数据集、并在没有其他科学家和研究人员的研究和贡献的情况下深入了解如何才能负责任地种植、管理和投资天然橡胶。我们感谢野生动物保护协会和许多其他非政府组织，以及泰国和柬埔寨国家橡胶研究院在2015和2016年的实地调查中给予的支持。

We have two country reports, one on natural rubber in Thailand and one on natural rubber and land conflict in Cambodia. A report supporting this online map tool is being finalized, and we will share the link as soon as it is ready.

我们有两份关于泰国天然橡胶和柬埔寨天然橡胶和土地利用冲突的国家报告，一份支持这一在线地图工具的报告正在定稿，我们将很快分享这一链接。

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Technical sides 技术层面

The research and analysis were done in R, and you can find my code here.

The visualization is coded purely in R too. Isn’t R such an awesome language? You can see my code for the visualization here.

研究和分析是利用R完成的，您可以在这里找到我的代码。

可视化地图也是在R中利用纯编码编写的，难道R不是一个很棒的语言吗？你可以在这里看到我的可视化代码。

To render multi-polygon GeoJSON quickly, simplify it first with:

library(rmapshaper)
county_json_simplified <- ms_simplify(<your geojson file>)

My original GeoJSON for 4000+ counties weighed about 100 MB, but this code helped reduce it to 5 MB, and it renders much faster on Rpubs.com.

我原来的GeoJSON 4000 +县级文件大小约100兆，但是这行代码有效的使文件降低到5兆。
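
Simplification shrinks a polygon file by dropping vertices that contribute little to each shape. As a rough illustration of the idea (in Python, not the R call used for the map), the sketch below implements a naive version of Visvalingam-Whyatt simplification, one common algorithm for this kind of vertex removal. Whether it matches what ms_simplify does on a given file depends on the options used. The function names and the sample line are hypothetical, and real tools also preserve shared borders between polygons, which this sketch does not.

```python
def triangle_area(a, b, c):
    """Area of the triangle formed by three (x, y) points."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def simplify_line(points, keep=0.05):
    """Repeatedly drop the interior vertex whose triangle with its two
    neighbours has the smallest area, until about `keep` of the vertices
    remain. The two end points are always kept."""
    pts = list(points)
    target = max(2, int(round(len(pts) * keep)))
    while len(pts) > target:
        areas = [triangle_area(pts[i - 1], pts[i], pts[i + 1])
                 for i in range(1, len(pts) - 1)]
        del pts[areas.index(min(areas)) + 1]
    return pts

# A nearly flat line with one spike: the spike survives simplification
line = [(0, 0), (1, 0), (2, 5), (3, 0), (4, 0)]
print(simplify_line(line, keep=0.6))
```

This version recomputes every area on each pass, which is fine for a demonstration but too slow for a 100 MB file; production tools use a priority queue instead.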

I learnt a lot from this blog on manipulating GeoJSON with R and another blog on using flexdashboard in R for visualization. Having open-source tools and general support from R users is great.

我从这个使用R的博客上和另一个博客的可视化学到了很多东西。开放性平台和R给予大家更大的创作空间。

Install Spark (or PySpark) on your computer

On July 24, 2017 / July 26, 2017, by Zhuangfang Yi. In Big Data, Data Science, Pyspark, Python, Spark.

Spark is a platform/environment that lets us stream and process big data in parallel much faster. There are tons of resources you could read to learn more about Spark, so I will dive straight into the installation and some simple PySpark code that counts and sorts the words in a book. Basically, it finds the keywords, or most frequent words, of a book.

I want to use PySpark on my local machine running OS X. PySpark is a library that marries Python and Spark.

To install PySpark, you can just run ‘pip install pyspark’, but you have to install Java first. Go here to see the full details of the PySpark installation.

After the pip install, I ran into an error that said “No Java runtime present, requesting install.” If you encounter the same error, you could refer to this Stack Overflow post. I basically ran export JAVA_HOME="/Library/Internet Plug-Ins/JavaAppletPlugin.plugin/Contents/Home" in my Mac terminal. That solved the error, and I was able to run Spark on my computer.


import re
from pyspark import SparkConf, SparkContext

def normalizeWords(text):
    # Lowercase the line and split it on runs of non-word characters
    return re.compile(r'\W+', re.UNICODE).split(text.lower())

conf = SparkConf().setMaster("local").setAppName("WordCount")
sc = SparkContext(conf=conf)

# Named `lines` rather than `input` so the Python built-in is not shadowed
lines = sc.textFile("book.txt")
words = lines.flatMap(normalizeWords)

# Count each word, then flip to (count, word) pairs so sortByKey orders by count
wordCounts = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
wordCountsSorted = wordCounts.map(lambda x: (x[1], x[0])).sortByKey()
results = wordCountsSorted.collect()

for result in results:
    count = str(result[0])
    # Drop non-ASCII characters and skip words that end up empty
    word = result[1].encode('ascii', 'ignore')
    if word:
        print(word.decode() + ":\t\t" + count)

The re library is Python's regular expression module, used here for simple text processing. As alternatives, you could use spaCy or NLTK instead of (or together with) re.

If you want to learn more PySpark, I recommend Frank Kane, who has an excellent online course on Spark.
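
To see what the Spark job computes without starting Spark, the same count-and-sort logic can be sketched in plain Python with collections.Counter. This is only an illustration for small inputs; the function names and the two sample lines are made up, and unlike the Spark version it filters out empty strings right after splitting rather than at print time.

```python
import re
from collections import Counter

def normalize_words(text):
    """Lowercase a line and split it on runs of non-word characters."""
    return [w for w in re.split(r"\W+", text.lower()) if w]

def count_words(lines):
    """Return (count, word) pairs sorted by ascending count, like sortByKey."""
    counts = Counter()
    for line in lines:
        counts.update(normalize_words(line))
    return sorted((n, w) for w, n in counts.items())

sample = ["The cat and the hat.", "The end"]
for n, word in count_words(sample):
    print(f"{word}:\t\t{n}")
```

The most frequent word ends up last, which mirrors the ascending order produced by sortByKey in the Spark code.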

A time series Stock API development with Python Bokeh and Flask to Heroku

On June 18, 2017 / June 25, 2017, by Zhuangfang Yi. In APP development, Big Data, Data mining, Data Science, Python, The Data Incubator, Time Series.

My final API looks like this:

Stock_APP_V2

You could search the stock here on my API link: http://zhuangfangyistockapp.herokuapp.com/index

If you’re interested in looking for more ticker symbols for company stock, you could go here.

For example, to look up Barnes Group, search with the ticker symbol “B” rather than the name “Barnes”. The symbol has to be entered in upper case, as in the following table:

E1E1DF8F-A686-49F3-9FE3-D768E0024A4C

It’s not the most beautiful or amazing app, but hours of coding in Python made me appreciate how much work goes into a platform like Ameritrade, and how amazing it is. Making an online data visualization tool is not an easy job, especially when you want to render data from other sites or databases.

To be honest, I could have made a better-looking search engine more efficiently with Shiny in R. But this API is my milestone project with The Data Incubator (done even before the program started on Jun. 19, 2017), and we are only allowed to use Flask, Bokeh, and Jinja with Python, and to deploy to Heroku. So here we go: these are notes that may help you, or remind me later when I need to develop another API using Python.

First, go to Quandl.com to register an API key, since the API will render data from Quandl.

Second, know how to request data from Quandl.com. You could 1) use the requests library or simplejson to request a JSON dataset from Quandl, or 2) use the quandl Python library. I requested data using the quandl library because it is much easier to use.

Third, develop a Flask app that plots the dataset for the user’s ticker input. See the following Flask skeleton:
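
For the first route (requesting JSON directly), the request boils down to cleaning up the user's ticker and building a URL. The sketch below is an assumption-laden illustration rather than the app's actual code: the URL layout follows the Quandl v3 REST pattern as I understand it, the WIKI database code and the YOUR_API_KEY placeholder are assumptions, and both helper names are hypothetical. It only builds the URL and does not call the network.

```python
from urllib.parse import urlencode

def normalize_ticker(raw):
    """Trim and upper-case a user-entered ticker symbol."""
    ticker = raw.strip().upper()
    if not ticker:
        raise ValueError("empty ticker symbol")
    return ticker

def quandl_json_url(ticker, api_key, database="WIKI"):
    """Build a dataset URL in the assumed Quandl v3 REST layout."""
    query = urlencode({"api_key": api_key})
    return f"https://www.quandl.com/api/v3/datasets/{database}/{ticker}.json?{query}"

# " b " becomes "B", the symbol for Barnes Group
print(quandl_json_url(normalize_ticker(" b "), "YOUR_API_KEY"))
```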


from flask import Flask, render_template, request, redirect
import quandl as Qd
import pandas as pd
import numpy as np
import os
import time
from bokeh.io import curdoc
from bokeh.layouts import row, column, gridplot
from bokeh.models import ColumnDataSource
from bokeh.models.widgets import PreText, Select
from bokeh.plotting import figure, show, output_file
from bokeh.embed import components, file_html
from os.path import dirname, join

app = Flask(__name__)
app.vars = {}

### Load data from Quandl
# Here define your dataframe (mydata), the column to plot
# (current_feature_name), and create_figure(), which returns a Bokeh figure.

@app.route("/plot", methods=['GET', 'POST'])
def plot():
    # Load the dataframe and plot it
    plot = create_figure(mydata, current_feature_name)
    script, div = components(plot)
    return render_template('Plot.html', script=script, div=div)

@app.route('/', methods=['GET', 'POST'])
def main():
    return redirect('/plot')

if __name__ == "__main__":
    app.run(port=33508, debug=True)

Fourth, make your Flask app work on your local computer; it should look exactly like the app shown above before you deploy it to Heroku. My local directory and files are organized this way:

5F853E2A-DC8A-47F0-8FD1-6CE5D8FAE297

app.py is the main Python code; it fetches data from Quandl, plots it with Bokeh, and binds everything together with the Flask framework for deployment to Heroku.

Fifth, commit everything above to a Git repository and push it to Heroku, using the Git and Heroku command lines:

git init
git add .
git commit -m 'initial commit'
heroku login
heroku create   # optionally followed by the name of your app
git push heroku master

Last but not least, in case you want to edit your Python code or other files and update your Heroku app, you could do:

###update heroku app from github
heroku login
heroku git:clone -a <your app name>
cd <your app name>
#make changes here and then follow next step to push the changes to heroku
git remote add <your git repository name> https://github.com/<your git username>/<your git repository name>
git fetch <your git repository name> master
git reset --hard <your git repository name>/master
git push heroku master --force

Yeah ~ I will be with The Data Incubator (an awesome data science fellowship program) this summer

On May 26, 2017 / July 25, 2017, by Zhuangfang Yi. In Big Data, Data mining, Data Science, The Data Incubator.

Two weeks ago, I found out I was ranked in the top 2% of all applicants and was selected to join the Data Science Fellowship Program at The Data Incubator (TDI). I was so thrilled. I first applied around August last year, made it only to the semifinalist stage, and did not get a chance to go further. I reapplied around April this year and found out I was a semifinalist again right before Ben and I flew to South Africa to meet our good friends for a rock climbing trip.

Let me give you a bit of info about the TDI data science fellowship program first. It is “an intensive eight-week bootcamp that prepares the best science and engineering PhDs and Masters to work as data scientists and quants. It identifies Fellows who already have the 90% difficult-to-learn skills and equips them with the last 10%”. Applicants go through three rounds of selection. You apply through their website (here), and TDI identifies the qualified semifinalists. All semifinalists then take a test of computer programming, math & statistics, and modeling skills. At this stage, TDI identifies finalists based on the semifinalists’ programming and problem-solving skills on real-world problems. As a finalist, you are interviewed together with other finalists on your data science communication skills, and the TDI team decides whether you get into the program a week after the interview. About 25% of applicants (~2000 applicants) are selected as semifinalists and 3% are selected as fellows and scholars. See the figure I made below (it reflects only my best knowledge of the program).

Fellowship Program

Back to my story ;-). We were actually at Rockland, South Africa, about to start our exciting bouldering journey, so I was pretty disappointed about giving up 2 or 3 of the 8 days of our vacation for the programming and problem-solving test. On top of that, I had to propose and build an independent data science project. I thought about postponing or canceling my semifinalist opportunity and just enjoying the vacation, because our wifi was so spotty in rural South Africa anyway. But I’m glad I did not give up. It literally took me 7 or 8 hours in our guest house to download a 220M dataset from TDI for the test. I had planned to use my Amazon cloud computer for my independent project, but the internet wasn’t very helpful.

201607011610324f7c3

I basically used the wifi and uploaded my files and answers only while everyone else had left the guest house to go rock climbing, and the best spot for wifi was in our bathroom, lol~~~ Uploading a 15M file took me about four hours, with multiple failures. LOL…

Luckily, things worked out, and I can’t wait to join TDI’s summer fellow cohort. I’m super excited about learning more advanced machine learning and distributed computing (Spark, Hadoop, and MapReduce) alongside the other smart, data-minded fellows.

Wish me luck!!!

Here are some pictures of Ben, Pete, me, and our other friends rock climbing. Let’s rock through our 2017.

34474051975_eb809fe331_b 33631141504_e7edb32d51_b 34438773036_e7f356cda5_h 34560732195_b45c19f388_b 34349560771_ef4c215ecd_h

Photo Credits: Ben ;-).

34427308326_d2defdbe10_k 34430489451_ea2b16dc2d_k 18194177_10210045188829569_4652567858509764791_n

18268270_10210088323427907_5126716707500558209_n

Pete got me(the tiny green bug on the rock ;-)) climbing up a wall at Cape Town local climb.

This was basically our best vacation so far, and I am glad I made it through the TDI test and was able to enjoy the climbing afterward. Our friends Pete and Corlie arranged the whole trip, and we’re glad we made it all the way to amazingly beautiful South Africa.

Artificial intelligence on urban tree species identification 人工智能在市区树种识别上的应用

On May 12, 2017, by Zhuangfang Yi. In AWS, Big Data, Data Science, Deep Learning, Ecological Economics, Environmental studies, Imagery analysis, Machine Learning, Neural Network, Python, QGIS, Satellite Imagery Processing.

No matter which part of the world you live in, very diverse tree species are planted in the urban areas around us. Urban trees serve many functions: they provide habitat for wildlife, clean the air and water, bring significant health and social benefits, and improve property values. Imagine waking up on a beautiful morning to birds singing outside your apartment because many beautiful trees grow outside your window. How awesome is that!

However, tree planting, surveying, and species identification require an enormous amount of work, literally generations and years of input and care. What if we could identify tree species from satellite imagery? How much faster and how well could we identify tree species and record their geolocations?

A city has its own tree selection and planting plan, but homeowners have their own tree preferences, which makes the identification work a bit complicated.

chicagoTrees

(Photo from Google Earth Pro June 2010 in Chicago area)

It’s hard to tell how many tree species are planted in the image above. But we can (zoom in and) see that these trees have slightly different crown shapes, colors, and textures. From here, I only need a valid dataset that tells me which tree I am looking at, namely a tree survey with tree geolocation records from the city. With that, I can teach a computer to pick out similar features for the species I’m interested in identifying.

GreeAsh

These are Green Ash trees (I marked as green dots here).

These are Littleleaf Linden, they are marked as orange dots.

Let me run a deep learning model in Caffe (a neural network, which is a type of artificial intelligence model) for image classification on these two species, and see whether the computer can separate them using my training and test datasets.

The great news is that the model can actually tell the two species apart. I ran the model for 300 epochs (runs), with the learning rate going from 0.01 to 0.001, on about 200 images of the two species; 75% of the images went to training the model and 25% to testing. The result is not bad: around 90% accuracy (orange line) and a loss of less than 0.1 on the training dataset.
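
The 75/25 split described above can be sketched in a few lines of plain Python. This is not the code used for the experiment; the file names are hypothetical stand-ins for the roughly 200 image chips, and the fixed seed is only there to make the shuffle reproducible.

```python
import random

def split_train_test(items, train_fraction=0.75, seed=0):
    """Shuffle items reproducibly and split them into train and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical file names standing in for the ~200 image chips
images = [f"tree_{i:03d}.png" for i in range(200)]
train, test = split_train_test(images)
print(len(train), len(test))
```

Shuffling before cutting matters here: if the images were sorted by species, a plain slice would put one species almost entirely in the test set.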

nvidia_d_modeltest

I threw a random test image at the model (a green ash screenshot in this case), and it returned its prediction.

test_trees2

Next time, I will work on identifying 20 other tree species and their geolocations.

Let’s find out what trees are planted in the Chicago area and how they relate to property value (an interesting question to ask), and also what ecological benefits and functions these trees provide (I’ll leave that to urban ecologists, if my cloud computer can identify the species). Check my future work ;-).

Can artificial intelligence help us identify wildfire damage from satellite imagery faster? 我们能否借助人工智能算法快速地从卫星影响中定位火灾损毁地点和损毁程度？

On April 18, 2017 / April 19, 2017, by Zhuangfang Yi. In Big Data, Data Science, Data visualization, Deep Learning, Machine Learning, Neural Network, Python, Spatial Analysis.

The following work was done by me and Dr. Shay Strong, while I was a data engineer consultant under the supervision of Dr. Strong at OmniEarth Inc. All the work IP rights belong to OmniEarth. Dr Strong is the Chief Data Scientist at OmniEarth Inc.

以下要介绍的工作是我在OmniEarth公司做数据工程师的时候和Shay Strong博士共同完成的工作。工作的知识产权归OmniEarth公司所有，我的老板Shay Strong博士是OmniEarth公司的数据科学家团队的领头人。

A wildfire had been burning in the Great Smoky Mountains of Tennessee and raced rapidly northward toward Gatlinburg and Pigeon Forge between late Nov. and Dec. 2nd, 2016. At least 2000 buildings were damaged or destroyed across 14,000 acres of residential and recreational land, while the wildfire also claimed 14 lives and injured 134. It was the largest natural disaster in the history of Tennessee.

2016年11月到12月田纳西州的大烟山国家公园森林（Great Smoky Mountains）大火，随后火势蔓延至北部的两个地区Gatlinburg 和Pigeon Forge。据报道大火损毁2000多栋包括民宅和旅游区建筑物，损毁面积达到1万4千英亩，火灾致使14人死亡134人受伤。被认为是田纳西州历史上最大的自然灾害。

After obtaining 0.4 m resolution satellite imagery of the wildfire damage in Gatlinburg and Pigeon Forge from DigitalGlobe, OmniEarth Inc created an artificial intelligence (AI) model that was able to assess and identify the property damage due to the wildfire. This AI model will also be able to evaluate and identify damaged areas more rapidly after similar natural disasters in the future.

从Digital Global获得大约为0.4米分辨率的高分辨率遥感图像（覆盖了火灾发生的Gatlinburg 和Pigeon之后）我们建立了人工智能模型。该人工智能模型可以帮助我们快速定位和评估火宅受灾面积和损毁程度。我们希望该模型未来可帮助消防人员快速定位火灾险情和火灾受损面积。

The fire damage area was identified by the model on top of the satellite images.

该地图链接是我们的人工智能模型生成的火灾受损地区热图在卫星地图上的样子：http://a.omniearth.net/wf/。

2017-01-26 22.15.10.gif

Fig 1. The final result of fire damage range in TN from our AI model. 该图是通过人工智能模型生成的火灾受灾范围图。

1. Artificial intelligence model behind the wildfire damage 火灾模型背后的人工智能

With assistance from increasing cloud computing power and a better understanding of computer vision, more and more AI technology is helping us detect information from trillions of photos we produce daily. 计算机图像识别和云计算能力的提升，使得我们能够借助人工智能模型获取数以万计甚至亿计的照片地图等图片中获取有用的信息。

Before diving into the AI model behind the wildfire damage map, note that in this case we only want to tell fire-damaged buildings from intact buildings. We have two options: (1) spend hours and hours browsing through the satellite images and manually separating the damaged and intact buildings, or (2) develop an AI model that automatically identifies the damaged area with a tolerable error. With the first option, it would easily take a geospatial technician more than 20 hours to identify the damaged area within the 50,000 acres of satellite imagery. The second option is more viable and sustainable, in that the AI model can automatically identify the damaged areas and buildings over the same area in less than 1 hour. This is accomplished by image classification, specifically with convolutional neural networks (CNNs), because CNNs work better than other neural network algorithms for object detection and recognition in images.

在深入了解人工智能如何工作之前，在解决火灾受灾面积和受损程度这个问题上，其实我们要回答的问题只有一个那就是如何在图像上区分被烧毁的房屋和没有被烧毁的房屋之间的区别。要回答这个问题，我们可以做：（1）花很长的时间手动从卫星影像中用人眼分辨受损房屋的位置；（2）建一个人工智能模型来快速定位受损房屋的位置。现在我们通常的选择是第一种，但是在解决田纳西那么多房屋损毁的卫星影像上，我们至少需要一个熟悉地理信息系统和遥感图像的技术人员连续工作至少20个小时以上才能覆盖火灾发生地区覆盖大约5万英亩大小的遥感图像。相反，如果使用人工智能模型，对于同样大小区域范围的计算，模型运行到出结果只需要少于1小时的时间。这个人工智能模型具体来说用的是卷积神经网络算法，属于图像分类和图像识别范畴。

Omniearth_satellite

Fig 2. Our AI model workflow. 我们的人工智能模型框架。

Artificial intelligence/neural networks are a family of machine learning models inspired by the biological neurons of the human brain. They were first conceived in the 1960s, but the first breakthrough was Geoffrey Hinton’s work published in the mid-2000s. While our eyes work like a camera seeing the ‘picture,’ our brain processes it and constructs the objects we see from their shape, color, and texture. The information of “seeing” and “recognition” passes through our biological neurons from our eyes to our brain. The AI model we created works in a similar way. The imagery is passed through the artificial neural network, and objects that the network has been taught are identified with a certain accuracy. In this case, we taught the network the difference between burnt and not-burnt structures in Gatlinburg and Pigeon Forge, TN.

2. How did we build the AI model

We broke down the wildfire damage mapping process into four parts (Fig 1). First, we obtained 0.4 m resolution satellite images from Digital Globe (https://www.digitalglobe.com/). We then created a dataset of 300 small image chips (as shown in Fig 3, A and B) containing both burnt and intact buildings; 2/3 of the chips were used to train the AI model (a CNN in this case) and 1/3 were held out to test it. Ideally, more training data would be used so that the network can learn all the variations and orientations of burnt and non-burnt structures. A sample of 300 is statistically small, but it is useful for testing capability and evaluating preliminary performance.
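The 2/3 and 1/3 split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the chip file names, the `split_chips` helper, and the fixed random seed are hypothetical and are not taken from the original workflow.

```python
import random


def split_chips(chip_ids, train_fraction=2 / 3, seed=42):
    """Shuffle chip identifiers and split them into train and test lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(chip_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical chip names; the real file names are not given in the post.
chips = [f"chip_{i:03d}.png" for i in range(300)]
train_chips, test_chips = split_chips(chips)
print(len(train_chips), len(test_chips))
```

Shuffling before splitting keeps burnt and intact chips mixed in both sets, which matters when the chips were collected in some spatial order.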


Fig 3(A). A burnt building. Fig 3(B). Intact buildings.

Our AI model was a CNN built on Theano with a GPU backend (http://deeplearning.net/software/theano/). Theano was created by the machine learning group at the University of Montreal, led by Yoshua Bengio, one of the pioneers of artificial neural networks. Theano is a Python library that lets you define and evaluate mathematical expressions involving vectors and matrices. As an analogy, our daily decision-making is also based on matrices of perceived information, e.g. when choosing which car to buy. The AI model helps us identify which image pixels and patterns differ fundamentally between burnt and intact buildings, much as people give different weights or scores to the brand, model, and color of the car they want. Computers are great at matrix calculations, and Theano takes this to the next level by running many matrix operations in parallel, which speeds up the whole calculation tremendously. Theano has no particular neural network built in, so we used Keras on top of Theano. Keras lets us define the layers of a neural network with a minimalist design and train it efficiently.
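To make the idea of "weighting pixels and patterns" concrete, here is a minimal pure-Python sketch of the sliding-window operation that a convolutional layer performs. It is not the model used in this post: the toy 4x4 chip, the hand-picked edge kernel, and the `conv2d_valid` helper are all assumptions for illustration, and a real CNN in Keras learns its kernel weights from the training chips.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists) and return the filtered map.

    This is the cross-correlation used by CNN layers, with no padding.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 "image chip": bright left half, dark right half.
chip = [
    [9, 9, 0, 0],
    [9, 9, 0, 0],
    [9, 9, 0, 0],
    [9, 9, 0, 0],
]
# A hand-picked vertical-edge kernel; a trained CNN learns such weights itself.
edge_kernel = [
    [1, -1],
    [1, -1],
]
feature_map = conv2d_valid(chip, edge_kernel)
for row in feature_map:
    print(row)
```

The feature map is large only where the bright and dark halves meet, which is the sense in which a kernel "scores" a pattern. Libraries such as Theano run many of these multiply-and-sum operations in parallel.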

Our AI model was run on AWS EC2 with a g2.2xlarge instance type. We set the learning rate (lr) to 0.01. A smaller learning rate forces the network to learn more slowly but may also lead to better classification convergence, especially in cluttered scenes where a large amount of object confusion can occur. In the end, our AI model reached 97% accuracy with less than 0.3 loss over three runs completed within a minute, and it took less than 20 minutes to run on our 3.2 GB of satellite images.
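The trade-off behind the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on f(w) = w**2 and counts the steps needed to get close to the minimum. The function, the starting point, and the tolerance are assumptions chosen for illustration; this is not the optimizer or the loss used for the CNN above.

```python
def gradient_descent_steps(learning_rate, start=10.0, tolerance=1e-3, max_steps=10_000):
    """Count the steps needed to minimise f(w) = w**2 starting from `start`."""
    w = start
    for step in range(1, max_steps + 1):
        gradient = 2.0 * w  # derivative of w**2
        w -= learning_rate * gradient
        if abs(w) < tolerance:
            return step
    return max_steps  # did not converge within the step budget


steps_fast = gradient_descent_steps(learning_rate=0.1)
steps_slow = gradient_descent_steps(learning_rate=0.01)
print(steps_fast, steps_slow)
```

On this smooth toy function the smaller rate simply needs more steps. On a real, noisy loss surface the smaller steps are also less likely to overshoot a good solution, which is the motivation given above for choosing 0.01.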

The model results were exported and visualized using QGIS (http://www.qgis.org/en/site/), an open source geographic information system that lets you create, edit, visualize, analyze, and publish geospatial information and maps. We also inspected the map by comparing our fire damage results with the briefing maps produced by Wildfire Today (https://inciweb.nwcg.gov/incident/article/5113/34885/) and the Incident Information System (https://inciweb.nwcg.gov/incident/article/5113/34885/).

Fig 4(A). OmniEarth parcel-level layout of burnt and intact buildings on top of the imagery.

Fig 4(B). The burn impact (red) on top of the Great Smoky Mountains from late Nov. to early Dec. 2016.

Satellite image classification is a challenging problem at the crossroads of remote sensing, computer vision, and machine learning. Many currently available classification approaches are not suited to high-resolution imagery, with its inherently high variability in geometry and collection times. OmniEarth is a startup company that is passionate about the business of science and about scaling quantifiable solutions to meet the world’s growing need for actionable information.

Contact OmniEarth for more information:

Dr. Zhuangfang Yi, email: geospatialanalystyi@gmail.com; twitter: geonanayi.

Dr. Shay Strong, email: shay.strong@omniearthinc.com; twitter: shaybstrong.

Start your own Amazon Web Services instance for deep learning

On March 30, 2017 / August 24, 2017, by Zhuangfang Yi, in AWS, Data Science, Deep Learning, Machine Learning, Neural Network, Uncategorized

I am back to blogging after a while!

I’ve been working on image classification and segmentation quite a lot recently, and I am totally in love with GPU-based big data processing: it is incredibly fast. If you need to process data at the gigabyte (GB) level, definitely look into starting a GPU instance on AWS from a deep learning AMI.

It is not free, though. You can start with the AWS free tier, but I normally use their g or p machines. For example, if I use a g2.2xlarge, I am charged about $0.65 per hour. For more information, go here. AWS charges by how much you use, so if you are new to deep learning and just want to run a few case studies, I think it is a better deal than building your own GPU machine or buying a new PC with a high-end GPU.
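Because billing is by usage, a quick back-of-the-envelope estimate helps before launching anything. The sketch below uses the approximate $0.65 per hour quoted above; the `estimate_cost` helper and the usage scenarios are hypothetical, and actual AWS prices vary by region and change over time.

```python
def estimate_cost(hours, hourly_rate_usd=0.65):
    """Return the estimated on-demand charge in US dollars."""
    return round(hours * hourly_rate_usd, 2)


ten_hours = estimate_cost(10)             # a few evenings of experiments
month_always_on = estimate_cost(24 * 30)  # forgetting to stop the instance
print(ten_hours, month_always_on)
```

The gap between the two numbers is the practical lesson: stop the instance when you are not using it.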


Before launching an AWS deep learning instance, you should definitely do some research on:

What do you want to do with the AWS machine? If you only want to learn some basic machine learning that involves processing megabyte (MB) level csv/txt data files, you can just use your personal computer. Personal computers are fast enough these days.
As I mentioned above, an instance is worth launching if you want to process images or data at the hundreds-of-megabytes or gigabyte level that your personal computer cannot handle. Think about how much you want to spend on the data processing. Again, evaluate your situation and needs, and do some research.

My needs for this personal AWS EC2 machine are:

Processing big data sets for neural network image classification and segmentation;
A machine that has TensorFlow, Theano, Torch, Keras, and Caffe installed. TensorFlow, Theano, Torch, and Caffe are deep learning frameworks. Keras is the Python module that I use to build deep learning model architectures.

If you are thinking about doing the same thing, this is a great blog for starting your own AWS AMI instance here, or this one. They both have explicit step-by-step instructions on how to start the instance.

A second option is to launch an AWS AMI with a Jupyter Notebook server without going through the AWS web console, by running the following commands in your terminal:

# create security group
aws ec2 create-security-group --group-name JupyterSecurityGroup --description "My Jupyter security group"

# add security group rules
aws ec2 authorize-security-group-ingress --group-name JupyterSecurityGroup --protocol tcp --port 8888 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name JupyterSecurityGroup --protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name JupyterSecurityGroup --protocol tcp --port 443 --cidr 0.0.0.0/0

# launch instance
aws ec2 run-instances --image-id ami-41570b32 --count 1 --instance-type p2.xlarge --key-name <YOUR_KEY_NAME> --security-groups JupyterSecurityGroup
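
The three ingress rules above differ only in the port number, so they can be generated rather than typed by hand. The following is a small sketch, not part of the original instructions: the `ingress_commands` helper is hypothetical, it only builds the command lines from the flags already shown above, and it does not call AWS itself.

```python
import shlex


def ingress_commands(group_name, ports, cidr="0.0.0.0/0"):
    """Build one `authorize-security-group-ingress` argv list per port."""
    return [
        [
            "aws", "ec2", "authorize-security-group-ingress",
            "--group-name", group_name,
            "--protocol", "tcp",
            "--port", str(port),
            "--cidr", cidr,
        ]
        for port in ports
    ]


commands = ingress_commands("JupyterSecurityGroup", [8888, 22, 443])
for argv in commands:
    # Print a copy-pasteable line; pass `argv` to subprocess.run to execute it.
    print(shlex.join(argv))
```

Note that `0.0.0.0/0` opens each port to any address. Passing a narrower `cidr`, such as your own IP address followed by `/32`, is a safer choice if you do not need access from everywhere.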

The next thing would be to configure your Jupyter Notebook Server:

jupyter notebook --generate-config
key=$(python -c "from notebook.auth import passwd; print(passwd())")

cd ~
mkdir certs
cd certs
certdir=$(pwd)
openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.key -out mycert.pem

cd ~
sed -i "1 a\
c = get_config()\\
c.NotebookApp.certfile = u'$certdir/mycert.pem'\\
c.NotebookApp.keyfile = u'$certdir/mycert.key'\\
c.NotebookApp.ip = '*'\\
c.NotebookApp.open_browser = False\\
c.NotebookApp.password = u'$key'\\
c.NotebookApp.port = 8888" .jupyter/jupyter_notebook_config.py
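
If the quoting in the sed command is hard to follow, it may help to see the lines it is meant to add to jupyter_notebook_config.py. The sketch below assembles the same settings in Python. The `jupyter_config_lines` helper, the example certificate directory, and the placeholder hash are assumptions for illustration; substitute the values of `$certdir` and `$key` from the commands above.

```python
def jupyter_config_lines(cert_dir, password_hash, port=8888):
    """Return the settings the sed command writes, as a list of Python lines."""
    return [
        "c = get_config()",
        f"c.NotebookApp.certfile = u'{cert_dir}/mycert.pem'",
        f"c.NotebookApp.keyfile = u'{cert_dir}/mycert.key'",
        "c.NotebookApp.ip = '*'",
        "c.NotebookApp.open_browser = False",
        f"c.NotebookApp.password = u'{password_hash}'",
        f"c.NotebookApp.port = {port}",
    ]


# Placeholder values: substitute your own certificate directory and hash.
lines = jupyter_config_lines("/home/ec2-user/certs", "sha1:<your-hash>")
config_text = "\n".join(lines) + "\n"
print(config_text)
```

Comparing this output with the top of your config file after running sed is a quick way to check that the quotes and variables were expanded as intended.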

These commands create the certificate for the Jupyter Notebook server on your AWS instance. After running them successfully, you can start the server and test whether your Jupyter Notebook works:

screen -S jupyter
mkdir notebook
cd notebook
jupyter notebook

For more details, see this blog.

If you want to use an Ubuntu AMI instead of the Amazon AMI, here is another good blog on setting up the Jupyter Notebook server on the machine:

https://chrisalbon.com/jupyter/run_project_jupyter_on_amazon_ec2.html

Geeky Mappy Geoyi

Category: Data Science