Label Maker: Generate Machine Learning Training Datasets from Satellite Imagery with Four Commands

This post was originally written in Chinese. If you have not seen the English blog post yet, go here: https://developmentseed.org/blog/2018/01/11/label-maker/

Label Maker is a Python library that helps extract insight from satellite imagery. Label Maker creates machine-learning-ready training data for the most popular ML frameworks, including Keras, TensorFlow, and MXNet. It pulls data from OpenStreetMap and combines it with imagery sources like Mapbox or Digital Globe to create a single file for use in training machine learning algorithms.

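Introduction:

Label Maker is an open-source Python package we recently developed to help people gain deeper insight from satellite imagery. It generates training data that plugs into whichever machine learning (or deep learning) framework you prefer, such as the currently popular TensorFlow from Google or MXNet, and it works with Keras without any trouble. The package pulls data from OpenStreetMap and from Mapbox or Digital Globe to generate training datasets. If you write code that connects it to other satellite imagery sources, we very much welcome changes and contributions to our GitHub repo. Also, if you want to learn how to do object detection or image classification, we have prepared a variety of examples; please use them and leave us feedback.

Here is the main post!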

ob_tf_result_fig1

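Applications of machine learning and deep learning algorithms in computer vision are advancing rapidly. Traditional satellite image interpretation is fast and convenient; for example, you can do it in ERDAS or ArcGIS. But these traditional methods have a limitation: once your imagery has a higher resolution and the images get larger, these applications and your desktop computer usually cannot cope. How can this be solved quickly and effectively? That is the question I will answer today: how to use modern GPUs and machine learning to process and interpret satellite imagery at scale.

First, a quick primer. Deep learning in computer vision today can be roughly divided into three categories: supervised learning, unsupervised learning, and reinforcement learning. Traditional satellite image interpretation also has supervised and unsupervised methods. In supervised learning, you tell the image interpretation software what rivers, oceans, and forests look like, and the software then computes and classifies according to the thresholds you give it. In unsupervised learning, you tell the software nothing, and it classifies the given satellite image on its own. For example, rivers and oceans look different in the red, green, and blue bands, so the software can separate the two classes by their different band thresholds.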
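
To make the unsupervised idea concrete, here is a minimal sketch that splits pixel values from a single band into two groups without any labels, using a tiny one-dimensional k-means. The sample values and the function name `two_means` are made up for illustration; real software works on full multi-band rasters.

```python
def two_means(values, iterations=20):
    """Split 1-D values into two clusters with a minimal k-means (k=2)."""
    low, high = min(values), max(values)  # initial cluster centers
    for _ in range(iterations):
        # Assign each value to the nearer of the two centers.
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        if not group_high:
            break  # all values are identical, nothing to separate
        # Move each center to the mean of its group.
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    threshold = (low + high) / 2
    return threshold, group_low, group_high


# Hypothetical blue-band values for two kinds of water pixels.
blue_band = [31, 35, 29, 33, 120, 118, 125, 122]
threshold, river, ocean = two_means(blue_band)
print(threshold, river, ocean)
```

The threshold that falls out of the clustering plays the role of the band threshold described above; a supervised method would instead derive it from samples you labeled yourself.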

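Deep learning can also do supervised and unsupervised learning. As I just said, traditional software already exists, so why use deep learning? Is it only because everyone is crazy about it lately? No, no, no...

Besides speeding up computation through heavy use of GPUs, deep learning on satellite imagery only needs to be trained once; after that, the trained model weights can be reused again and again on unseen areas. In general, the more rounds and the longer you train, the better the model performs. You can look at the road network we interpreted with machine learning at this link. Road networks are the hardest thing to interpret in satellite imagery; I won't tell you why just yet, so have a guess and I will reveal the answer next time. We have also built many related examples of deep learning on satellite imagery, such as finding buildings with TensorFlow object detection, a classification model built with MXNet and Amazon SageMaker, and another classification model built with Keras on Amazon cloud machines.

Enough chatter. At the pace deep learning is developing, building new algorithms is actually not that hard. The hard part is preparing training datasets that machine learning and deep learning can use.

Today I am pleased to introduce our Python package, Label Maker. Label Maker is open source, so feel free to star and fork it on GitHub; we also encourage everyone to contribute. Label Maker fetches satellite imagery from Mapbox and vector data from OpenStreetMap (roads, buildings, forests, and so on), then packages them into training data. You can plug this data into your favorite deep learning or machine learning framework. Label Maker needs only four commands to generate a training dataset for you.

After pip install label_maker, just run these four commands.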

label-maker download         # download OpenStreetMap QA Tiles
label-maker labels           # create your ground-truth labels
label-maker images           # download satellite imagery tiles
label-maker package          # package tiles and labels into data.npz
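
If you prefer to drive the four steps from a script, the sketch below runs them in the order listed above and stops at the first failure. It assumes `label-maker` is on your PATH and that your configuration file is where the tool expects it; the function name `run_label_maker` and the `dry_run` switch are my own additions for illustration.

```python
import subprocess

# The four subcommands from the post, in the order they are listed.
STEPS = ["download", "labels", "images", "package"]


def run_label_maker(dry_run=True):
    """Run each label-maker step in order; stop at the first failure."""
    commands = [["label-maker", step] for step in STEPS]
    for command in commands:
        if dry_run:
            print("would run:", " ".join(command))
        else:
            # check=True raises CalledProcessError if a step fails,
            # so later steps never run on incomplete data.
            subprocess.run(command, check=True)
    return commands


run_label_maker(dry_run=True)
```

Call `run_label_maker(dry_run=False)` once the dry run prints what you expect.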

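Of course, I skipped two small steps:

To download satellite imagery from Mapbox, you need a token for their imagery API, so go and register a Mapbox account.

Then, before using the four commands above, create a configuration file like this one: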

 

{
  "country": "vietnam",
  "bounding_box": [105.42,20.75,106.41,21.53],
  "zoom": 17,
  "classes": [
    { "name": "Buildings", "filter": ["has", "building"] }
  ],
  "imagery": "http://a.tiles.mapbox.com/v4/mapbox.satellite/{z}/{x}/{y}.jpg?access_token=ACCESS_TOKEN",
  "background_ratio": 1,
  "ml_type": "classification"
}
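
To get a feel for how much area this configuration covers, the sketch below counts the map tiles that the bounding box touches at the chosen zoom. It assumes the bounding box is ordered [west, south, east, north] and that tiles follow the standard Web Mercator XYZ grid, as the `{z}/{x}/{y}` pattern in the imagery URL suggests. The helper names are hypothetical, and the count is an upper bound for the area, not the number of tiles Label Maker will actually download.

```python
import math


def lonlat_to_tile(lon, lat, zoom):
    """Convert longitude/latitude to XYZ (slippy map) tile indices."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y


def count_tiles(bounding_box, zoom):
    """Count tiles touched by a [west, south, east, north] bounding box."""
    west, south, east, north = bounding_box
    x_min, y_min = lonlat_to_tile(west, north, zoom)  # top-left tile
    x_max, y_max = lonlat_to_tile(east, south, zoom)  # bottom-right tile
    return (x_max - x_min + 1) * (y_max - y_min + 1)


print(count_tiles([105.42, 20.75, 106.41, 21.53], 17))
```

At zoom 17 this box spans on the order of 100,000 tiles, which is why a smaller bounding box is a sensible choice for a first experiment.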

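Our Python package reads the parameters in the configuration file to generate the training dataset you need. Remember to replace ACCESS_TOKEN in the configuration file with the token you generated in your Mapbox account.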
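
One way to avoid pasting the token into the file by hand is to fill it in from an environment variable. This is only a sketch: the variable name `MAPBOX_ACCESS_TOKEN` and the helper `fill_token` are assumptions for illustration, not part of Label Maker.

```python
import json
import os

# The configuration from the post, with the placeholder token still in place.
config = {
    "country": "vietnam",
    "bounding_box": [105.42, 20.75, 106.41, 21.53],
    "zoom": 17,
    "classes": [{"name": "Buildings", "filter": ["has", "building"]}],
    "imagery": "http://a.tiles.mapbox.com/v4/mapbox.satellite/{z}/{x}/{y}.jpg?access_token=ACCESS_TOKEN",
    "background_ratio": 1,
    "ml_type": "classification",
}


def fill_token(config, token):
    """Return a copy of the config with ACCESS_TOKEN replaced in the imagery URL."""
    filled = dict(config)
    filled["imagery"] = config["imagery"].replace("ACCESS_TOKEN", token)
    return filled


# MAPBOX_ACCESS_TOKEN is a hypothetical environment variable name.
# If it is not set, the placeholder is left untouched.
token = os.environ.get("MAPBOX_ACCESS_TOKEN", "ACCESS_TOKEN")
print(json.dumps(fill_token(config, token), indent=2))
```

Save the printed JSON as your configuration file before running the four commands.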

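Once the four commands above have finished successfully, you will have data.npz and can run your favorite machine learning algorithm on it, for example: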

 

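import numpy as np
from keras.models import Sequential

# the data, shuffled and split between train and test sets
npz = np.load('data.npz')
x_train = npz['x_train']
y_train = npz['y_train']
x_test = npz['x_test']
y_test = npz['y_test']

# define your model here, example usage in Keras
model = Sequential()
# ... add your layers here
# choose the optimizer, loss, and metrics that fit your task
model.compile(...)

# train on the training tiles
model.fit(x_train, y_train, batch_size=16, epochs=50)
# report the loss (and any compiled metrics) on the held-out test tiles
model.evaluate(x_test, y_test, batch_size=16)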
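
Before training, it can help to check what is inside the file. The sketch below builds a small stand-in archive with the same four keys and prints each array's shape and type. The shapes used here (256x256 RGB tiles and two label columns) are assumptions for illustration; inspect your own data.npz to see the real ones.

```python
import io

import numpy as np

# Stand-in for data.npz: empty arrays saved under the same four keys.
# The shapes below are assumed for illustration only.
buffer = io.BytesIO()
np.savez(
    buffer,
    x_train=np.zeros((8, 256, 256, 3), dtype=np.uint8),
    y_train=np.zeros((8, 2), dtype=np.int64),
    x_test=np.zeros((2, 256, 256, 3), dtype=np.uint8),
    y_test=np.zeros((2, 2), dtype=np.int64),
)
buffer.seek(0)

# With a real file, replace `buffer` with the path 'data.npz'.
npz = np.load(buffer)
for key in ("x_train", "y_train", "x_test", "y_test"):
    print(key, npz[key].shape, npz[key].dtype)
```

Checking shapes first makes it easier to pick a matching input layer and output size for the model.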

For more details, don't forget to visit our GitHub, and please be generous with your 👍 and ✨.

How to use the online map tool for investing in sustainable rubber cultivation in tropical Asia

Please go ahead and play with the full-screen map here.

This map application was developed to support the Guidelines for Sustainable Development of Natural Rubber, an effort led by the China Chamber of Commerce of Metals, Minerals & Chemicals Importers & Exporters with support from the World Agroforestry Centre, East and Central Asia Office (ICRAF). Asia produces more than 90% of global natural rubber, primarily in monoculture for the highest yield in limited growing areas. Rubber is largely harvested by smallholders in remote, undeveloped areas with limited access to markets, which imposes substantial labor and opportunity costs. Typically, rubber plantations are introduced in high-productivity areas, are then pushed onto marginal lands by industrial crops and other uses, and become only marginally profitable for various reasons.

Rubberplantation

Fig. 1. Rubber plantations in tropical Asia. They bring good fortune to millions of smallholder rubber farmers, but they also cause ecological and environmental damage.

The online map tool was developed for smallholder rubber farmers, foreign and domestic natural rubber investors, and governments at different levels.

The tool is titled “Sustainable and Responsible Rubber Cultivation and Investment in Asia”, and it includes two main sections: “Rubber Profits and Biodiversity Conservation” and “Risks, SocioEconomic Factors, and Historical Rubber Price”.

The main user interface looks like Fig. 2. There are 4 theme graphs and maps.

p1_section intro

Fig. 2. The main user interface of the online map tool. It includes the “Introduction”, “Section 1”, “Section 2”, and “Social media sharing” parts visible in the figure above.

Section 1

This graph shows the relationship between “Minimum Profitable Rubber (USD/kg)” (the x-axis of the graph) and “Biodiversity (total species number)” in the 2736 counties that have planted natural rubber trees in eight countries in tropical Asia. There are 4312 counties in total; in this map tool, we only present the counties where natural rubber is cultivated.

p1_section intro_high

Fig. 3. How to read and use the data from the first graph. Each dot represents a county, and its color and size indicate the area planted with natural rubber. When you move your mouse over a dot, you will see a label such as “(2.34, 552) 400000 ha @ Xishuangbanna, China”. Here 2.34 is the minimum profitable rubber price (USD/kg), and 552 is the total number of wildlife species, including amphibians, reptiles, mammals, and birds. “400000 ha” is the total area of natural rubber plantations mapped from satellite images between 2010 and 2013. “@ Xishuangbanna, China” is the location of the county.
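
If you copy hover labels out of the graph, a small parser can turn them into named fields. The sketch below handles the label format quoted in the caption; the pattern, the function name `parse_tooltip`, and the field names are my own choices for illustration and are not part of the map tool.

```python
import re

# Matches labels such as "(2.34, 552) 400000 ha @ Xishuangbanna, China".
TOOLTIP_PATTERN = re.compile(
    r"\((?P<price>[\d.]+),\s*(?P<species>\d+)\)\s*"
    r"(?P<area>\d+)\s*ha\s*@\s*(?P<place>.+)"
)


def parse_tooltip(text):
    """Split a hover label into price, species count, area, and location."""
    match = TOOLTIP_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized tooltip: {text!r}")
    return {
        "min_profitable_price_usd_per_kg": float(match["price"]),
        "species_count": int(match["species"]),
        "rubber_area_ha": int(match["area"]),
        "location": match["place"],
    }


print(parse_tooltip("(2.34, 552) 400000 ha @ Xishuangbanna, China"))
```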

Don’t be shy; please go ahead and play with the full-screen map here. The minimum profitable rubber price is the market price for national standard dry rubber products at which you start to make a profit. For example, if the market price of natural rubber is 2.0 USD/kg in the county where your rubber plantation is located but your minimum profitable rubber price is 2.5 USD/kg, you will lose money by producing rubber. However, if your minimum profitable rubber price is 1.5 USD/kg, you will still make about 0.5 USD/kg profit from your plantation.
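
The arithmetic behind this example is simply the market price minus the minimum profitable price. A minimal sketch, with a function name of my own choosing:

```python
def profit_per_kg(market_price, minimum_profitable_price):
    """Profit (positive) or loss (negative) in USD per kg of dry rubber."""
    return market_price - minimum_profitable_price


# The two cases from the text, both at a market price of 2.0 USD/kg.
for minimum_price in (2.5, 1.5):
    margin = profit_per_kg(2.0, minimum_price)
    status = "profit" if margin > 0 else "loss"
    print(f"minimum {minimum_price:.1f} USD/kg -> {status} of {abs(margin):.1f} USD/kg")
```

The first case prints a loss of 0.5 USD/kg and the second a profit of 0.5 USD/kg, matching the example above.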

Counties with a lower minimum profitable price for natural rubber will generally make better profits in the global natural rubber market. However, as the scientists behind this research, we hope that before you rush to invest and plant rubber in a given county, you also think about other risks, e.g. biodiversity loss and topographic, tropical storm, frost, and drought risks. These are shown later in this demonstration.

p2_section intro_high.gif

Fig. 4. The first map, “Rubber Cultivation Area”, shows each county that has rubber trees, colored from yellow (low) to red (high). The second map is “Minimum Profitable Rubber Price” (USD/kg); again, the higher the minimum profitable price, the less profit farmers and investors will receive. The third map is “Biodiversity (Amphibians, Reptiles, Mammals, and Birds)”, with data aggregated from the IUCN Red List and BirdLife International.

Section 2

We also show the different types of risk that investors and smallholder farmers face when they invest in and plant rubber trees. A rubber tree does not produce latex before it is 7 years old, and in general tree owners do not make any profit until the tree is around 10 years old. In this section, we present “Topographic Risk”, “Tropical Storm”, “Drought Risk”, and “Frost Risk”.

p3_section intro_high.gif

Fig. 5. Section 2, “Risks, SocioEconomic Factors and Historical Rubber Price”, has seven theme maps and interactive graphs: “Topographic Risk”, “Tropical Storm”, “Drought Risk”, “Frost Risk”, “Average Natural Rubber Yield (kg/ha.year)”, “Minimum Wage for the 8 Countries (USD/day)”, and “10 years Rubber price”.

Everything shown in this interactive map is grounded in scientific work. If you are interested in how the risk theme maps were produced, Dr. Antje Ahrends and her coauthors published a peer-reviewed article in Global Environmental Change in 2015. The “Average Natural Rubber Yield (kg/ha.year)” and “Minimum Wage for the 8 Countries (USD/day)” datasets were obtained from the International Labour Organization (ILO, 2014) and FAO. The “10 years Rubber price” data was scraped from IndexMundi Natural Rubber Price.

Dr. Chuck Cannon and I are wrapping up a peer-reviewed journal article that explains the data collection, the analysis, and the policy recommendations based on the results; we will share the link once it is available. Dr. Xu Jianchu and Dr. Su Yufang provided guidance that shaped the development of the online map tool. We could not have gathered the datasets or gained insight into how to cultivate, manage, and invest in natural rubber responsibly without the other scientists and researchers who have studied and contributed to this field for years. We thank the Wildlife Conservation Society, many other NGOs, and the national rubber research departments in Thailand and Cambodia for their support during our field investigations in 2015 and 2016.

We have two country reports, one on natural rubber in Thailand and one on natural rubber and land conflict in Cambodia. A report supporting this online map tool is being finalized, and we will share the link as soon as it is ready.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Technical side

The research and analysis were done in R, and you can find my code here.

The visualization is coded purely in R too. Isn’t R an awesome language? You can see my code for the visualization here.

To render multi-polygon GeoJSON more quickly, simplify it first:

library(rmapshaper)
county_json_simplified <- ms_simplify(<your geojson file>)

My original GeoJSON for 4000+ counties weighed about 100 MB, but this code reduced it to about 5 MB, and it renders much faster on RPubs.com.

I learned a lot from this blog on manipulating GeoJSON with R and from another blog on using flexdashboard in R for visualization. Having open-source tools and broad support from R users is great.

Artificial intelligence for urban tree species identification

No matter which part of the world you live in, a very diverse set of tree species is planted in the urban areas around us. Urban trees have many functions: they provide habitat for wildlife, clean the air and water, bring significant health and social benefits, and improve property values. Imagine waking up on a beautiful morning to birds singing outside your apartment because so many beautiful trees grow right outside. How awesome is that!

However, tree planting, surveying, and species identification require an enormous amount of work, literally generations and years of input and care. What if we could identify tree species from satellite imagery? How much faster and how well could we identify the species and record their geolocations?

A city has its own tree selection and planting plan, but homeowners have their own tree preferences, which makes the identification work a bit more complicated.

chicagoTrees

(Image from Google Earth Pro, June 2010, Chicago area)

It’s hard to tell how many tree species are planted in the image above. But if we zoom in, we can tell that these trees have slightly different crown shapes, colors, and textures. From here, I only need a reference dataset that tells me which tree I am looking at, namely the city’s tree survey and tree geolocation records. With that, I can teach a computer to pick out similar features for the species I’m interested in identifying.

GreeAsh

These are Green Ash trees (marked here as green dots).

LittleleafLiden.png

These are Littleleaf Linden trees, marked as orange dots.

Let me run a deep learning model in Caffe (a neural network, the kind of model often described as artificial intelligence) for image classification on these two species and see whether the computer can separate them in my training and test datasets.

The great news is that the model can actually tell the difference between these two species. I ran the model for 300 epochs, with the learning rate going from 0.01 to 0.001, on about 200 images of the two species; 75% went to training the model and 25% to testing. The result is not bad: around 90% accuracy (orange line) and less than 0.1 loss on the training dataset.
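
The 75/25 split can be sketched in a few lines. This is not the code used for the experiment; the file names, the fixed seed, and the function name `split_train_test` are hypothetical and only show how roughly 200 images divide into training and test sets.

```python
import random


def split_train_test(items, train_fraction=0.75, seed=42):
    """Shuffle items reproducibly and split them into train and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names standing in for about 200 labeled tree images.
images = [f"tree_{i:03d}.png" for i in range(200)]
train, test = split_train_test(images)
print(len(train), len(test))  # 150 50
```

Shuffling before the cut keeps both species represented in each set when the files are stored grouped by species.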

nvidia_d_modeltest

I gave the model a random test image (a Green Ash screenshot in this case), and it returned its prediction.

test_trees2

Next time I will work on identifying 20 other tree species and their geolocations.

Let’s find out which trees are planted in the Chicago area and how they relate to property values (an interesting question to ask), and also what ecological benefits and functions these trees provide (I’ll leave that to urban ecologists, provided my cloud computer can identify the species). Check out my future work ;-).