Label Maker: Four Command Lines to Generate Machine Learning Training Data from Satellite Imagery

This is the Chinese version; if you have not seen the original blog (in English) yet, go here: https://developmentseed.org/blog/2018/01/11/label-maker/

Label Maker is a Python library to help extract insights from satellite imagery. Label Maker creates machine-learning-ready training data for most popular ML frameworks, including Keras, TensorFlow, and MXNet. It pulls data from OpenStreetMap and combines it with imagery sources like Mapbox or Digital Globe to create a single file for use in training machine learning algorithms.

Introduction:

Label Maker is an open-source Python package we developed recently to help everyone get better, deeper insight from satellite imagery. Label Maker generates training data you can plug into any machine learning (or deep learning) framework you like, such as the currently most popular Google TensorFlow or MXNet, and it works with Keras without any friction. The package pulls data from OpenStreetMap and from Mapbox or Digital Globe to generate training datasets. If you write a connector to other satellite imagery sources, we warmly welcome changes and pull requests to our GitHub repo. And if you want to learn object detection or image classification, we have prepared all kinds of examples for you; please try them out and leave us feedback.

Now, on to the post!

[Image: ob_tf_result_fig1]

Machine learning and deep learning algorithms for computer vision are advancing by the day. Traditional satellite image interpretation is fast and convenient, for example with ERDAS or ArcGIS. But these traditional methods share a limitation: once your imagery is a bit higher in resolution and a bit larger in size, those applications and your desktop machine often can't keep up. How do we solve this quickly and effectively? That is the question I will answer today: how we can use today's GPUs and machine learning to process and interpret satellite imagery at scale.

First, a little background. Deep learning for computer vision today falls roughly into three categories: supervised learning, unsupervised learning, and reinforcement learning. Traditional satellite image interpretation also has supervised and unsupervised variants. Supervised learning can be understood like this: you tell the image interpretation software what rivers, oceans, and forests look like, and the software computes and classifies according to the thresholds you give it. In unsupervised learning you tell the software nothing; it classifies the given imagery on its own. For example, rivers and oceans simply look different in the red, green, and blue bands, so the software can separate the two classes by their differing band thresholds.

Deep learning can likewise be supervised or unsupervised. As I just said, the traditional software exists, so why use deep learning at all? Just because everyone is hyped about it right now? No, no, no...

Besides speeding up computation through heavy use of GPUs, deep learning only needs to be trained once: you can then reuse the trained model weights to predict over new, unseen areas. The more and longer you train, the better the model performs. At this link you can see road networks we interpreted with machine learning; road network extraction is the hardest task in satellite image interpretation. I won't tell you why just yet; take a guess, and I will reveal the answer next time. We have also built several similar examples of deep learning applied to satellite imagery, for example finding buildings with TensorFlow object detection, a classification model built with MXNet and Amazon SageMaker, and another classification model built with Keras on AWS.

Enough preamble. At the pace deep learning is developing, designing a new algorithm is actually not that hard. The hard part is preparing training datasets that machine learning and deep learning models can actually use.

Today I am proud to introduce our Python package, Label Maker. Label Maker is open source, so feel free to star and fork it on GitHub, and we encourage everyone to contribute. Label Maker fetches satellite imagery from Mapbox and vector data (roads, buildings, woods, and so on) from OpenStreetMap, then packages them together into training data. You can hook that data up to your favorite deep learning or machine learning framework. Label Maker generates a training dataset for you with just four commands.

After pip install label_maker, just run these four commands:

label-maker download         # download OpenStreetMap QA Tiles
label-maker labels           # create your ground-truth labels
label-maker images           # download satellite imagery tiles
label-maker package          # package tiles and labels into data.npz

Of course, I glossed over two small steps:

First, to download satellite imagery from Mapbox, you need a token for their imagery API, so go register a Mapbox account.

Second, before running the four commands above, you need to create a configuration file (config file), like this:


{
  "country": "vietnam",
  "bounding_box": [105.42,20.75,106.41,21.53],
  "zoom": 17,
  "classes": [
    { "name": "Buildings", "filter": ["has", "building"] }
  ],
  "imagery": "http://a.tiles.mapbox.com/v4/mapbox.satellite/{z}/{x}/{y}.jpg?access_token=ACCESS_TOKEN",
  "background_ratio": 1,
  "ml_type": "classification"
}

Our Python package reads the parameters in the configuration file to generate the training dataset you need. Remember to replace ACCESS_TOKEN in the config file with the token you generated on Mapbox.

Once the four commands above finish successfully, you have data.npz and can run your favorite machine learning algorithm. For example:


import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# the data, shuffled and split between train and test sets
npz = np.load('data.npz')
x_train = npz['x_train']
y_train = npz['y_train']
x_test = npz['x_test']
y_test = npz['y_test']

# define your model here, example usage in Keras
# (these layers are only placeholders -- design your own network)
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=x_train.shape[1:]))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(y_train.shape[1], activation='sigmoid'))  # one output per label class
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# train
model.fit(x_train, y_train, batch_size=16, epochs=50)
model.evaluate(x_test, y_test, batch_size=16)

For more detailed information, don't forget to visit our GitHub repo, and please don't be stingy with the 👍 and ✨!

Artificial intelligence on urban tree species identification

No matter which part of the world you live in, very diverse tree species are planted around our urban areas. Trees in cities serve many functions: they provide habitat for wildlife, clean air and water, deliver significant health and social benefits, and improve property values too. Imagine waking up on a beautiful morning to birds singing outside your apartment because beautiful trees grow outside your space. How awesome is that!

However, tree planting, surveying, and species identification require an enormous amount of work that can literally take years, even generations, of input and care. What if we could identify tree species from satellite imagery? How much faster and how well could we identify species and record their geolocations as well?

A city has its own tree selection and planting plan, but homeowners have their own tree preferences, which makes the identification work a bit complicated.

[Image: chicagoTrees]

(Photo from Google Earth Pro, June 2010, Chicago area)

It's hard to tell how many tree species are planted in the image above. But if we zoom in, we can see that these trees actually have slightly different crown shapes, colors, and textures. From here, I only need a valid dataset that tells me which tree I am looking at, i.e., a tree survey with tree geolocation records from the city. Then I can teach a computer to select similar features for the species I am interested in identifying.

[Image: GreeAsh]

These are Green Ash trees (I marked them with green dots here).

[Image: LittleleafLiden.png]

These are Littleleaf Linden trees, marked with orange dots.

Let me run a Caffe deep learning model (a neural network, often described as an "artificial intelligence" model) for image classification on these two species, and see if the computer can separate them using my training and test datasets.

Great news: the model can actually tell the difference between these two species. I ran the model for 300 epochs (runs), decaying the learning rate from 0.01 to 0.001, on about 200 images of the two species; 75% went to training the model and 25% to testing. The result is not bad: around 90% accuracy (orange line) and less than 0.1 loss on the training dataset.
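
The Caffe setup itself isn't shown in this post; purely as a rough, hypothetical sketch of the same workflow (two species, ~200 chips, a 75/25 split, learning rate decaying from 0.01 to 0.001 over 300 epochs), an equivalent in Keras could look like this, with made-up file names for the chips and labels:

# A hypothetical Keras sketch of the training setup described above
# (the actual model in this post was built in Caffe). File names,
# chip size, and the network layers are all assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from keras import layers, models, optimizers, callbacks

x = np.load('tree_chips.npy')    # hypothetical array of image chips
y = np.load('tree_labels.npy')   # 0 = Green Ash, 1 = Littleleaf Linden

# 75% train / 25% test, as described above
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=x_train.shape[1:]),
    layers.MaxPooling2D(),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer=optimizers.SGD(learning_rate=0.01),
              loss='binary_crossentropy', metrics=['accuracy'])

# decay the learning rate from 0.01 toward 0.001 across the 300 epochs
schedule = callbacks.LearningRateScheduler(
    lambda epoch: 0.01 * (0.001 / 0.01) ** (epoch / 300.0))
model.fit(x_train, y_train, epochs=300, callbacks=[schedule],
          validation_data=(x_test, y_test))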

[Image: nvidia_d_modeltest]

I threw a random test image at the model (a Green Ash screenshot in this case), and it returned its prediction.

[Image: test_trees2]

Next time, I will work on identifying another 20 tree species and their geolocations.

Let's get some answers about which trees are planted in the Chicago area and how that relates to property values (an interesting question to ask), and also what ecological benefits and functions these trees provide (I'll leave that to the urban ecologists once my cloud computer can identify the species). Check my future work ;-).


Can artificial intelligence help us identify wildfire damage from satellite imagery faster?

The following work was done by me and Dr. Shay Strong while I was a data engineering consultant under her supervision at OmniEarth Inc. All IP rights to the work belong to OmniEarth. Dr. Strong is the Chief Data Scientist at OmniEarth Inc.

A wildfire burned through the Great Smoky Mountains of Tennessee and raced rapidly northward toward Gatlinburg and Pigeon Forge between late November and December 2, 2016. At least 2,000 buildings were damaged or destroyed across 14,000 acres of residential and recreational land, and the wildfire claimed 14 lives and injured 134 people. It was the largest natural disaster in the history of Tennessee.

After obtaining 0.4 m resolution satellite imagery of the wildfire damage in Gatlinburg and Pigeon Forge from Digital Globe, OmniEarth Inc created an artificial intelligence (AI) model that was able to assess and identify property damage caused by the wildfire. The same model will also make it possible to evaluate and identify damaged areas more rapidly after similar natural disasters in the future.

The fire damage heat map our AI model generated, overlaid on the satellite imagery, can be seen here: http://a.omniearth.net/wf/.

[Image: 2017-01-26 22.15.10.gif]

Fig 1. The final result: fire damage extent in TN from our AI model.

1. The artificial intelligence model behind the wildfire damage assessment

With the help of ever-increasing cloud computing power and a better understanding of computer vision, AI technology is helping us extract information from the trillions of photos we produce daily.

Before diving into the AI model behind the wildfire damage assessment: in this case, we only want to identify the difference between fire-damaged buildings and intact buildings. We had two options: (1) spend hours and hours browsing the satellite images and manually separating the damaged and intact buildings, or (2) develop an AI model to automatically identify the damaged areas within a tolerable error. For the first option, it would easily take a geospatial technician more than 20 hours to identify the damaged areas across the 50,000 acres of satellite imagery. The second option is the more viable and sustainable solution: the AI model can automatically identify the damaged areas and buildings over the same area in less than one hour. This is accomplished with image classification, specifically convolutional neural networks (CNNs), because CNNs work better than other neural network algorithms for detecting and recognizing objects in images.

[Image: Omniearth_satellite]

Fig 2. Our AI model workflow.

Artificial intelligence / neural networks are a family of machine learning models inspired by the biological neurons of the human brain. The idea was first conceived in the 1960s, but the first real breakthrough was Geoffrey Hinton's work published in the mid-2000s. While our human eyes work like a camera, "seeing" the picture, our brain processes it and constructs the objects we see through their shape, color, and texture; the information of "seeing" and "recognizing" passes through our biological neurons from our eyes to our brain. The AI model we created works in a similar way: the imagery is passed through the artificial neural network, and objects the network has been taught to recognize are identified with a certain accuracy. In this case, we taught the network to learn the difference between burnt and intact structures in Gatlinburg and Pigeon Forge, TN.

2. How we built the AI model

We broke the wildfire damage mapping process down into four parts (Fig 2). First, we obtained the 0.4 m resolution satellite images from Digital Globe (https://www.digitalglobe.com/). From them we created training and testing datasets of 300 small image chips (as shown in Fig 3, A and B) that contained both burnt and intact buildings; two-thirds of the chips went to training the AI model (a CNN in this case) and one-third to testing it. Ideally we would use far more training data to represent burnt and non-burnt structures, so the network could learn all the variations and orientations of a burnt building. A sample of 300 is on the statistically small side, but useful for testing capability and evaluating preliminary performance.

[Images: burned.png, intact.png]
Fig 3(A). A burnt building. Fig 3(B). Intact buildings.

Our AI model was a CNN built on Theano with the GPU backend (http://deeplearning.net/software/theano/). Theano was created by the machine learning group at the University of Montreal, led by Yoshua Bengio, one of the pioneers of artificial neural networks. Theano is a Python library that lets you define and evaluate mathematical expressions with vectors and matrices. As a human, you can imagine our daily decision-making as based on matrices of perceived information too, e.g., which car you want to buy. The AI model helps us identify which image pixels and patterns are fundamentally different between burnt and intact buildings, similar to how people give different weights or scores to the brand, model, and color of the car they want to buy. Computers are great at calculating matrices, and Theano takes this to the next level by calculating multiple matrices in parallel, which speeds the whole calculation up tremendously. Theano has no particular neural network built in, so we used Keras on top of Theano. Keras let us build the AI model with a minimalist design of the network's training layers and run it more efficiently.

Our AI model ran on an AWS EC2 g2.2xlarge instance. We set the learning rate (lr) to 0.01. A smaller learning rate forces the network to learn more slowly but may also lead to better classification convergence, especially in cluttered scenes where a large amount of object confusion can occur. In the end, our AI model came out with 97% accuracy and less than 0.3 loss over three runs within a minute, and it took less than 20 minutes to run over our 3.2 GB of satellite imagery.
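
The production scanning code is not published in this post; as a hypothetical sketch of how a trained burnt/intact chip classifier could sweep a large scene to produce a coarse damage grid like Fig 1 (chip size, stride, and the `model` variable are all assumptions), something like this would do it:

# Hypothetical sketch: sweep a trained chip classifier across a scene.
# `scene` is an (H, W, 3) array of the satellite mosaic; `model` is a
# trained binary classifier like the one described above.
import numpy as np

def predict_damage_grid(scene, model, chip=64, stride=64):
    """Return a coarse grid of P(burnt), one value per chip."""
    rows = (scene.shape[0] - chip) // stride + 1
    cols = (scene.shape[1] - chip) // stride + 1
    chips = np.stack([
        scene[r * stride:r * stride + chip, c * stride:c * stride + chip]
        for r in range(rows) for c in range(cols)
    ])
    probs = model.predict(chips, batch_size=256)  # one forward pass per chip
    return probs.reshape(rows, cols)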

The model results were exported and visualized using QGIS (http://www.qgis.org/en/site/). QGIS is an open-source geographic information system that lets you create, edit, visualize, analyze, and publish geospatial information and maps. We also inspected the map by comparing our fire damage results to the briefing map produced by Wildfire Today (https://inciweb.nwcg.gov/incident/article/5113/34885/) and the Incident Information System (https://inciweb.nwcg.gov/incident/article/5113/34885/).

[Image: omniearthPacel.png]

Fig 4(A). OmniEarth parcel-level burnt and intact buildings layered on top of the imagery.

[Image: Burned_map.png]

Fig 4(B). The burn impact (red) over the Great Smoky Mountains from late November to early December 2016.

Satellite image classification is a challenging problem that lies at the crossroads of remote sensing, computer vision, and machine learning. Many currently available classification approaches are not suited to high-resolution imagery with inherently high variability in geometry and collection times. OmniEarth, however, is a startup company that is passionate about the business of science and about scaling quantifiable solutions to meet the world's growing need for actionable information.

Contact OmniEarth for more information:

For more detailed information, please contact Dr. Zhuangfang Yi, email: geospatialanalystyi@gmail.com; twitter: geonanayi.

or

Dr. Shay Strong, email: shay.strong@omniearthinc.com; twitter: shaybstrong.

Start your own Amazon Web Services instance for deep learning

I am back to my blogging life after a while!


I've been working on image classification and segmentation quite a lot recently, and I'm totally in love with GPU-powered big data processing. If you want to process data at the gigabyte level, definitely look into starting a GPU AWS instance (an Amazon deep learning AMI).

It is not free, though. You could certainly start with the AWS free tier, but I normally use their g or p machines. For example, if I use g2.2xlarge, I am charged about $0.65 per hour (so a ten-hour training session runs about $6.50); for more information, go here. You are charged by how much you use, and if you are new to deep learning and just want to run some case studies, I think it is worth far more than building your own GPU machine or buying a new PC with a super GPU.

[Image: AWS_Charge]

Before starting an Amazon deep learning AMI, you should definitely do some research:

  1. What do you want to do with the AWS machine? If you just want to learn some basic machine learning and only need to process megabyte-level CSV/TXT files, you could simply use your personal computer; personal computers are fast enough these days.
  2. As mentioned above, if you want to process images or data beyond a level your personal computer can handle, think about how much you want to spend on the data processing. Again, evaluate your situation and needs, and do some research.

My needs for this personal AWS EC2 machine are:

  1. Processing big datasets for neural-network image classification and segmentation;
  2. A machine with TensorFlow, Theano, Torch, Caffe, and Keras installed. TensorFlow, Theano, Torch, and Caffe are deep learning frameworks/ecosystems; Keras is the Python module I use to build deep learning model architectures.

If you are thinking about doing the same, this is a great blog to help you start your own AWS AMI instance: here, or this one. They both have explicit instructions on how to start the instance.

A second option is to launch an AWS AMI with a Jupyter notebook server without going through the AWS web console at all, using the following command lines in your terminal:

[Image: startJupyterNotebookServer]

Copy and paste the following command lines (CLI), shown in the figure above.

# create security group
aws ec2 create-security-group --group-name JupyterSecurityGroup --description "My Jupyter security group"

# add security group rules
aws ec2 authorize-security-group-ingress --group-name JupyterSecurityGroup --protocol tcp --port 8888 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name JupyterSecurityGroup --protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name JupyterSecurityGroup --protocol tcp --port 443 --cidr 0.0.0.0/0

# launch instance
aws ec2 run-instances --image-id ami-41570b32 --count 1 --instance-type p2.xlarge --key-name <YOUR_KEY_NAME> --security-groups JupyterSecurityGroup

The next thing would be to configure your Jupyter Notebook Server:

[Image: cert]

jupyter notebook --generate-config
key=$(python -c "from notebook.auth import passwd; print(passwd())")

cd ~
mkdir certs
cd certs
certdir=$(pwd)
openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.key -out mycert.pem

cd ~
sed -i "1 a\
c = get_config()\\
c.NotebookApp.certfile = u'$certdir/mycert.pem'\\
c.NotebookApp.keyfile = u'$certdir/mycert.key'\\
c.NotebookApp.ip = '*'\\
c.NotebookApp.open_browser = False\\
c.NotebookApp.password = u'$key'\\
c.NotebookApp.port = 8888" .jupyter/jupyter_notebook_config.py

These commands create the certificate for your Jupyter notebook server. After they run successfully, you can start Jupyter and test whether your notebook server works:

screen -S jupyter
mkdir notebook
cd notebook
jupyter notebook

For more details, see this blog.

If you want to use an Ubuntu AMI instead of the Amazon AMI, here is another good blog on setting up a Jupyter notebook server on the machine: https://chrisalbon.com/jupyter/run_project_jupyter_on_amazon_ec2.html


A bit of crazy machine learning and my showcase 2: using logistic regression to predict income category

Uber will offer self-driving cars in Philadelphia this November, and sooner or later you will get a ride in an Uber that pops up at your doorway without a human driver. It is fascinating and crazy at the same time. It sounds like science fiction, but it will definitely be real soon. Machine learning is part of what has brought this to reality, and it deserves the credit.

What is machine learning? It is a way we teach a computer to learn from thousands or millions of data records, to find patterns or rules, so that it can behave, or finish a task, the way we want. It is very similar to how we teach babies or pets to learn things. For example, we teach the computer in a self-driving car to remember the roads and to navigate the city thousands of times, so it learns how to drive the way we want it to. Let's wait and see the users' reviews of the Uber self-driving cars this November.

If we say babies grow their knowledge from EXPERIENCE, then a computer with a machine learning algorithm learns from thousands or millions of data records. From past records (and it can only be the past, because we have no data records from the future), it finds patterns or rules that may repeat in the future. This is part of artificial intelligence (AI).

Machine learning algorithms are commonly used in our daily life: the recommendations on our favorite websites, spam email identification, your movie/TV list from Amazon or Netflix, favorite songs from Spotify or Pandora. Credit card companies can spot a fraud when a card is used in an unusual location, based on your past spending records. Several startups already use these algorithms to help customers pick out clothes according to their personal taste. The pattern sorting behind all of this is machine learning. You might wonder how a computer learns your preferences and tastes when you have only used a service a few times, but don't forget that there are millions and billions of other people acting as data points. To the computer, or to the algorithm in particular, your eating, learning, and taste habits are data points alongside millions of other data points (users): you can be learned from your own habits, but also from the other users in the algorithm's data cloud. The accuracy, though, really depends on the algorithm and on the person who sets the rules.

Machine learning sounds very fancy and cutting edge, but it is not: in terms of methodology it is close to data mining and statistics, which means you can apply any statistical and mathematical method you learned in school. Machine learning is not about which computer language you code in, or whether it runs on a supercomputer; the essence is the algorithm. What is genuinely fancy is that data scientists can dig the best algorithms and patterns out of data to assist us in making better decisions on a daily basis, or so that you don't even need to make a decision yourself and can just ask the app or your computer.

This is part of a series of blogs I am trying to write, whose ultimate goal is, of course, to unlock the popular algorithms behind machine learning. I presented a showcase in my last blog: predicting bike demand for Capital Bikeshare using multiple linear regression. This blog is showcase 2, on logistic regression. Even though you might think logistic regression is a kind of regression, it is not: it is a classification method, used to answer YES or NO, e.g., does this patient have cancer or not; is this a bad loan or not. That is where false positives and false negatives come in, also called Type I and Type II errors in statistics. When you ask what they actually mean, your statistics teacher might say "a Type I error is the incorrect rejection of a true null hypothesis, and a Type II error is the failure to reject a false null hypothesis." And... ZZzzzz... you fell asleep and never understood what they are.

Here is a good way to remember them.

Suppose you make the hypothesis "this person is pregnant," and later you collect a tremendous amount of data to test it. When the model judges YES or NO, there are four possible outcomes: a pregnant person classified as pregnant, a non-pregnant person classified as not pregnant (a good model gets mostly these two), and then the two errors. A Type I error (false positive) is telling a man that he is pregnant; a Type II error (false negative) is telling a pregnant woman that she is not.

[Image: FPCq0]

(Graph from https://effectsizefaq.com/category/type-i-error/)

Note: don't stop here; the actual Type I and Type II errors are a bit more complicated than this graph, but I hope it helps you remember them as it does me. (Think about what happens if your hypothesis is instead "this person is NOT pregnant.")
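
To make the four outcomes concrete, here is a toy Python sketch (the labels are made up) that counts the false positives and false negatives with scikit-learn:

# Toy illustration of Type I / Type II errors with made-up labels:
# 1 = "pregnant", 0 = "not pregnant"
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # what the people actually are
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]   # what the model says

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("False positives (Type I, 'he is pregnant'):", fp)
print("False negatives (Type II, 'she is not pregnant'):", fn)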

Showcase 2. Using logistic regression to predict whether your salary will be more than 50K

Here, I will use an example to show you how it works.

The dataset I use here was downloaded from UCI; it has about 35,000 records, and its structure looks like the following graph. We have variables for age, type of employer, education and years of education, marital status, race, work hours per week, native country, and salary. This is just a showcase for studying logistic regression.

[Image: 1 raw data.png]

Before we go into the logistic model, let's look at some interesting patterns in the data: the relationships between the salary categories (<50K, >50K) and education, race, sex, marital status, etc.

[Image: Rplot06]

People who are married tend to earn more than 50K more often than people who never married or are currently not married.

[Image: Rplot03]

Far more people earn less than 50K around age 25, while people aged between 40 and 50 are the most likely to earn more than 50K.

[Image: Rplot04]

Earning more than 50K does not simply depend on working longer hours per week.

[Image: Rplot07]

People with more years of education earn a bit more, for both men and women; of course, that alone doesn't tell you that you would earn more with more education.

[Image: Rplot08]

More people are employed in the private sector, and no matter where a person is employed, women are more likely to be in the <50K salary category. In other words, within the same type of employer, women are likely to be paid less.
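
The plots above were made in R; as a rough Python illustration, a summary like the last one could be computed like this, assuming the UCI "Adult" CSV and its usual column names:

# Counts of income category by sex, similar to the last plot above.
# 'adult.csv' is a hypothetical local path to the UCI dataset.
import pandas as pd

df = pd.read_csv('adult.csv')
print(pd.crosstab(df['sex'], df['income']))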

[Image: model.png]

Before running the logistic regression, I split the dataset into two parts: a training dataset and a testing dataset. The training data takes up about 70 percent of the whole dataset. After fitting the model, I use the testing data to check whether my model/algorithm is good enough; this is where the Type I and Type II error rates come out. For the detailed R code I wrote, go to my GitHub.
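
My original code is in R (linked above), but as a rough Python equivalent of the same split-train-test workflow, again assuming the UCI "Adult" CSV and its usual column names:

# Rough Python equivalent of the workflow described above (original is in R).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('adult.csv')
X = pd.get_dummies(df.drop(columns=['income']))   # one-hot encode categoricals
y = (df['income'] == '>50K').astype(int)          # 1 if salary > 50K

# ~70% training / 30% testing, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('test accuracy:', clf.score(X_test, y_test))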

From the model output (graph above) you can see that some factors (variables) have a positive impact on income, e.g., age and being married, while some have a negative impact, e.g., when a person's education stopped between 4th and 9th grade or at preschool. I have tried not to confuse you with the statistical part, but if you want to understand the algorithm a bit more, I recommend the book An Introduction to Statistical Learning; Chapter 4 in particular covers logistic regression.

If we want to know whether the algorithm I built is a good one, we need to test the model, and the following parameters give us the answer. For example, the accuracy of the model is measured by the proportion of true positives and true negatives in the whole dataset.
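
In formula form (TP, TN, FP, FN are the counts of true positives, true negatives, false positives, and false negatives):

Accuracy = (TP + TN) / (TP + TN + FP + FN)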

[Image: azure-machine-learning-intro-18-638]


There are three categories of machine learning algorithms: supervised learning, unsupervised learning, and reinforcement learning. Logistic regression and linear regression both belong to supervised learning.

My best self-taught strategy is "learning by doing": getting your hands dirty is always the best way to get good at something you want to master, and I have had so much fun learning the algorithms and statistics behind machine learning. Here are some great blogs to read, too. If you are interested in learning more, you can follow my blog or Twitter: @geonanayi