A bit of crazy machine learning and my showcase 2: using logistic regression to predict the income category

Uber will offer self-driving cars in Philly this November, and sooner or later you will get a ride in an Uber that pulls up at your door without a human driver. It’s fascinating and a little crazy at the same time. It sounds like science fiction, but it will be real soon, even though it will take a while before it is widely available. Part of what has made this possible is machine learning, and it definitely deserves some credit.

What’s machine learning? It’s a way of teaching a computer to learn from thousands or millions of data records, to find patterns or rules, so that it can finish a task the way we want. It is very similar to how we teach babies or pets to learn things; the only difference is that the student is a computer. For example, the computer in a self-driving car learns the roads and how to navigate a city by practicing thousands of times, until it drives the way we want it to and no longer needs a human driver. How much better than a human driver it will be is still unknown, so let’s wait and see what riders say about Uber’s self-driving cars this November.

If babies gain knowledge from experience, then a computer with a machine learning algorithm gains it from thousands or millions of data records. It can only learn from past records, because we don’t have data from the future. From those records it finds patterns and rules that are likely to repeat, and uses them to guide its future behavior. That is machine learning, and it is part of artificial intelligence (AI).

Machine learning algorithms are common in our daily life. Email services use them to identify spam. Our favorite websites use them for recommendations: products on Amazon or Taobao, movies and TV shows on Netflix, songs on Spotify or Pandora. Credit card companies can spot fraud when a card is used in an unusual location compared with your past spending records. Several startups already use algorithms to help customers pick clothes that match their personal taste. You might wonder how a computer learns your preferences when you have only used a service a few times, but don’t forget that millions of other users are data points too. To an algorithm, your eating, shopping, and listening habits are data points sitting alongside those of millions of other users, so it can learn about you both from your own habits and from users who look like you. You may also have noticed that the machine gets it wrong fairly often. The accuracy really depends on the algorithm and on the person who set the rules.

Machine learning sounds fancy and cutting-edge, but in terms of methodology it is close to data mining and statistics, which means you can apply the statistical and mathematical methods you learned in school. Machine learning is not about which programming language you use, or whether the code runs on a supercomputer; what matters is the algorithm. The best algorithm is one that reflects reality and helps us make better decisions in daily life. What is fancy is that data scientists can dig such patterns out of the data, so that sometimes you don’t even need to make the decision yourself and can just ask an app or your computer.

This is a series of blogs that I am trying to write. I am no statistics expert and I am still learning, so the goal is to help everyone (mostly myself) understand the popular algorithms behind machine learning a bit better. I presented the first showcase in my last blog: predicting bike demand for Capital Bikeshare using multiple linear regression. If you are interested in machine learning, I suggest reading that one too. This blog is showcase 2, on logistic regression. From its name you might think logistic regression is just another kind of regression, but it is used as a classification method: it answers YES or NO, e.g. does this patient have cancer or not; is this a bad loan or not. That’s where false positives and false negatives come in, also called Type I error and Type II error in statistics. When you look up what they actually are, your math teacher might say “Type I error and Type II error are where a positive result corresponds to rejecting the null hypothesis, and a negative result corresponds to not rejecting the null hypothesis.” And… ZZzzzz… you fell asleep and never understood what they are.

Here is a good way to remember them.

Suppose you ask a question or make a hypothesis that ‘this person is pregnant’, and then collect a tremendous amount of data and run a logistic regression to test it. Every YES or NO from the model falls into one of four outcomes: pregnant and predicted pregnant (true positive), not pregnant but predicted pregnant (false positive), pregnant but predicted not pregnant (false negative), and not pregnant and predicted not pregnant (true negative). A good model puts most cases into the two correct groups. A Type I error (false positive) is telling a man that he is pregnant. A Type II error (false negative) is telling a pregnant woman that she is not pregnant. The graph below shows both:

FPCq0

(Graph from https://effectsizefaq.com/category/type-i-error/ )

Note: Don’t stop here. The actual Type I and Type II errors are a bit more complicated than this graph, but I hope it helps you remember them as it does me. For example, think about what happens if your question or null hypothesis is ‘this person is not pregnant’.
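To make the four outcomes concrete, here is a minimal Python sketch (my own analysis was written in R, so this is only an illustration). The function name `confusion_counts` and the six example records are made up for this sketch; they are not from the income dataset.

```python
def confusion_counts(actual, predicted):
    """Count the four outcomes of a yes/no prediction."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, guess in zip(actual, predicted):
        if guess and truth:
            counts["TP"] += 1   # pregnant, predicted pregnant
        elif guess and not truth:
            counts["FP"] += 1   # Type I error: not pregnant, predicted pregnant
        elif not guess and truth:
            counts["FN"] += 1   # Type II error: pregnant, predicted not pregnant
        else:
            counts["TN"] += 1   # not pregnant, predicted not pregnant
    return counts

# Hypothetical records: True means "pregnant"
actual = [True, True, False, False, True, False]
predicted = [True, False, True, False, True, False]

counts = confusion_counts(actual, predicted)
print(counts)
```

With these made-up records, one case lands in each error group and the other four are classified correctly.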

Showcase 2. Using logistic regression to predict whether your salary will be more than 50K

Here, I use an example to show you how it works.

The dataset I use here was downloaded from UCI. It has about 35,000 records, and its structure looks like the following graph. The variables are age, type of employer, education and years of education, marital status, race, work hours per week, country of origin, and salary. This is just a showcase for studying logistic regression, so please don’t hold me responsible for the conclusions.

  1 raw data.png

Before we go into the logistic model, let’s look at some interesting patterns in the data: how the salary category (<50k, >50k) relates to education, race, sex, marital status, etc.

Rplot06

People who are married are more likely to earn more than 50k than people who never married or are currently not married.

Rplot03

Most people around 25 years old earn less than 50k, and people who earn more than 50k are mostly between 40 and 50.

Rplot04

Whether you earn more or less than 50k does not depend much on how many hours you work per week.

Rplot07

People with more years of education tend to earn a bit more, whether they are male or female. Of course, this alone does not prove that you would earn more with more education.

Rplot08

More people are employed in the private sector than anywhere else, and no matter where a person is employed, women are more likely than men to be in the <50k salary category. In other words, within the same type of employer, women are likely to be paid less.

model.png

Before running the logistic regression, I split the dataset into two parts: a training dataset and a testing dataset. The training data takes up about 70 percent of the whole dataset, although that share is not fixed and you can choose your own. After fitting the model on the training data, I run the testing data through it and compare the recorded outcomes with the predicted ones to see whether my model/algorithm is good enough. This is where the rates of Type I and Type II errors come from. For the detailed R code I wrote for the model and the testing, you can go to my GitHub.
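The split described above can be sketched in a few lines of Python. This is only an illustration and not my R code: the helper name `train_test_split`, the fixed seed, and the 100 stand-in records are assumptions made for the example.

```python
import random

def train_test_split(records, train_fraction=0.7, seed=42):
    """Shuffle the records and cut them into a training part and a testing part."""
    shuffled = list(records)               # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the real rows: 100 record ids
records = list(range(100))
train, test = train_test_split(records)
print(len(train), len(test))
```

Shuffling before cutting matters: if the file happens to be sorted (say, by age), taking the first 70 percent would give the model a biased view of the data.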

From the model (graph above) you can see that some factors (variables) have a positive impact on income, e.g. age and being married, while some have a negative impact, e.g. when a person’s education stopped at preschool or between 4th and 9th grade. I have left out most of the statistics here so as not to confuse you, but if you wanna understand more about the statistics behind the algorithm, I recommend the book An Introduction to Statistical Learning, which shows many applications of R in statistical learning. You can jump straight to Chapter 4 for logistic regression.
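To see what a positive or negative impact means in a logistic model, here is a small Python sketch. The intercept and coefficients below are hypothetical numbers chosen for illustration; they are not the values from my fitted model. The model adds up the weighted variables and passes the sum through the logistic (sigmoid) function to get a probability between 0 and 1.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(intercept, coefficients, features):
    """Probability of the YES class (here: salary > 50k) for one person."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return sigmoid(z)

# Hypothetical coefficients, NOT the fitted values from the post
intercept = -3.0
coefficients = {"age": 0.03, "married": 1.2, "low_education": -0.9}

single = {"age": 40, "married": 0, "low_education": 0}
married = {"age": 40, "married": 1, "low_education": 0}
low_edu = {"age": 40, "married": 0, "low_education": 1}

p_single = predict_probability(intercept, coefficients, single)
p_married = predict_probability(intercept, coefficients, married)
p_low_edu = predict_probability(intercept, coefficients, low_edu)
print(round(p_single, 3), round(p_married, 3), round(p_low_edu, 3))
```

A positive coefficient (married) pushes the probability of earning more than 50k up, and a negative one (low education) pushes it down, which is how to read the signs in the model output.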

To know whether the algorithm I built is a good one, I need to test the model, and the following parameters will give me the answer. For example, the accuracy of the model is the proportion of true positives and true negatives among all the predictions, as in the accuracy formula in the graph below.

azure-machine-learning-intro-18-638
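As a rough sketch of how these parameters are computed, the Python below takes the four outcome counts and returns the accuracy together with the false positive (Type I) and false negative (Type II) rates. The counts are invented for the example and are not my test results.

```python
def evaluate(tp, fp, fn, tn):
    """Summary rates from the four outcome counts of a yes/no classifier."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,         # share of all predictions that are right
        "false_positive_rate": fp / (fp + tn), # Type I: actual NO predicted as YES
        "false_negative_rate": fn / (fn + tp), # Type II: actual YES predicted as NO
    }

# Hypothetical counts for 100 test records
metrics = evaluate(tp=50, fp=10, fn=5, tn=35)
print(metrics)
```

Accuracy alone can hide a problem: when one class is rare, a model that always answers NO still scores high, so it is worth looking at the two error rates as well.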

 

There are three categories of machine learning algorithms: supervised learning, unsupervised learning, and reinforcement learning. The multiple linear regression and logistic regression from my two blogs both belong to supervised learning.

My best self-teaching strategy is ‘learning by doing’: ‘get your hands dirty’ is always the best way to get good at something you wanna master. I have had so much fun learning the algorithms and statistics behind machine learning through these showcases, and here are some great blogs to read too. If you are interested in learning more with me, you can follow my blog or Twitter: @geonanayi

 

 

 

 

Data-driven city planning: Using multiple linear regression to predict bike sharing demand in Washington DC

A bike sharing system is a convenient and clean way to get around a city: you obtain a membership, rent a bike, and return it. Such systems are getting popular in highly populated cities around the world.

bikes

(This picture is from http://kaggle.com)

While I was in Chicago on a business trip, my favourite activity was riding a rental bike along Lake Michigan. A bright autumn afternoon with few clouds and cool air is a perfect time to rent a bike and explore an unknown city as a tourist. However, it’s quite frustrating when crowds of other tourists are renting bikes too. It’s unpleasant if there are no bikes left, or if too many bikers are sharing a tiny bike lane. So for a tourist who wants to explore an exciting city, a nice fall afternoon without many people on the road makes the perfect ride.

A city bike sharing manager, however, has completely different worries from a tourist. As a manager, you wanna share as many bikes as possible with potential riders, and I am sure you will have concerns like these:

  1. How many bikes are actually needed in the city bike sharing system?
  2. How does the bike demand vary from day to day with temperature, weather, holidays, and humidity?

It is most cost-efficient when the city does not provide too many bikes, but it is also important not to run short.

Picture1

I got the data from http://kaggle.com; there are about 11,000 records in the dataset, covering two years of bike use. The dataset was provided by Hadi Fanaee Tork using data from Capital Bikeshare. Capital Bikeshare is a bike sharing system in Washington DC that rents bikes to people who are going to the Metro, to work, or to run errands. It has more than 3,000 bikes at over 350 stations across Washington DC, Arlington and Alexandria, VA, and Montgomery County, MD, and a bike can be returned to any station near your destination. I have lived near DC for a while but have not used the bikes yet. It might be worth a try: I hear the first 30 minutes are free.

Map

Some results from the data analysis

 

Rplot02

(1-Spring; 2-Summer; 3-Fall; 4-Winter)

From the graph, we can tell that the most people use bikes during the fall, and the fewest people bike around during the spring.

Rplot01

Through the year, bike demand starts to climb after April and declines after October. The demand peak is around September, at least in these two years of data records.

When I replot the data over 24 hours for working days, from midnight to 23:00, the pattern of bike demand is: 1) as the temperature rises, more people are on bikes; 2) there are two peaks of bike demand in a work day, in the morning around 8:00 and in the afternoon around 18:00, regardless of temperature; 3) people like to use bikes during lunch time when the temperature is warmer than 20 degrees.

Rplot03

However, the bike demand looks a bit different on a holiday. The maximum demand is not as high as on working days, which suggests that on working days it is mostly DC-area residents using the bikes. Holiday demand is more spread out than on work days: it slowly starts after 8 am when the temperature is pleasant, and the demand peak appears around 13:00 to 17:00.

Rplot04

In the graphs above, we have only dug into how bike demand, which is labeled “count” in the dataset, relates (mainly) to temperature. In reality, demand does not depend on temperature alone. If we wanna predict how many bikes we actually need each day, just imagine every condition under which you would not wanna ride a bike in DC. Speaking only for myself, I don’t wanna ride a bike when: 1) it’s too cold out there (oops, tropical people); 2) it’s too humid and sticky; 3) it’s too windy (I am pretty light, after all); 4) it’s raining hard; 5) there are too many people out there riding bikes.

To make a prediction like a bikeshare system manager, we need to know the correlation between each pair of variables: humidity, weather, workingday, windspeed, hour, holiday, and so on. Therefore, I produced a graph of the correlation for each pair of variables. Blue colours represent positive correlation, and red colours mean negative correlation. For example, looking at the column for ‘count’ (the bike demand I mentioned above), it has a positive correlation with temperature (‘temp’ in the graph), which means that when the temperature goes up people like to ride bikes more, but it has a negative correlation with humidity (‘humidity’ in the graph), which indicates people do not like to ride a bike when the humidity is high. Note: this is a linear regression, which means I just assume each pair of variables is linearly related, and that does not always reflect reality. For example, I would not bike outside when it’s too hot, but the regression says people would love to bike even more as it gets hotter (positive correlation). Perhaps it rarely gets that hot in DC, or there are too few very hot days in the sample to affect the overall result.

Rplot05
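The number behind each cell of a correlation graph like this is usually the Pearson correlation coefficient. Here is a minimal Python sketch of how it is computed for one pair of variables. The six readings for `temp`, `humidity`, and `count` are made up to show the signs; they are not rows from the Capital Bikeshare data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: +1 is a perfect rising line, -1 a perfect falling line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical readings, NOT taken from the dataset
temp = [5, 10, 15, 20, 25, 30]
humidity = [90, 80, 75, 60, 55, 40]
count = [40, 90, 150, 210, 260, 300]

r_temp = pearson(temp, count)          # positive: warmer hours, more riders
r_humidity = pearson(humidity, count)  # negative: more humid hours, fewer riders
print(round(r_temp, 2), round(r_humidity, 2))
```

Because the coefficient only measures how well a straight line fits, it cannot see a relationship that rises and then falls, which is exactly the too-hot-to-bike problem mentioned above.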

I ran a linear regression between bike demand and the variables above and got the following regression. It can help us predict how many bikes we actually need in the Capital Bikeshare system each day, according to the weather, holiday, temperature, etc., which is what a bike sharing manager cares about. As a tourist, you could also use the weather forecast to get a rough idea of how many bikes are going to be around in the city, and decide whether you wanna go out today: pick a busy time if you like a crowd, or a quiet time if you don’t.

linear regression

At this point we can make a prediction under some assumed conditions: it’s fall, and a holiday; the weather is clear with few clouds; the temperature is 30, but the feels-like temperature is about 34; humidity is about 70%; wind speed is about 2; and it’s close to 16:00. So we can predict how many bikes are needed for this particular hour, day, and weather.

The answer is 781 bikes.

My R code can be found here: http://rpubs.com/Geoyi/BikeshareDC_LM.
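For readers who wanna see the mechanics, a multiple linear regression predicts demand by adding the intercept to each condition multiplied by its coefficient. The Python sketch below uses hypothetical coefficients, so it does not reproduce the 781 bikes from my fitted model; the real coefficients are in the R code linked above.

```python
def predict_demand(intercept, coefficients, conditions):
    """Linear prediction: intercept + sum of (coefficient * value) over the variables."""
    return intercept + sum(coefficients[name] * conditions[name] for name in coefficients)

# Hypothetical coefficients, NOT the fitted values from the post
intercept = 20.0
coefficients = {
    "temp": 8.0,       # warmer -> more riders
    "humidity": -2.0,  # more humid -> fewer riders
    "windspeed": 0.5,
    "hour": 7.0,
    "holiday": -10.0,
}

# Conditions similar to the scenario described in the text
conditions = {"temp": 30, "humidity": 70, "windspeed": 2, "hour": 16, "holiday": 1}

demand = predict_demand(intercept, coefficients, conditions)
print(demand)
```

Changing one condition at a time (say, humidity from 70 to 90) shows how much each variable moves the prediction, which is a handy way to read a regression table.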

Statistics is quite useful, isn’t it?!

Global Zika virus epidemic from 2015 to 2016: A big data problem

The Centers for Disease Control and Prevention (CDC) provided data on the Zika virus epidemic from 2015 to 2016, about 107,250 observed cases globally, to kaggle.com. Kaggle is a platform where data scientists compete on data cleaning, wrangling, and analysis, and share data and solutions for big data problems.

The Zika virus epidemic is an interesting problem, so I took the challenge and coded an analysis in RStudio. This blog presents some of my initial results. However, after finishing a rough analysis, I found that this works better as an example of big data analysis than as a solution for the CDC on the Zika virus epidemic, because the raw data is rough and many records have not been cleaned and clarified yet. The raw data description can be seen here.

A bit of background on Zika and the Zika virus epidemic, from the CDC:

  • Zika is spread mostly by the bite of an infected Aedes species mosquito (Ae. aegypti and Ae. albopictus). These mosquitoes are aggressive daytime biters. They can also bite at night.
  • Zika can be passed from a pregnant woman to her fetus. Infection during pregnancy can cause certain birth defects, e.g. microcephaly. Microcephaly is a rare nervous system disorder that causes a baby’s head to be small and not fully developed.
  • There is no vaccine or medicine for Zika yet.

Initial outputs from the data analysis

Firstly, let’s see the animation of the Zika virus observations globally. The cases were recorded from Nov. 2015 to July 2016. At least according to the cases documented during this period, the epidemic started in Mexico and El Salvador and spread to South American countries and the USA. The gif animation makes the spread from country to country easy to see and the data visualization look fancy, but when I looked more closely, the dataset needs serious cleaning and wrangling. In other words, the data collection and recording are quite messy.

Zika_ani.gif

Raw data, as it looks when opened in Excel

dataset screenshot

The raw data is organized by report date, case location, location type (e.g. region or province), data field/data category, field code, time period, period type, value (how many observations/cases), and unit.

Rplot

When I plotted the cases by country from 2015 to 2016, far more Zika cases were reported in 2016, especially in South American countries (though the 2015 records only start in November, so the comparison does not say much). Colombia had by far the most reported Zika cases. In the USA, Puerto Rico, New York, Florida, and the Virgin Islands have reported Zika cases so far. Cases from Asia and Europe are not included in this dataset. During the recorded period, 12 countries in North, Central, and South America reported Zika virus cases. From the most reported cases to the least, these countries are: Colombia (86,889 reported cases), Dominican Republic (5,716), Brazil (4,253), USA (2,962), Mexico (2,894), Argentina (2,091), El Salvador (1,000), Ecuador (796), Guatemala (516), Panama (148), Nicaragua (125), and Haiti (52). See the map below.

Rplot01
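The ranking above is just a sort of the per-country totals. As a small Python sketch (my analysis was in R), the snippet below takes the twelve totals exactly as listed in the text, ranks them, and works out each country's share of their sum. Keep in mind that these are the uncleaned counts, so the shares are only as good as the raw records.

```python
# Per-country totals as listed in the text above
cases = {
    "Colombia": 86889, "Dominican Republic": 5716, "Brazil": 4253,
    "USA": 2962, "Mexico": 2894, "Argentina": 2091, "El Salvador": 1000,
    "Ecuador": 796, "Guatemala": 516, "Panama": 148, "Nicaragua": 125,
    "Haiti": 52,
}

# Rank countries from most to fewest reported cases
ranked = sorted(cases.items(), key=lambda item: item[1], reverse=True)

# Share of the listed total that each country accounts for
total = sum(cases.values())
share = {country: count / total for country, count in cases.items()}

for country, count in ranked:
    print(f"{country:20s} {count:7d} {share[country]:6.1%}")
```

Colombia alone accounts for roughly four fifths of the listed cases, which is why it dominates the map.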

However, when I went back to organize the reported Zika cases by type for each country, I found the data recorded for each country was not consistent. It’s obvious that each country has its own strengths and constraints in tracking the Zika epidemic, perhaps because a different agency is responsible for recording and tracking cases in each country. Let’s see some examples:

Rplot14Rplot13Rplot12Rplot11Rplot10Rplot09Rplot08Rplot07Rplot06Rplot05Rplot04Rplot03Rplot02

In the States, most of the reported cases are travel-related, meaning the infected person had been to a country where the virus was spreading. But some things confuse me: don’t the confirmed fever, eye pain, and headache cases overlap with the reported Zika cases, and aren’t the zika_reported travel cases already included in yearly_reported_travel_cases? If so, the cases were overestimated for most of the countries. So the raw data needs to be understood and cleaned carefully before analysis. Or the records may all be correct, and probably only the CDC can explain the data from the medical and epidemiological perspective.

According to the reported cases, microcephaly caused by the Zika virus was only found in Brazil and the Dominican Republic. Microcephaly is a rare nervous system disorder that causes a baby’s head to be small and not fully developed; the child’s brain stops growing as it should. Women who are pregnant or planning to become pregnant should therefore avoid travelling to these countries. People get infected with Zika through the bite of an infected Aedes species mosquito (Aedes aegypti and Aedes albopictus), but the virus can also be passed on through sex. Earlier reports only found that a man with Zika can pass it to his sex partners, but there has since been a case in which a woman infected with the Zika virus passed it to her partner too.

This is a very rough analysis, but my original R code can be accessed here. The first gif animation was originally coded by a UK-based data scientist, Rob Harrand; I only edited the data interval and the image resolution. All the other code is mine, so if you use it, please credit geoyi.org.

Note: Again, this is an example of big data analysis rather than a complete analysis for the CDC on the Zika virus epidemic, because the raw data still needs serious cleaning and parts of it can probably only be explained by the CDC. For more insight, please follow the CDC’s reports and recorded cases.

 

PV Solar Installation in US: who has installed solar panel and who will be the next?

Project idea

Photovoltaic (PV) solar panels, which convert solar energy into electricity, are one of the most attractive options for homeowners. Studies have shown that by 2015 about 4.8 million homeowners had installed solar panels in the United States of America. Meanwhile, the solar energy market continues to grow rapidly. The estimated cost and potential savings of solar are what concern homeowners most. There is also tremendous commercial potential in the solar energy business, and visualizing the long-term trend of the market is vital for solar energy companies’ survival. The visualization could be built by examining the following aspects:

  1. Who has installed PV panels, and what are the characteristics of the household, e.g. what’s the age, household income, education level, current utility rate, race, home location, current PV resource, existing incentive and tax credits for those that have installed PV panels?
  2. What does the pattern of solar panel installation look like across the nation, and at what rate? Which households are most likely to install solar panels in the future?

The expected primary output of this proposal is a web map application with two major functions. The first shows households the cost and returned benefit according to their home geolocation. The second gives companies interactive maps of the geolocations of their future customers and the growth trends.

Initial outputs


The cost and payback period for the PV solar installation: Why not go solar!

NetCost

Incentive programs and tax credits bring down the cost of solar panel installation. This is the average cost for each state.

Monthly Saving

Going solar would reduce homeowners’ spending on electricity bills.

Payback Years

Payback years vary from state to state, depending on incentives and costs. High cost does not necessarily mean a longer payback period because it also depends on the state’s current electricity rate and state subsidy/incentive schemes. The higher the current electricity rate, the sooner you would recoup the costs of solar panel installation. The higher the incentives from the state, the sooner you will recoup the installation cost.
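A simple way to picture the payback period is net cost divided by yearly savings. The Python sketch below uses that simplified formula with invented dollar amounts; it is an assumption made for illustration, not the exact calculation or the state averages behind the maps.

```python
def payback_years(install_cost, incentives, monthly_saving):
    """Years needed for electricity savings to cover the net installation cost."""
    net_cost = install_cost - incentives  # incentives and tax credits reduce the cost
    annual_saving = monthly_saving * 12
    return net_cost / annual_saving

# Hypothetical households with the same system cost
base = payback_years(install_cost=20000, incentives=8000, monthly_saving=100)
high_rate = payback_years(install_cost=20000, incentives=8000, monthly_saving=150)
high_incentive = payback_years(install_cost=20000, incentives=11000, monthly_saving=100)
print(base, round(high_rate, 2), high_incentive)
```

A higher electricity rate means a larger monthly saving, and a larger incentive means a smaller net cost; both shorten the payback period, matching the two statements above.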

How many PV panels have been installed and where?

Number of Solar Installation

The number of solar panel installations in each state, as registered on NREL’s Open PV Project. I was able to collect about 500,000 installations from the Open PV Project. The data is zip-code based, so I was able to merge it with the “zip code” package in R. My R code file is in my GitHub project.

Other statistical facts: American homeowners who installed solar panels generally have a household income $25,301.5 higher than the national household income. Their homes are located in places with a higher electricity rate, about 4 cents/kWh above the national average, and with a higher solar energy resource, about 1.42 kW/m2 above the national average.

Two interactive maps were produced in RStudio with “leaflet”

Solar Installation_screen shot1

An overview of the solar panel installation in the United States.

Solar Installation_screen shot2

Residents on the West Coast have installed about 32,000 solar panels from the data registered on the Open PV Project, and most of them were installed by residents in California. When zoomed in closely, one could easily browse through the details of the installation locations around San Francisco.

Solar Installation_screen shot3

Another good location is the District of Columbia (Washington D.C.) area. The East Coast has less solar energy resource (kW/m2) than the West Coast, especially California. However, solar panel installations by homeowners around the DC area are very high too. From the maps above, we know this is because the cost of installation is much lower and the payback period is much shorter than in other parts of the country. It would be fascinating to dig out more of the factors behind their motivation to install. On this interactive map we can also zoom in to the detailed location of each installation.

However, some areas, like DC and San Francisco, have a much larger population than other parts of the US, which means there are going to be many more installations. An installation rate per 10,000 people is a more appropriate measure. Therefore, I produced another interactive map with the installation rate per 10,000 people: the bigger the circle, the higher the installation rate.

Solar Installation_screen shot4
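The rate itself is simple arithmetic: installations divided by population, times 10,000. Here is a short Python sketch with two invented places (the names and numbers are hypothetical, not taken from the Open PV or census data) showing why the rate can rank places differently from the raw count.

```python
def installs_per_10k(installations, population):
    """Installation rate per 10,000 residents."""
    return installations / population * 10000

# Hypothetical places, NOT real figures
places = {
    "Big City": {"installations": 5000, "population": 800000},
    "Small Town": {"installations": 300, "population": 20000},
}

rates = {name: installs_per_10k(**info) for name, info in places.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} installations per 10,000 people")
```

The big city has far more installations in total, but the small town has the higher rate, which is the kind of place the rate map is designed to surface.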

The highest installation rate in the country is in Ladera Ranch, located in South Orange County, California. The reason behind it is not clear, though, and more analysis is needed.

Solar Installation_screen shot5

Buckland, MA has the highest installation rate on the East Coast. I can’t explain the motivation behind it yet either. Further analysis of the household characteristics would be helpful. These two interactive maps were uploaded to my GitHub repository, where you will be able to see the R code I wrote to process the data as well.

Public Data Sources

To answer these two questions, about 1,670 MB (1.67 GB) of datasets were downloaded and scraped from multiple sources:

(1). Electricity rate by zip code;

(2). A 10 km resolution map of solar energy resources, as an ESRI shapefile, was downloaded from the National Renewable Energy Laboratory (NREL). It was later extracted by the zip code polygons downloaded from ESRI ArcGIS Online.

(3). Current solar panel installation data was scraped from the Open PV website, a collection of installations by zip code that is part of NREL. Registration is required to access the data. The dataset includes the zip code of each installation, its cost, its size, and its state.

(4). Household income, education, and population for each zip code were obtained from the US census.

(5). The average cost of solar installation for each state was scraped from the websites Current cost of solar panels and Why Solar Energy? More datasets for this proposal will be downloaded from the Department of Energy on GitHub via API.

Note: I cannot guarantee the accuracy of the analysis. My results are based on two days of data mining, wrangling, and analysis. The quality of the analysis depends heavily on the quality of the data and on how well I understood the datasets in such a limited time. Further validation of the analysis and datasets is needed.

To contact the author, please find me at https://geoyi.org or email me: geospatialanalystyi@gmail.com.