A bit of crazy machine learning, and my showcase 2: using logistic regression to predict the income category

Uber will offer self-driving cars in Philly this November, and sooner or later you will get a ride in an Uber that pulls up at your door without a human driver. It’s fascinating and crazy at the same time. It sounds like science fiction, but it will definitely be real soon. Part of what has made this a reality is machine learning, and it deserves the credit.


What’s machine learning? It’s a way of teaching a computer to learn from thousands or millions of data records and find patterns or rules, so that it can finish a task the way we want. It is very similar to how we teach babies or pets new things; the only difference is that the student is a computer. For example, the computer in a self-driving car learns the roads and how to navigate a city by going over them thousands of times, until it can drive without a human driver. How much better than a human driver it will be, nobody knows yet, so let’s wait and see what Uber riders say about the self-driving cars this November.

If babies build knowledge from EXPERIENCE, then a computer running a machine learning algorithm builds it from thousands or millions of data records. From past data records (and only past ones, because we don’t have records from the future), it finds patterns that are likely to repeat in the future. Machine learning is a part of artificial intelligence (AI).


Machine learning algorithms are everywhere in our daily lives: spam email filtering, the movie and TV recommendations from Amazon or Netflix, the song recommendations from Spotify or Pandora. Credit card companies can spot fraud when a card is used in a location that is unusual given your past spending records. Several startups already use these algorithms to help customers pick clothes that match their personal taste. The algorithm behind all this pattern sorting is machine learning. You might wonder how a computer learns your preferences when you have only used a service a few times, but don’t forget that there are millions or even billions of other users serving as data points. To an algorithm, your eating, learning and other habits are data points alongside those of millions of other users. It can learn about you from your own habits, but also from the habits of other users in its data cloud. The accuracy still depends on the algorithm and on the person who set the rules, though.


Machine learning sounds fancy and cutting edge, but in terms of methodology it is close to data mining and statistics, which means you can apply the statistical and mathematical methods you learned in school. Machine learning is not about which programming language you code in or whether it runs on a supercomputer; the essence is the algorithm. What is fancy about it is that data scientists can dig the best algorithms and patterns out of data to help us make better everyday decisions, or even let us skip the decision and just ask an app or a computer.


This is part of a series of blog posts I’m writing. I’m no statistics whiz and I’m still learning, so the goal is to help everyone (mostly myself) understand the popular algorithms behind machine learning more deeply. My last post was showcase 1: predicting bike demand for Capital Bikeshare with multiple linear regression; if you are interested in machine learning, I recommend reading it too. This post is showcase 2, on logistic regression. From the name you might think logistic regression is just another kind of regression, but in practice it’s used as a classification method: it answers YES or NO questions, e.g. does this patient have cancer or not; is this a bad loan or not. That’s where false positives and false negatives come in, called Type I errors and Type II errors in statistics. When you look up what they actually are, your math teacher might say “Type I error, and Type II error are where a positive result corresponds to rejecting the null hypothesis, and a negative result corresponds to not rejecting the null hypothesis.” And… ZZzzzz… you fall asleep and never find out what they are.

Here is a good way to remember them. 


Suppose you make the hypothesis that ‘this person is pregnant’. You then collect a tremendous amount of data to test your hypothesis, and here is an example of what a ‘False Positive’ and a ‘False Negative’ are:



(Graph from https://effectsizefaq.com/category/type-i-error/ )

Note: Don’t stop here. The actual Type I and Type II errors are a bit more complicated than this graph, but I hope it helps you remember them as it helps me.
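To make the four possible outcomes concrete, here is a minimal Python sketch (my own analysis is in R; the function name `outcome` and the toy results are made up purely for illustration). It labels one test result against the truth for the hypothesis ‘this person is pregnant’.

```python
def outcome(predicted_pregnant: bool, actually_pregnant: bool) -> str:
    """Name the outcome of one test of the hypothesis 'this person is pregnant'."""
    if predicted_pregnant and actually_pregnant:
        return "true positive"
    if predicted_pregnant and not actually_pregnant:
        return "false positive (Type I error)"
    if not predicted_pregnant and actually_pregnant:
        return "false negative (Type II error)"
    return "true negative"


# Hypothetical results: (what the test said, what was actually true)
results = [(True, True), (True, False), (False, True), (False, False)]
labels = [outcome(said, truth) for said, truth in results]

for (said, truth), label in zip(results, labels):
    print(f"test says {said}, truth is {truth}: {label}")
```

So telling someone who is not pregnant “you are pregnant” is the false positive, and telling someone who is pregnant “you are not” is the false negative.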


Showcase 2: using logistic regression to predict if your salary is gonna be more than 50K


Here, I use an example to show you how the model works.

The dataset I use here was downloaded from UCI. It has about 35,000 records, and its structure is shown in the following screenshot. The variables are age, type of employer, education and years of education, marital status, race, work hours per week, country of origin, and salary. This is just a showcase for studying logistic regression.


(Image: screenshot of the raw data)

Before we go into the logistic model, let’s look at some interesting patterns in the data: the relationship between salary category (<50k, >50k) and education, race, sex, marital status, etc.



People who are married are more likely to earn more than 50k than people who never married or are currently not married.



Many more people earn less than 50k when they are about 25 years old, and people between 40 and 50 are more likely to earn more than 50k.



Whether you earn more or less than 50k does not depend on working longer hours per week.



People with more years of education earn a bit more, whether they are male or female. Of course, this alone can’t tell you that you personally would earn more with more education.



More people are employed in the private sector than anywhere else, and no matter where a person is employed, women are more likely to be in the <50k salary category. It means that within the same type of employer, women are likely to be paid less.



Before running the logistic regression, I split the dataset into two parts: a training dataset and a testing dataset. The training data takes up about 70 percent of the whole dataset. After fitting the model on the training data, I use the testing data to check whether my model/algorithm is good enough. This is where the rates of Type I and Type II errors come in. For the detailed R code I wrote, you can go to my GitHub.
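The 70/30 split can be sketched in a few lines. This is a rough Python illustration, not my actual R code: the helper `train_test_split`, the fixed seed and the stand-in records are all assumptions made for the example.

```python
import random


def train_test_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records, then cut it into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(100))  # stand-in for 100 rows of the salary dataset
train, test = train_test_split(records)
print(len(train), "training rows and", len(test), "testing rows")
```

Shuffling before cutting matters: if the file happens to be sorted (by age, say), taking the first 70 percent would give the model a lopsided view of the data.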


From the model (above graph) you can see that some factors (variables) have a positive impact on income, e.g. age and being married, while others have a negative impact, e.g. when a person’s education stopped at preschool or between 4th and 9th grade. I’ve left out most of the statistics in this post so as not to confuse you, but if you wanna understand more about the statistics behind the algorithm, I recommend the book An Introduction to Statistical Learning, which covers many applications of R in statistical learning. For logistic regression you can jump straight to Chapter 4.
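What a positive or negative impact means in a logistic model can be shown with a tiny Python sketch. The intercept and coefficients below are hypothetical numbers chosen for illustration; they are NOT the fitted values from my model.

```python
import math


def predicted_probability(intercept, coefficients, features):
    """Logistic model: squash a weighted sum of features into a probability."""
    score = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical values, not the fitted ones: a positive weight for being
# married and a negative weight for a low education level.
intercept = -1.0
coefficients = [1.5, -2.0]  # [married, low_education]

p_married = predicted_probability(intercept, coefficients, [1, 0])
p_baseline = predicted_probability(intercept, coefficients, [0, 0])
p_low_edu = predicted_probability(intercept, coefficients, [0, 1])

print(f"married: {p_married:.2f}, baseline: {p_baseline:.2f}, low education: {p_low_edu:.2f}")
```

A positive coefficient pushes the predicted probability of earning more than 50k above the baseline, and a negative one pulls it below. The model then answers YES when the probability passes a cut-off, commonly 0.5.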

To know whether the algorithm I built is a good one, I need to test the model, and the following measures give me the answer. For example, the accuracy of the model is the proportion of true positives and true negatives among all the predictions made.
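Here is a minimal Python sketch of how these measures can be computed from predictions on the testing data. The function `evaluate` and the ten toy labels are made up for illustration (1 means “more than 50k”), and the sketch assumes both classes appear in the true labels.

```python
def evaluate(predicted, actual):
    """Compare 0/1 predictions with 0/1 true labels and summarise the errors.

    Assumes `actual` contains at least one 0 and one 1, so no division by zero.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)  # true positives
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)  # true negatives
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)  # Type I errors
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)  # Type II errors
    return {
        "accuracy": (tp + tn) / len(pairs),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


# Toy labels, not results from the salary model
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
actual = [1, 0, 0, 0, 1, 1, 0, 0, 1, 0]
metrics = evaluate(predicted, actual)
print(metrics)
```

Accuracy alone can hide a lot: a model can score well overall and still make one type of error much more often than the other, which is why both error rates are worth reporting.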




There are three categories of machine learning algorithms: supervised learning, unsupervised learning, and reinforcement learning. Logistic regression and linear regression both belong to supervised learning.


My best self-teaching strategy is ‘learning by doing’: ‘get your hands dirty’ is always the best way to learn and lock in whatever you wanna master. I have had so much fun learning the algorithms and statistics behind machine learning through these showcases, and here are some great blogs to read too. If you are interested in learning more about machine learning algorithms together with me, you could follow my blog or Twitter: @geonanayi





Data-driven city planning: Using multiple linear regression to predict bike sharing demand in Washington DC

A bike sharing system is a convenient and clean way to get around a city: you get a membership, rent a bike and return it. It’s getting popular in densely populated cities around the world.



(this picture is from http://kaggle.com)


While I was in Chicago on a business trip, my favourite activity was riding a rental bike around Lake Michigan. On a bright autumn afternoon with few clouds and the air cooling down, it’s the perfect time to rent a bike and explore an unfamiliar city as a tourist. However, it’s quite frustrating when millions of tourists are renting bikes too. It’s unpleasant when there are no bikes left, or when too many bikers share a tiny bike lane. Finding a nice fall afternoon without many people on the road makes perfect sense when you’re a tourist in an exciting city you wanna explore.


As a city bike sharing manager, you wanna share as many bikes as possible with potential riders, and I am sure you will have concerns like these:


  1. How many bikes are actually needed in the city bike sharing system?
  2. How does the bike demand vary from day to day with temperature, weather, holidays, and humidity?

It is most cost-efficient if the city doesn’t provide too many bikes, but it’s also important not to run short. For a manager, what matters most is serving residents well with an economical number of bikes.


Here, I got the data from http://kaggle.com; the dataset has about 11,000 records covering two years of bike use. It was provided by Hadi Fanaee Tork using data from Capital Bikeshare. Capital Bikeshare is a bike sharing system in Washington DC that rents bikes to people going to the Metro, going to work, or running errands. It has more than 3,000 bikes at over 350 stations across Washington DC, Arlington and Alexandria, VA, and Montgomery County, MD, and a bike can be returned to any station near your destination. I have lived near DC for a while but have not used the bikes yet; it might be worth a try, since the first 30 minutes are free.


Some results from the data analysis




(1-Spring; 2-Summer, 3-Fall; and 4-Winter)


From the graph, we can tell that the most people use bikes during the fall, and the fewest bike around during the spring.


Through the year, bike demand starts to climb after April and declines after October. The demand peak is around September, at least in these two years of data records.


When I replot the data over 24 hours for working days, from midnight to 23:00, the pattern of bike demand is: 1) as the temperature rises, more people are on bikes; 2) there are two peaks of bike demand in a working day, one in the morning around 8:00 and one in the evening around 18:00; 3) people like to use bikes during lunch time when the temperature is warmer than 20 degrees.
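Finding those two peaks boils down to averaging the demand per hour and picking the busiest hour in each half of the day. A small Python sketch of that idea follows; the (hour, count) rows are made up and are not taken from the Capital Bikeshare data.

```python
from collections import defaultdict

# Made-up (hour of day, bikes rented) rows for a few working days
rows = [
    (7, 150), (8, 420), (8, 460), (9, 230), (12, 180),
    (17, 400), (18, 480), (18, 440), (19, 300),
]

# Collect the counts seen at each hour, then average them
counts_by_hour = defaultdict(list)
for hour, count in rows:
    counts_by_hour[hour].append(count)
mean_by_hour = {hour: sum(c) / len(c) for hour, c in counts_by_hour.items()}

# Busiest hour before noon and busiest hour from noon onwards
morning_peak = max((h for h in mean_by_hour if h < 12), key=mean_by_hour.get)
evening_peak = max((h for h in mean_by_hour if h >= 12), key=mean_by_hour.get)
print("morning peak hour:", morning_peak, "evening peak hour:", evening_peak)
```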



However, bike demand looks a bit different on holidays. The maximum demand is not as high as on working days, which suggests that DC-area residents are the ones using the bikes on working days. Holiday demand is also more spread out than on working days: it starts slowly after 8:00 when the temperature is pleasant, and the demand peak appears around 13:00 to 17:00.



In the graphs above we have only dug into bike demand, which is labelled “count” in the dataset, together with (mainly) temperature. But in reality demand does not depend on temperature alone. If we wanna predict how many bikes we actually need each day, just imagine the conditions under which you wouldn’t wanna ride a bike in DC. Speaking only for myself, I don’t wanna ride a bike when: 1) it’s too cold out there (oops, tropical people fear the cold); 2) it’s too humid and sticky; 3) it’s too windy (I’m on the skinny side, hehe); 4) it’s raining hard; 5) there are too many people out there riding bikes.

To make a prediction like a bikeshare system manager, we need to know the correlation between each pair of variables: humidity, weather, workingday, windspeed, hour, holiday and so on. So I produced a graph that shows the correlation for every pair of variables. Blue represents positive correlation and red means negative correlation. For example, looking at the column for ‘count’ (the bike demand I mentioned above), it has a positive correlation with temperature (‘temp’ in the graph), which means that when the temperature goes up more people like to ride, but a negative correlation with humidity (‘humidity’ in the graph), which indicates that people would rather not ride a bike when humidity is high. Note: this analysis is linear, which means I just assume each pair of variables is linearly related, and that does not always reflect reality. For example, I would not bike outside when it’s too hot, yet the positive correlation says people love to bike even more as it gets hotter. Maybe DC just rarely gets that hot, or there are too few very hot days in the sample to affect the overall picture.
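Each cell of such a correlation graph is one correlation coefficient. As a rough Python sketch (my own plots were made in R), the Pearson correlation for a pair of variables can be computed like this; the five observations are invented for illustration and the helper name `pearson` is my own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: +1 is a perfect rising line, -1 a perfect falling one."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Invented observations, not the Capital Bikeshare records
temp = [10, 15, 20, 25, 30]
humidity = [90, 80, 70, 60, 40]
count = [50, 120, 200, 260, 330]

r_temp = pearson(temp, count)          # would be drawn in blue (positive)
r_humidity = pearson(humidity, count)  # would be drawn in red (negative)
print(f"count vs temp: {r_temp:.2f}, count vs humidity: {r_humidity:.2f}")
```

Because the coefficient only measures how well a straight line fits, a relationship that rises and then falls (comfortable, then too hot) can still come out as a plain positive number.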


I ran a linear regression of bike demand on the variables above and got the regression below. It can help us predict how many bikes we actually need in the Capital Bikeshare system each day, according to the weather, holidays, temperature, etc. As a tourist, you could also decide whether you wanna go out today based on the weather forecast and a rough prediction of how many bikes will be around in the city.
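To show the mechanics of fitting and predicting, here is a deliberately simplified Python sketch with a single predictor (temperature). My actual model is a multiple linear regression in R with several predictors, so treat the helper `fit_line` and the invented numbers as an illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Invented data lying exactly on a line, so the fit is easy to check by hand
temp = [10, 15, 20, 25, 30]
count = [60, 110, 160, 210, 260]

intercept, slope = fit_line(temp, count)
predicted_count = intercept + slope * 22  # predicted demand at 22 degrees
print(f"count = {intercept:.1f} + {slope:.1f} * temp; at 22 degrees: {predicted_count:.0f}")
```

With more predictors the idea is the same: each variable gets its own coefficient, and a prediction is the intercept plus every coefficient times the value of its variable.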


(Image: linear regression output)

At this point we can make a prediction under some assumptions: it’s fall, and a holiday; the weather is clear with few clouds; temp is 30, but air temp is about 34; humidity is about 70%; wind speed is about 2; and it’s close to 16:00. So we can predict how many bikes are needed for this particular hour, day and weather.

The answer is 781 bikes.

My R code can be found here: http://rpubs.com/Geoyi/BikeshareDC_LM.



Statistics is quite useful, isn’t it?!


Global Zika virus epidemic from 2015 to 2016: A big data problem

The Centers for Disease Control and Prevention (CDC) provided data on the Zika virus epidemic from 2015 to 2016, about 107,250 observed cases globally, to kaggle.com. Kaggle is a platform where data scientists compete on data cleaning, wrangling and analysis to provide the best solutions to big data problems, and where datasets are shared.

The Zika virus epidemic is an interesting problem, so I took the challenge and coded an analysis in RStudio. However, after finishing a rough analysis, I found that this works better as an example of big data analysis than as a definitive picture of the Zika epidemic for CDC, because the raw data has not been cleaned and clarified yet. The raw data description can be seen here.


A bit of background on Zika and the Zika virus epidemic, from CDC:

  • Zika is spread mostly by the bite of an infected Aedes species mosquito (Ae. aegypti and Ae. albopictus). These mosquitoes are aggressive daytime biters. They can also bite at night.
  • Zika can be passed from a pregnant woman to her fetus. Infection during pregnancy can affect the fetus’s normal neural development and cause certain birth defects, e.g. microcephaly. Microcephaly is a rare nervous system disorder that causes a baby’s head to be small and not fully developed.
  • There is no vaccine or medicine for Zika yet.

Initial outputs from the data analysis

First, let’s look at the animation of Zika virus observations globally. Cases were recorded from Nov. 2015 to July 2016. At least according to the documented cases during this period, it started in Mexico and El Salvador, and then spread to South American countries and the USA. The gif animation makes the data visualization look fancy, but when I looked more closely, the dataset needs serious cleaning and wrangling.



Raw data, as it looks when opened in Excel

(Image: dataset screenshot)

The raw data is organized by report date, case location, location type, data field/data category, field code, period, period type, value (how many observations/cases), and unit.



When I plotted the cases by country from 2015 to 2016, many more Zika cases were observed in 2016, especially in South American countries (though records for 2015 only start in November, so the comparison does not prove much). Colombia had by far the most reported Zika cases. In the USA, Puerto Rico, New York, Florida and the Virgin Islands have reported Zika cases so far. Cases reported in Asia and Europe are not included in this dataset. During the recorded period, 12 countries in North, Central and South America reported Zika virus cases. From the most reported cases to the least, these countries are: Colombia (86,889 reported cases), Dominican Republic (5,716), Brazil (4,253), USA (2,962), Mexico (2,894), Argentina (2,091), El Salvador (1,000), Ecuador (796), Guatemala (516), Panama (148), Nicaragua (125) and Haiti (52). See the map below.
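A ranking like this comes from adding up the case values per country and sorting. Below is a minimal Python sketch of that step (my analysis itself is in R). The rows are invented, and the “Country-Region” location format is an assumption made for the example.

```python
from collections import Counter

# Invented rows: (location, number of reported cases).
# The "Country-Region" naming is an assumption for this sketch.
rows = [
    ("Colombia-Antioquia", 120),
    ("Brazil-Bahia", 40),
    ("Colombia-Huila", 300),
    ("Mexico-Chiapas", 25),
    ("Brazil-Acre", 10),
]

cases_by_country = Counter()
for location, cases in rows:
    country = location.split("-")[0]  # keep only the part before the first dash
    cases_by_country[country] += cases

# Countries from the most reported cases to the least
ranking = cases_by_country.most_common()
for country, total in ranking:
    print(f"{country}: {total}")
```

A sum like this is only as good as the rows going in: if the same cases appear under several data fields, they get counted more than once.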


However, when I went back to organize the reported Zika cases for each country, I found that the data recorded for each country was not consistent. It’s obvious that each country has its own strengths and constraints in tracking the Zika epidemic. Let’s see some examples:



In the States, most of the reported cases are from travel. But I am confused: don’t the confirmed fever, eye pain and headache cases overlap with the reported Zika cases, and weren’t the zika_reported travel cases already included in yearly_reported_travel_cases? If so, the cases were overestimated for most of the countries. Probably only CDC could explain the data properly, from a medical and epidemiological perspective.


According to the reported cases, microcephaly caused by the Zika virus has so far only been confirmed in Brazil and the Dominican Republic. Microcephaly is a rare nervous system disorder that causes a baby’s head to be small and not fully developed: the child’s brain stops growing as it should. Women who are pregnant or planning a pregnancy should therefore avoid traveling to these countries. People get infected with Zika through the bite of an infected Aedes species mosquito (Aedes aegypti and Aedes albopictus), but it can also be passed on through sex. Earlier reports only found that a man with Zika can pass it to his sex partners, but there has recently been a case where a woman infected with Zika passed the virus to her partner too.

This is a very rough analysis, but if you are interested, my original R code can be accessed here. The first gif animation was originally coded by a UK-based data scientist, Rob Harrand; I only edited the time interval of the data presented and the image resolution. Everything else apart from the animation was coded by me; if you reuse it, please credit geoyi.org.

Note: Again, this is an example of big data analysis rather than a definitive picture of the Zika virus epidemic for CDC, because the raw data from CDC still needs serious cleaning. For more insight, please follow CDC’s reports and recorded cases.