Data-driven city planning: Using multiple linear regression to predict bike sharing demands in Washington DC 数字化城市计划:多元线性回归帮您预测华府自行车租用情况

A bike sharing system is a convenient and clean way to get around a city: you buy a membership, rent a bike, and return it. These systems are becoming popular in densely populated cities around the world.

现在很多城市都已经开始使用自行车租用系统。自行车租用是通过购买会员,租车和还车的一个系统。

bikes

(this picture is from http://kaggle.com)

(图片来源于http://kaggle.com)

While I was in Chicago on a business trip, my favourite activity was riding a rental bike along Lake Michigan. A bright autumn afternoon with few clouds and cool air is the perfect time to rent a bike and explore an unfamiliar city as a tourist. However, it is quite frustrating when crowds of other tourists are renting bikes too: there may be no bikes left, or too many riders sharing a narrow bike lane. So finding a nice fall afternoon without many people on the road makes perfect sense when you’re a tourist in an exciting city you want to explore.

我在芝加哥出差的时候最难忘的经历便是骑着自己租用的自行车环着密歇根湖环行。如果在午后空气凉爽的秋季骑上自行车到处逛逛你没有住过的城市,该是多么美好的一件事情啊。但是对于一个普通的只想探索一个未知城市的旅行者来说,过多的自行车骑客拥挤在一条小小的自行车道上也还是蛮郁闷的。所以对于游客来说找一个秋季凉爽的天气,没有多少其他行人出去骑车体验未知的城市那是多美好,对吧!

As the manager of a city bike sharing system, you want to share as many bikes as possible with potential riders, and I am sure you will have concerns such as:

但是对于一个自行车租用系统的管理者来说,他担忧的内容又完全和游客考虑的内容是不一样的。对于管理者他可能会有以下的顾虑:

  1. How many bikes are actually needed in the city bike sharing system? 我们的城市自行车租用系统到底需要多少自行车?
  2. How does the bike demand vary from day to day with the temperature, weather, holidays, and humidity? 每一天自行车的租用情况随着气温,天气,假期和空气湿度等等的变化是如何变化的?

It is most cost-efficient if the city does not provide too many bikes, but it is also important not to run short. 所以对于一个管理者来说,用经济的自行车租用数量最好的服务市民才是最重要的。

Picture1

I got the data from http://kaggle.com; the dataset has about 11,000 records. It was provided by Hadi Fanaee Tork using data from Capital Bikeshare. Capital Bikeshare is a bike sharing system in Washington DC that rents bikes to people who are going to the Metro, going to work, or running errands. It has more than 3000 bikes at over 350 stations across Washington DC, Arlington and Alexandria, VA, and Montgomery County, MD, and a bike can be returned to any station near your destination. I have not used the bikes in DC yet, but it might be worth a try: the first 30 minutes are free.

这个数据来源是http://kaggle.com,一共有两年11,000个自行车使用情况的记录。原始数据由华府(美国首都华盛顿特区)的华府自行车共享的Hadi Fanaee Tork提供。华府自行车共享系统是针对居民出行的需要(去地铁,去工作,购物)设置的。整个系统有350多个租用站3000多辆自行车,分别分布于华府,佛吉尼亚州的阿灵顿,压力山大港,以及马里兰州的蒙哥马利县各处。 虽然住在华府附近已久但是我自己还没有使用过这个系统里的自行车,听说最开始的30分钟是免费的—值得一试。

Map

Some results from the data analysis

初始数据分析结果

 

Rplot02

(1-Spring; 2-Summer, 3-Fall; and 4-Winter)

(图中1是指春季,2是指夏季,3是秋季,4是冬季)

From the graph, we can tell that the most people use bikes during the fall, and the fewest people bike during the spring. 从上图可以看出来秋季是人们最喜欢骑自行车但是春季是最不喜欢骑自行车的季节。

Rplot01

Through the year, bike demand starts to climb after April and declines after October. The demand peak is around September, at least in the two years of data records.

从一整年的情况来看人们最喜欢骑自行车的月份开始于4月然后到10月份就开始下降了。需求量的最高峰出现在9月份。

When I replotted the data by hour of the day for working days, from midnight to 23:00, the following patterns in bike demand emerged: 1) as the temperature rises, more people are on bikes; 2) there are two peaks of bike demand in a working day, in the morning around 8 am and in the evening around 6 pm (18:00); 3) people like to use bikes during lunch time when the temperature is warmer than 20 degrees.

如果把自行车的需求量按照一天24小时来作图,这个提取的数据是工作日的自行车需求量。那么从下图我们可以分析出一些规律:1)气温升高的时候骑自行车的人也变多了;2)在工作日的24小时里头早上和下午出现两个使用自行车的高峰期,不高气温高低;3)在午餐休憩期间也有不少人使用自行车呢,特别是温度高于20度之后使用的人似乎更多。

Rplot03

However, the bike demand looks a bit different on a holiday. The maximum bike demand is not as high as on working days, which suggests that the working-day peaks come from DC-area residents using bikes to commute. Holiday demand is more spread out than on working days: it slowly starts to rise after 8 am when the temperature is pleasant, and the demand peak appears from around 13:00 to 17:00.

但是在假期的时候自行车的使用情况和工作日还是有所不同的。至少从需求量来看假期的自行车需求在最高峰的时候没有工作日多,但是高峰期更宽时间跨度更长。这个高峰期主要集中出现在下午1点到5点之间。

Rplot04

In the graphs above, we have only explored bike demand, which is labelled “count” in the dataset, together with (mainly) temperature. If we want to predict how many bikes we actually need each day, just imagine the conditions under which you would not want to ride a bike in DC. Speaking only for myself, I don’t want to ride a bike when: 1) it’s too cold out there (oops, tropical people); 2) it’s too humid; 3) it’s too windy; 4) it’s raining hard; 5) there are too many people out there riding bikes.

上面几个图中我们只是观察了自行车需求量和气温的关系。但是对于现实情况来看自行车需求量其实不只是和气温有关系。从我个人角度来讲,以下情况下我就不可能在外头骑自行车:1)外面太冷 (热带人们怕冷);2)外头湿度太大,黏答答的有没有?!;3)风太大(毕竟人比较瘦,嘻嘻);4)雨下得太大了;5)其实汽车人太多我也不喜欢哎~

To make a prediction like a bikeshare system manager, we need to know the correlation between each pair of variables, such as humidity, weather, workingday, windspeed, hour, holiday and so on. Therefore, I produced a graph of the correlation for each pair of variables. Blue represents positive correlation, and red represents negative correlation. For example, looking at the column of ‘count’ (the bike demand I mentioned above), it has a positive correlation with temperature (‘temp’ in the graph), which means that when the temperature goes up more people like to ride bikes, but it has a negative correlation with humidity (‘humidity’ in the graph), which indicates that people do not like to ride a bike when humidity is high. Note: this is a linear analysis, which means I assume each pair of variables is linearly related, and this sometimes does not reflect reality. For example, I would not bike outside when it’s too hot, but the positive correlation says that people would love to bike even more as it gets hotter.
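As a rough illustration of what one cell of such a correlation graph measures, here is a minimal Python sketch of the Pearson correlation coefficient. It is not the R code behind the graph, and the numbers are made up for illustration rather than taken from the Kaggle dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up records (NOT from the real dataset), for illustration only.
temp = [10, 15, 20, 25, 30]
humidity = [90, 80, 70, 60, 50]
count = [50, 120, 200, 260, 330]

r_temp = pearson(temp, count)          # positive: demand rises with temperature
r_humidity = pearson(humidity, count)  # negative: demand falls as humidity rises
print(round(r_temp, 3), round(r_humidity, 3))
```

A value close to +1 would be drawn in a strong blue and a value close to -1 in a strong red. The coefficient only captures straight-line relationships, which is exactly the limitation noted above.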

那么要像一个自行车租用管理者一样思考,我们就要知道我提的以上的变量彼此之间怎么互相影响,对吧。所以我又做了一个图,对比两两变量之间的关系是怎么样的。下图的蓝色代表的是正相关,红色系表示的是负相关。我们就可以看,自行车的需求量(图中的“count”)那一栏对到humidity(空气湿度)的饼图就发现他们其实就是负相关,意思就是如果外面湿度太大在外面汽车的人就越少,反过来说就是这时候自行车的需求量就小了。那对着看count 和temp(温度)的关系就发现他们是正相关,正相关的意思就是温度越高我就越爱在外头骑自行车,所以对于一个城市来说自行车的需求量就高了。当然我们这里做的是线性回归。线性回归的意思就是,我们都是直来直去的关系不拐弯抹角。但是其实这不能反映现实情况,比如温度直线上升我怎么会喜欢在外头骑自行车呢,对吧?!但是可能华府它气温就不可能太高,或者说气温高的天数太少了从真个数据(样本)来看对整体不构成影响。

Rplot05

I ran a linear regression of bike demand on the variables above and got the regression below. It can help us predict how many bikes we actually need in the Capital Bikeshare system each day, according to the weather, holiday, temperature, etc. As a tourist, you could also decide whether you want to go out today based on the weather forecast and a rough prediction of how many bikes are going to be around in the city.
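To make the idea of a multiple linear regression concrete, here is a minimal Python sketch that fits count = b0 + b1*temp + b2*humidity by ordinary least squares using the normal equations. It is only an illustration under stated assumptions: the data are made up, only two predictors are used, and the function names `fit_ols` and `predict` are hypothetical. It is not the model from my R code.

```python
def fit_ols(rows, targets):
    """Least-squares fit of y = b0 + b1*x1 + ... via the normal equations."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefs, row):
    """Apply the fitted coefficients to one new observation."""
    return coefs[0] + sum(c * x for c, x in zip(coefs[1:], row))


# Made-up (temp, humidity) rows and counts, generated from
# count = 50 + 8*temp - 1*humidity so the fit can be checked by eye.
rows = [(10, 80), (15, 60), (20, 70), (25, 50), (30, 65), (35, 40)]
counts = [50, 110, 140, 200, 225, 290]

coefs = fit_ols(rows, counts)
estimate = predict(coefs, (30, 70))  # a hypothetical hour: temp 30, humidity 70
print([round(c, 3) for c in coefs], round(estimate, 1))
```

Because the toy data follow the generating formula exactly, the fit recovers coefficients close to 50, 8 and -1. Real data are noisy, and the real model uses more variables (season, holiday, weather, windspeed, hour and so on).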

从下头的多元线性回归中,我们便可以依据每天天气情况,是否是假期等等因素来考虑每天自行车的需求量,这也就是一个自行车租用系统管理者关心的。但是反过来,作为游客我们也可以依照天气预报大概估算一下今天在街上租用自行车的人大概有多少人,如果喜欢热闹就选在人相对多的时候出门如果不喜欢热闹怕吵那就在人少的时候出门。

linear regression

At this point we can make a prediction under some assumptions: it’s fall, and a holiday; the weather is clear with few clouds; the temperature is 30, but the air temp (“feels like” temperature) is about 34; humidity is about 70%; wind speed is about 2; and it’s close to 16:00 now. So we can predict how many bikes are needed for this particular hour, day and weather.

The answer is 781 bikes.

My R code can be found here: http://rpubs.com/Geoyi/BikeshareDC_LM.

到这个点上,我们就可以大概预测:今天是秋天里气候凉爽,少云;气温在30度左右,湿度为70%,风速不大大概在2左右,然后现在快要下午四点了,而且还是不用工作的假期。那么从上么的公式我们就可以大概预测,今天在周围活动的自行车大概是781辆。

我的数据分析R程序代码在这里:http://rpubs.com/Geoyi/BikeshareDC_LM。

Statistics is quite useful, isn’t it?!

统计是不是很有用呢?!

Global Zika virus epidemic from 2015 to 2016: A big data problem- 大数据分析全球Zika病毒传染

The Centers for Disease Control and Prevention (CDC) provided data on the Zika virus epidemic from 2015 to 2016, about 107,250 observed cases globally, to kaggle.com. Kaggle is a platform where data scientists compete on data cleaning, wrangling and analysis to provide the best solutions to big data problems.

美国疾病传染防控中心 (CDC) 给大数据分析师们提供了一个记录有十多万个全球Zika病毒传染案例。这个数据传到了Kaggle网站上,Kaggle网站是一个大数据分析比赛和数据共享平台。

The Zika virus epidemic is an interesting problem, so I took the challenge and coded an analysis in RStudio. However, after finishing a rough analysis, I found that this is better seen as an example of big data analysis than as a complete analysis of the Zika virus epidemic for the CDC, because the raw data has not been cleaned and clarified yet. The raw data description could be seen here.

我觉得这个挑战还蛮有意思的,所以也下载了数据来分析看看。这个博客里头提供的是我初始分析的一些结果。但是必须提前申明的一点是:由于CDC提供的原始数据本身还是满粗糙也有很多记录不明晰的地方,所以我的这个分析以其说是一个解决方案不如说是一个纯粹的大数据分析案例。

A bit of background on Zika and the Zika virus epidemic, from the CDC.

  • Zika is spread mostly by the bite of an infected Aedes species mosquito (Ae. aegypti and Ae. albopictus). These mosquitoes are aggressive daytime biters. They can also bite at night.
  • Zika can be passed from a pregnant woman to her fetus. Infection during pregnancy can cause certain birth defects, e.g. Microcephaly.  Microcephaly is a rare nervous system disorder that causes a baby’s head to be small and not fully developed.
  • There is no vaccine or medicine for Zika yet.

关于Zika和Zika病毒传染的一些背景知识:

  • Zika由通过Aedes蚊虫叮咬传播(主要是该蚊子的两个分种:Ae. aegypti 和Ae. albopictus 传播)。该蚊虫叮咬主要发生在白天,当然也会发生在晚上。
  • Zika的危险之处是病毒可以通过怀孕的母亲传给其腹中的婴孩。病毒可以影响胎儿正常的神经发育而引起生育缺陷,包括现在被发现和报道的小头症。
  • 目前可预防Zika的药物和预防针还没有。

Initial outputs from the data analysis 初始的分析结果

First, let’s see the animation of the Zika virus observations globally. The cases were recorded from Nov. 2015 to July 2016. At least in the documented cases during this period, the epidemic started in Mexico and El Salvador, and spread to South American countries and the USA. The gif animation makes the data visualization look fancy, but when I looked more deeply, I found that the dataset needs serious cleaning and wrangling.

CDC提供的数据采集于2015年11月到2016年7月份之间。从下图动画中可以看出这段时间之内Zika的传播是从墨西哥和萨尔瓦多两个国家开始传播的。虽然这个动图让传染病从一个国家到另一个国家的传播速度更为明了,但是其实仔细看下来CDC提供的这个原始的数据却还是需要特别清理的。换句话来说就是数据采集,和记录挺混乱。

Zika_ani.gif

Raw data 原始数据用Excel表格打开的样子

dataset screenshot

The raw data is organized by report date, case location, location type, data field/data category, field code, period, period type, value (how many observations/cases) and unit.

原始数据的记录记录是每一个Zika案列发生的时间,地点,地点类型(是区域还是省级的),案例类型,类型代码,发生的时段,发生的类型,以及案列数等等。

Rplot

When I plotted the cases by country from 2015 to 2016, we could see that far more Zika cases were observed in 2016, especially in South American countries. Colombia had by far the most reported Zika cases. In the USA, Puerto Rico, New York, Florida and the Virgin Islands have reported Zika cases so far. During the period covered by the data, 12 countries reported Zika virus cases; from the most reported cases to the least, they are: Colombia (86,889 reported cases), Dominican Republic (5,716), Brazil (4,253), USA (2,962), Mexico (2,894), Argentina (2,091), El Salvador (1,000), Ecuador (796), Guatemala (516), Panama (148), Nicaragua (125) and Haiti (52). See the map below.

把原始数据按照记录直接用来作图的话就会发现Zika传染病被报道的案例从2015年到2016年有一个数量级的爆发。换句话来说就是2016年的数量比2015年要多很多(不过2015年的数据记录才从11月份开始,所以其实也不足以说明问题)。哥伦比亚这个国家Zika被报道的案例在2016年是全球最高的。美国的话也有近3000个案例被记录在案,其中波多黎各,纽约,佛罗里达和维京各岛屿相继都有Zika案例报道。从全球传播来看亚洲欧洲被报道的案例数没有被包括在这个数据之中,而有12个北美,中美和南美的国家被大量报道Zika病毒的传播。这12个国家和这些国家被记录的Zika案例数量从最高到最低来看分别是:哥伦比亚 (86889 报道案例),多米尼加共和国(5716),巴西(4253),美国(2962),墨西哥(2894),阿根廷(2091 ),萨尔瓦多(1000),厄瓜多尔(796),危地马拉(516),巴拿马(148),尼加拉瓜 (125)和海地(52)。请看一下地图。

Rplot01

However, when I went back to organize the reported Zika cases for each country, I found that the data recorded for each country was not consistent. It’s obvious that each country has its own strengths and constraints in tracking the Zika epidemic. Let’s see some examples:

所以我接下来想要看的就是每个国家记录的Zika案列都可以怎么分类。但是其实从下图就可以看出来每个国家对于案例的追踪和记录还是有所差别的,可能和每个国家负责记录数据,追踪案例的机构都不同有关系。大家可以通过以下各图来了解一个究竟:

Rplot14 Rplot13 Rplot12 Rplot11 Rplot10 Rplot09 Rplot08 Rplot07 Rplot06 Rplot05 Rplot04 Rplot03 Rplot02

In the States, most of the reported cases are travel-related. But I am confused: don’t the confirmed fever, eye pain and headache cases overlap with the reported Zika cases, and weren’t the zika_reported travel cases included in yearly_reported_travel_cases? If so, the cases were overestimated for most of the countries. Probably only the CDC could explain the data better from a medical and epidemiological perspective.

就比如在美国被报道最多的案例类型中,其实是旅游相关的,就是病毒传染者去过病毒传播比较猖狂的国家。但是数据记录类型来看有症状相关的记录比如确定发烧,眼睛疼和头疼的案列,难道这些案列不是和已经怀疑或者的确诊的案列是重合的吗?难道眼睛疼和发烧是两个独立的案例和症状?所以有此就可以看出CDC提供的原始数据本身在分析之前是需要好好的理解也需要好好的清理一下的。或者数据记录都正确,但很多让人不解的地方似乎也只有CDC自己出来解释了。

From the reported cases, microcephaly cases caused by Zika virus were only found in Brazil and the Dominican Republic. Microcephaly is a rare nervous system disorder that causes a baby’s head to be small and not fully developed; the child’s brain stops growing as it should. People get infected with Zika through the bite of an infected Aedes species mosquito (Aedes aegypti and Aedes albopictus). A man with Zika can pass it to his sex partners, and there has also been a case in which a woman infected with Zika virus passed the virus to her partner.

从发生的Zika案例来看Zika病毒感染引起的小头症(Microcephaly )目前只有在多米尼加共和国和巴西这两个国家被确诊和报道过。小头症是一种病毒感染而阻止婴孩神经系统正常发育,而引起的不正常头部发育。小头症顾名思义就是婴孩脑子的发育比正常发育的头要小,婴孩的脑子停止发育造成的。所以准备怀孕和已经怀孕的妇女其实应该避免到这些国家履行。现在已经被报道Zika病毒除了通过蚊虫叮咬传播其实通过性交也是可以传播的。之前报道只发现感染病毒的男性通过性交会把病毒传给其女伴,但是最近有一个案例也说明感染病毒的女性同样也可以通过性交传播病毒给其男伴。

My original R code can be accessed here; the first gif animation was originally coded by a UK-based data scientist, Rob Harrand, and I only edited the interval of the data presented and the image resolution.

这也算是一个非常粗糙的分析,但是如果大家对我的原始分析程序感兴趣,请移步这里。这个博客中使用的动图原始程序是英国大数据分析师Rob Harrand做的,我只是改了他的参数还有生成的动态图的尺寸。当然除了动图之外其他程序都是我写的,如果有需要请注明出于geoyi.org.

Note: Again, this is an example of big data analysis rather than a complete analysis of the Zika virus epidemic for the CDC, because the raw data from the CDC still needs serious cleaning. For more insight, please follow the CDC’s reports and recorded cases.

注明:再一次重申这个大数据分析以其说是给CDC做的完整的分析不如说是一个纯粹的大数据分析案例。因为大家可以看到其实这个原始数据是需要特别清理的,而且部分数据应该只有CDC他们自己才能够解释清楚的。如果大家感兴趣可以去看看CDC相继的报道以及数据记录。

 

PV Solar Installation in US: who has installed solar panel and who will be the next?

Project idea

Photovoltaic (PV) solar panels, which convert solar energy into electricity, are one of the most attractive options for homeowners. Studies have shown that by 2015 about 4.8 million homeowners had installed solar panels in the United States of America. Meanwhile, the solar energy market continues to grow rapidly. The estimated cost and potential savings of solar are the questions homeowners care about most. There is also tremendous commercial potential for the solar energy business, and visualizing the long-term trend of the market is vital for solar energy companies’ survival. The visualization process could be realized by examining the following aspects:

  1. Who has installed PV panels, and what are the characteristics of the household, e.g. what’s the age, household income, education level, current utility rate, race, home location, current PV resource, existing incentive and tax credits for those that have installed PV panels?
  2. What does the pattern of solar panel installation look like across the nation, and at what rate is it growing? Which households are the most likely to install solar panels in the future?

The expected primary output from this proposal is a web map application. It will contain two major functions. The first is the cost and returned benefit for households according to their home geolocation. The second is interactive maps for companies showing the geolocations of their future customers and the growth trends.

Initial outputs


The cost and payback period for the PV solar installation: Why not go solar!

NetCost

Incentive programs and tax credits bring down the cost of solar panel installation. This is the average cost for each state.

Monthly Saving

Going solar would reduce homeowners’ spending on electricity bills.

Payback Years

Payback years vary from state to state, depending on incentives and costs. High cost does not necessarily mean a longer payback period because it also depends on the state’s current electricity rate and state subsidy/incentive schemes. The higher the current electricity rate, the sooner you would recoup the costs of solar panel installation. The higher the incentives from the state, the sooner you will recoup the installation cost.
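The relationship described above can be written as simple arithmetic. Below is a minimal Python sketch, assuming the payback period is the net cost (gross cost minus incentives) divided by the yearly saving on the electricity bill. The function name and all the dollar figures are hypothetical, not values from my maps.

```python
def payback_years(gross_cost, incentives, monthly_saving):
    """Years needed to recoup the net installation cost from bill savings."""
    net_cost = gross_cost - incentives
    return net_cost / (monthly_saving * 12)


# Hypothetical household: 20,000 USD system, 8,000 USD of incentives.
low_rate = payback_years(20_000, 8_000, monthly_saving=100)   # 10.0 years
high_rate = payback_years(20_000, 8_000, monthly_saving=150)  # shorter payback
more_incentive = payback_years(20_000, 11_000, monthly_saving=100)  # 7.5 years
print(low_rate, round(high_rate, 2), more_incentive)
```

A higher electricity rate shows up as a larger monthly saving, and a larger incentive shrinks the net cost; both shorten the payback period.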

How many PV panels have been installed and where?

Number of Solar Installation

The number of solar panels installed in each state, as registered on NREL’s Open PV Project. I was able to collect about 500,000 installations from the Open PV Project. It’s zip-code-based data, so I was able to merge it with the “zip code” package in R. My R code file is available in my GitHub project.

Other statistical facts: American homeowners who installed solar panels generally have a household income $25,301.5 higher than the national household income. Their homes are located in places that have higher electricity rates, about 4 cents/kW greater than the national average, and that also have higher solar energy resources, about 1.42 kW/m2 higher than the national average.

Two interactive maps were produced in RStudio with “leaflet”

Solar Installation_screen shot1

An overview of the solar panel installation in the United States.

Solar Installation_screen shot2

According to the data registered on the Open PV Project, residents on the West Coast have installed about 32,000 solar panels, most of them in California. When zoomed in closely, one can easily browse through the details of the installation locations around San Francisco.

Solar Installation_screen shot3

Another good location would be the District of Columbia (Washington D.C.) area. The East Coast has less solar energy resource (kW/m2) compared to the West Coast, especially California. However, the number of solar panel installations by homeowners around the DC area is very high too. From the maps above, we know that this is because the cost of installation is much lower and the payback period is much shorter than in other parts of the country. It would be fascinating to dig out more information/factors behind their installation motivation. We can also zoom in to much more detailed locations for each installation on this interactive map.

However, some areas, like DC and San Francisco, have a much larger population than other parts of the US, which means there are going to be many more installations. An installation rate per 10,000 people would be much more appropriate. Therefore, I produced another interactive map with the installation rate per 10,000 people: the bigger the circle, the higher the installation rate.
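The normalization is a one-line calculation. Here is a minimal Python sketch with two made-up zip codes (the names and numbers are hypothetical, not taken from the Open PV data) showing how a small place can have a higher rate than a big one.

```python
def installs_per_10k(installations, population):
    """Installation rate per 10,000 residents."""
    return installations * 10_000 / population


# Hypothetical zip codes: (installations, population).
big_city = installs_per_10k(500, 100_000)  # many installations, many people
small_town = installs_per_10k(120, 8_000)  # fewer installations, few people
print(big_city, small_town)
```

The big city has more installations in absolute terms, but the small town has the higher rate, which is why the rate map highlights different places than the count map.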

Solar Installation_screen shot4

The highest installation rate in the country is in the city of Ladera Ranch, located in South Orange County, California. The reason behind it is not clear, though, and more analysis is needed.

Solar Installation_screen shot5

Buckland, MA has the highest installation rate on the East Coast. I can’t explain the motivation behind it yet either. Further analysis of the household characteristics would be helpful. These two interactive maps were uploaded to my GitHub repository, where you will also be able to see the R code I wrote to process the data.

Public Data Sources

To answer these two questions, datasets totalling 1670 MB (1.67 GB) were downloaded and scraped from multiple sources:
(1). Electricity rate by zip codes;

(2). A 10 km resolution solar energy resource map, in ESRI shapefile format, was downloaded from the National Renewable Energy Laboratory (NREL). It was later extracted by zip code polygons downloaded from ESRI ArcGIS Online.

(3). Current solar panel installation data was scraped from the Open PV website, a collection of installations by zip code. It requires registration to access the data. It is part of NREL. The dataset includes the zip code of the installation, the cost, the size of the installation and the state of each location.

(4). Household income, education and population of each zip code were obtained from the US Census.

(5). The average cost of solar installation for each state was scraped from the websites: Current cost of solar panels and Why Solar Energy? More datasets for this proposal will be downloaded from the Department of Energy on GitHub via API.

Note: I cannot guarantee the accuracy of the analysis. My results are based on two days of data mining, wrangling, and analysis. The quality of the analysis is highly dependent on the quality of the data and on how well I understood the datasets in such limited time. Further validation of the analysis and datasets is needed.

To contact the author, please find me on https://geoyi.org or email me: geospatialanalystyi@gmail.com.

Finally got my GitHub account and some other useful resources for RStudio for Git, GitHub

I finally got my portfolio ready for my data science and GIS specialist job search. Many of my friends in data science have suggested that having a GitHub account would be helpful. GitHub is a site that hosts and manages code for programmers globally. It works even better when your colleagues work on the same program with you, because it helps track the code edits that other people contribute to the project.

I’ve started to host some of the code I developed in the past on my GitHub account. I use R and Python for data analysis and data visualization; Python for mapping and GIS work; and HTML, CSS and JavaScript for web application development. I’ve always been curious about how other people’s README files look so much better than my own. BTW, a README file helps other programmers read your files and code more easily. Some of my big data friends also shared this super helpful site that teaches you, step by step, how to use Git to link R and R Markdown in RStudio to GitHub. It’s very easy to understand.

Anyway, shoot me an email at geospatialanalystyi@gmail.com if you need any other instructions on it.

github

Find out your survival rate in Titanic tragedy

I believe all of us have watched the movie Titanic by James Cameron (1997), perhaps more than once. After a good sob, let’s find out whether we could have survived the Titanic. Actually, the Titanic dataset is also a superstar dataset in data science that people use for all sorts of crazy survival machine learning. Today we are going to use R to find out who actually survived and what their age, sex, and social status were.

The sinking of the RMS Titanic occurred on the night of 14 April through to the morning of 15 April 1912 in the north Atlantic Ocean, four days into the ship’s maiden voyage from Southampton to New York City.

titanic boat

(image from google)

  1. What is in the dataset.

We have 1308 passengers in the data. The data includes:

survival: Survival (0 = No; 1 = Yes);

pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd);

name: Name;

sex: Sex;

age: Age;

sibsp: Number of Siblings/Spouses Aboard;

parch: Number of Parents/Children Aboard;

ticket: Ticket Number;

fare: Passenger Fare;

cabin: Cabin;

embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton).

titanic dataset

What the dataset looks like.

2. Running R and packages.

I have uploaded my R code to my GitHub account; find my R code on GitHub.

3. Results.

Rplot01

This graph shows who was on the Titanic: there were more male passengers than female, especially in the third class.

Rplot02

This graph shows the survival comparison. The left graph shows the people who did not survive and the right graph shows the survival counts (how many people survived). The death rate for third-class passengers was super high :-(. Female passengers had a high survival rate, especially in the first class.
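The comparison in the graph boils down to grouping passengers by class and sex and taking the share of survivors in each group. Here is a minimal Python sketch of that grouping (my actual analysis is in R). The six passenger records are made up for illustration; only the field names `survival`, `pclass` and `sex` follow the dataset description above, and `survival_rates` is a hypothetical helper.

```python
from collections import defaultdict


def survival_rates(passengers):
    """Share of survivors for each (pclass, sex) group."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for p in passengers:
        key = (p["pclass"], p["sex"])
        totals[key] += 1
        survivors[key] += p["survival"]  # 1 = survived, 0 = did not
    return {key: survivors[key] / totals[key] for key in totals}


# Made-up records (NOT real passengers), for illustration only.
sample = [
    {"pclass": 1, "sex": "female", "survival": 1},
    {"pclass": 1, "sex": "female", "survival": 1},
    {"pclass": 3, "sex": "male", "survival": 0},
    {"pclass": 3, "sex": "male", "survival": 0},
    {"pclass": 3, "sex": "male", "survival": 1},
    {"pclass": 3, "sex": "male", "survival": 0},
]

rates = survival_rates(sample)
print(rates[(1, "female")], rates[(3, "male")])
```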

Rplot03

This is also a death and survival comparison, but with age added (y-axis). On the question of who survived, you can see that female passengers had the highest survival rate overall, but third-class female survivors tended to be much younger. Now you know that Jack not surviving in the movie Titanic wasn’t just a tragic twist of the plot: there really was a higher risk for him of losing his life in the sinking.

Data visualization is very straightforward, isn’t it? Here is a TED talk I found, ‘The beauty of data visualization’ by David McCandless. It’s really inspiring if you guys are ever interested in data visualization.

A baby step towards my interactive map application using Leaflet JavaScript

Picture1

This is what my national GDP interactive map looks like on localhost. You can watch my first ever video recording of this interactive map on YouTube, with a brief introduction to the map and to the code in HTML, CSS, and JavaScript. This was a very simple example I made to test some of my ideas.

If you remember, in my last blog I presented an interactive map hosted via ESRI ArcGIS Online. After my data was successfully uploaded, I found several issues that I don’t like about it:

  1. Even though ESRI ArcGIS Online has a super nice format that lets you visualize spatial data in a pretty way, data loading from the site is very slow, AND IT COULD BE VERY EXPENSIVE. I am on my 60-day free trial at this point, and I believe that if I want to use the server and do some data analysis on ArcGIS Online I will have to buy their credits;
  2. The way the data is presented is restricted to a certain format, depending on which web map format you select from ESRI.

I use quite a bit of R, and I knew that two R packages, Shiny and Leaflet for R, might help me develop the idea. I was so thrilled to find these packages; I felt a bright light shining on my road and pointing to the destination I want to head to. I also found a perfect example of what my web map application could look like, namely American Super Zipcode. It is not only an interactive map: as you zoom in and out, it also shows some statistical results on the right side of the map. It’s too cool.

But I was also disappointed when I found out that developing a web application through Shiny and Leaflet for R would not be free, because I would still need a server to host my data and app before they could be shared. At this point, however, I only need to test my ideas.

I gave up on the two methods above and even checked out Mapbox Studio and CartoDB, two of the most popular online interactive map and visualization platforms. They are aimed at developers (though you can still use them without a coding background), but I want some features that require coding in JavaScript. The Leaflet JavaScript library is the last and best option I found; it gives me enough freedom to work out the functions/features for my application, and even the interactive analytical tools that I could put on it. Now I also find that D3 might be even more attractive, because it is a bigger JS library that covers not only interactive maps but also other kinds of online interactive data visualization.

I got a lot of help from browsing through some YouTube videos (that’s the reason I recorded a video myself, hoping it could be helpful to another struggling beginner like me). I learned quite a lot of new things, like GeoJSON and geojson-vt. GeoJSON is a geodata format widely used in JavaScript; it plays the role that the shapefile plays for ArcGIS and QGIS. If your dataset is bigger than 1 MB, data loading on your website slows down, so the founder of the Leaflet JS library wrote vector tile JS code (geojson-vt) to speed up the data loading process.
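For anyone who has not seen GeoJSON before, here is a minimal Python sketch that builds a one-feature GeoJSON document with the standard `json` module. The helper `point_feature`, the place name and the `gdp` property are hypothetical and only there for illustration; my map itself is written in JavaScript.

```python
import json


def point_feature(name, lon, lat, **properties):
    """Build a GeoJSON Feature for a point; coordinates are [longitude, latitude]."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"name": name, **properties},
    }


# Hypothetical country marker with a made-up GDP value.
feature = point_feature("Exampleland", 102.5, 17.9, gdp=123.4)
collection = {"type": "FeatureCollection", "features": [feature]}

geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```

The resulting text is plain JSON, so it can be saved to a `.geojson` file and loaded by a web map library. Note that GeoJSON puts longitude before latitude.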

Here is my HTML, CSS and JavaScript code for the application you see in the video. You can also find me and my code on GitHub.

 

My web map application is online

Hi friends,

A web application on rubber cultivation and its risks that I’ve been working on for the Chinese Ministry of Commerce will be out soon, and I just want to share the simplified version of the web map API with you here. I only have the layers here for now, though; more to come.

Screen shot for web map application

Web map application by Zhuangfang Yi: Current rubber cultivation area (ha) in tropical Asia

This web map API aims to tell investors that rubber cultivation is not just about clearing the land/forests, planting trees, and then waiting to tap the trees and sell the latex. There are many more risks in planting/cultivating rubber trees, including several natural disasters and cultural and economic conflicts between foreign investors and host countries.

We also found that the minimum price of rubber latex for livelihood sustainability is as high as 3 USD/kg. I define the minimum price as the price at which an investor/household can cover the costs of establishing and managing their rubber plantations. When the actual rubber price is lower than the minimum price, there is no profit in having rubber plantations. The minimum price for running a rubber plantation varies from country to country. I ran the analysis for 8 countries in Asia: China, Laos, Myanmar, Cambodia, Vietnam, Malaysia and Indonesia. The minimum price depends on the minimum wage, labour availability, the costs of plantation establishment and management, and the average rubber latex productivity throughout the life span of the rubber trees. The cut-off price ranges from 1.2 USD/kg to 3.6 USD/kg.

As an example, if the market rubber price is now 2 USD/kg, investors in a country whose cut-off price is 3 USD/kg won’t make any profit; instead, they lose at least 1 USD for every kg of rubber latex they sell.
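The same example can be written as a tiny calculation. This is a minimal Python sketch; `margin_per_kg` is a hypothetical helper, and the prices are the ones quoted above (a 2 USD/kg market price against the 1.2, 3 and 3.6 USD/kg cut-off prices).

```python
def margin_per_kg(market_price, cutoff_price):
    """Profit (positive) or loss (negative) per kg of latex, in USD."""
    return market_price - cutoff_price


market_price = 2.0  # USD/kg, the example market price above
cutoffs = {"lowest cut-off": 1.2, "example country": 3.0, "highest cut-off": 3.6}

margins = {name: margin_per_kg(market_price, c) for name, c in cutoffs.items()}
for name, margin in margins.items():
    status = "profit" if margin > 0 else "loss"
    print(f"{name}: {margin:+.1f} USD/kg ({status})")
```

At 2 USD/kg only the countries at the low end of the cut-off range still earn a margin; the example country loses 1 USD on every kg sold.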

 

Why “#” is important? Data mining and streaming from Twitter using Python

To be able to extract big data from Twitter, you have to register for the Twitter API.

I installed Python 3.5 and edited my Windows 8.1 environment variables from the advanced system settings. I downloaded Tweepy (for extracting data from Twitter using Python), but Tweepy could not be installed from my computer’s Command Prompt. It reminded me that I have to log in to my computer as the administrator to be able to install Tweepy. Of course, right?! Sometimes you just lose the battle by doing something not very smart. I logged back in to my computer as the administrator and the problem was solved.


Marco Bonzanini has written a full series of 7 blog posts about how to do data mining from Twitter, if you are ever interested in doing big data analysis.

The natural rubber value chain and foreign investments in Thailand: how can we achieve sustainable and responsible rubber cultivation and investment?

I have an opportunity worked for Chinese Ministry of Commerce with ICRAF last fall, and have been studying natural rubber value chain since then. I led four technic reports on natural rubber value chain: the first report is for Thailand natural rubber value chain (please see the title);the second one  is about natural rubber value chain, foreign investments and land conflicts in Cambodia; the third report is the a comparison study between Thailand and Cambodia, the biggest natural rubber producer and the emerging rubber producer; the last report will concentrate on the risks of natural rubber cultivation and investment in Asia, from geosnatially perspectives. As I mentioned in the reports that there are no winner in the natural rubber value chain: we lost biodiversity and ecosystem services from covering natural forests to rubber monoculture (upstream of the value chain); and emitted million tons of polluted air and water, and carbon dioxide back to nature from rubber processing (the midstream); at the end, without sustainable livelihood for the poor who grows rubber; and limited competitiveness in the end products market (the downstream). We should go back the source and really think about how we can improve the whole value chain, and why.

The following is the English abstract of the Thailand report. The reports are currently available only in Chinese; if you are interested in the content, please contact Dr. Zhuang-Fang Yi at geospatialanalystyi@gmail.com or yizhuangfang@mail.kib.ac.cn.

Upper Mekong Region

Figure 1. The Greater Mekong region and the global natural rubber producers.

Asia supplies 93% of global natural rubber demand. As the world's No. 1 natural rubber producer, Thailand exports rubber equal to nearly 40% of global demand, which amounts to 87% of its domestic rubber production. Thailand's production growth depends not only on its biophysical suitability for rubber growing, but also on policy support and subsidies for millions of upstream rubber farmers. Thailand spent about 21.3 billion Baht (586 million USD) from Sep. 2013 to Mar. 2014 subsidizing its rubber farmers when the price of natural rubber went down. However, lacking manufacturing capacity and financial support for the midstream and downstream of the natural rubber value chain, Thailand depends heavily on exporting rubber to other countries, e.g. China, the US, the EU and Japan.

The long history of natural rubber cultivation and support from the Thai government have helped Thai rubber farmers develop a more economically resilient cultivation system: rubber agroforestry. Rubber agroforestry is a rather complex intercropping system compared to rubber monoculture. Rubber monoculture refers to plantations that contain only rubber trees, where other plant species are constantly killed and removed with herbicide and manual clearance. Rubber agroforestry sustains better ecosystem services and also brings higher economic returns. But the labour requirements and the knowledge gap between rubber monoculture and rubber agroforestry are the main constraints on adopting this greener cultivation system: in a monoculture system farmers only need to take intensive care of the rubber trees, whereas rubber agroforestry demands additional knowledge and time. Even so, about 21 intercropping systems are in use and more than 300 farms in Thailand practice intercropped rubber agroforestry, without the kind of official support that rubber monoculture receives. Research and institutional support for rubber agroforestry are urgently needed in Thailand and globally.

The emerging economies and natural rubber producer countries in the Mekong region, e.g. Vietnam, Cambodia, Laos, and Myanmar, are following in Thailand's footsteps while practicing only rubber monoculture; they strongly support the upstream of the value chain but lack rubber manufacturing and supporting financial systems for the midstream and downstream. This leaves them heavily dependent on rubber demand from China and the rest of the world, and gives millions of smallholder rubber farmers very weak economic resilience when the price goes down. In the Chinese market, the rubber price dropped from 6.3 USD/kg to less than a dollar in 2014. China, as the biggest natural rubber importer, consumes nearly 40% of global rubber supply. On the other hand, a 20% import tax is charged, which has dramatically increased the cost of rubber end products and weakened their global competitiveness in the natural rubber market. There is no winner in the natural rubber value chain: we lose biodiversity and ecosystem services by converting natural forests to rubber monoculture (the upstream of the value chain); rubber processing emits millions of tons of polluted air and water, and carbon dioxide, back into nature (the midstream); and in the end the poor who grow rubber are left without a sustainable livelihood, while the end products have limited competitiveness in the market (the downstream). We should go back to the source and really think about how we can improve the whole value chain, and why.

Meanwhile, more and more Chinese state-owned and private enterprises are following the Chinese central government's “Go Global” strategy and have invested heavily outside China. Natural rubber end products, especially the tire industry, are one of the sectors they invest in. In this report, we scrutinized the natural rubber value chain in Thailand and its foreign investments, especially Chinese investments. We tried to answer:

  1. Whether there are rubber cultivation systems that best combine economic returns with better support for ecosystem services;
  2. The relationship between Chinese investors and Thai natural rubber value chain;
  3. The possible ways of sustainable and responsible rubber cultivation and investment.

Coming reports in Chinese

泰国橡胶种植面积.jpg

Figure 2. Thailand, the biggest rubber producer, produces 4.5 million tons of natural rubber, and 80% of Thailand's domestic natural rubber comes from Southern Thailand. Each polygon on the map represents a province; the darker the color, the larger the area of rubber cultivation.

Esri technical certification: preparation tips

I find this super helpful!!!

Alex Tereshenkov

Over the last two years I have passed a bunch of Esri technical certification exams (10 exams, to be precise). Esri did a great job publishing multiple posts on how to prepare for the certification: there is a free training seminar covering the details, some blog posts, and an Esri Australia blog post. Esri has also authored two special instructor-led courses for those who plan to become certified as ArcGIS Desktop Associate and ArcGIS Desktop Professional:
Esri Technical Certification: Skills Review for ArcGIS Desktop Professional
Esri Technical Certification: Skills Review for ArcGIS Desktop Associate

Other useful resources are the free web courses (2 for Desktop and 1 for Server):
Esri Technical Certification: Sample Questions for ArcGIS Desktop Associate

Esri Technical Certification: Sample Questions for ArcGIS Desktop Professional

Esri Technical Certification: Sample Questions for Enterprise Administration Associate

I highly recommend going through sample questions to get an idea of what…
