Bike sharing system is a convenient and clean way to get around the cities through obtaining membership, rental and bike return. It’s getting popular among high populated cities globally.
(this picture is from http://kaggle.com)
While I was in Chicago for a business trip, my favourite activity was riding the rental bike around the Lake Michigan. In a bright autumn afternoon with few cloud and the air cools down, it’s definitely a perfect time to rent a bike to explore an unknown city as a tourist. However, it’s quite frustrated while millions of tourists are renting the bikes. It’s unpleasant if there are no bike left and also too many bikers sharing a tiny bike lane. To find a nice fall afternoon and not many people on the road make a perfect sense while you’re a tourist in an exciting city you wanna explore.
As a city bike sharing manager, you wanna share the as many bikes as with potential riders, and I am sure you will have such concerns:
- How many bikes are actually needed in the city bike sharing system? 我们的城市自行车租用系统到底需要多少自行车？
- If the bike demand varies every day according to the temp, weather, holiday, and humidity?每一天自行车的租用情况随着气温，天气，假期和空气湿度等等的变化是如何变化的？
It will be most cost-efficient that the city won’t provide too many bikes and it’s important not to run short.所以对于一个管理者来说，用经济的自行车租用数量最好的服务市民才是最重要的。
Here, I got the data from http://kaggle.com, it about 11,000 records in the dataset. This dataset was provided by Hadi Fanaee Tork using data from Capital Bikeshare. Capital Bikeshare is a bike sharing system in Washington DC that aims to rent a bike for people who are going to Metro, to work, run errands. It has more than 3000 bikes in the system for over 350 stations across Washington DC, Arlington and Alexandria, VA and Montgomery County, MD and it could be returned to any station near your destination. I have not used bikes in DC yet and might be worth to try, it’s free for first 30mins.
这个数据来源是http://kaggle.com，一共有两年11,000个自行车使用情况的记录。原始数据由华府（美国首都华盛顿特区）的华府自行车共享的Hadi Fanaee Tork提供。华府自行车共享系统是针对居民出行的需要（去地铁，去工作，购物）设置的。整个系统有350多个租用站3000多辆自行车，分别分布于华府，佛吉尼亚州的阿灵顿，压力山大港，以及马里兰州的蒙哥马利县各处。 虽然住在华府附近已久但是我自己还没有使用过这个系统里的自行车，听说最开始的30分钟是免费的—值得一试。
Some result from the data analysis
(1-Spring; 2-Summer, 3-Fall; and 4-Winter)
From the graph, we could tell that more people are using bikes during the fall, and least people are biking around during the spring time.从上图可以看出来秋季是人们最喜欢骑自行车但是春季是最不喜欢骑自行车的季节。
Through the year, bike demand starts to climb after Apri and decline after Oct. The demand pick is around Sep at least from 2 years data records.
When I replot the data to 24 hours for working days, from the midnight to 23:00, the pattern of bike demand could be seen as 1) while the temperature rises more people are on bike; 2) there are two peaks of bike demands in a work day, which is morning time around 8am and afternoon around 18am; 3) People like to use bike during the lunch time while the temperature is warmer than 20 degrees.
However, the bike demand looks a bit different when it was a holiday. The maximum of bike demand was not that high compares to the working days, which means residents in DC area are using bikes. The demand for the holiday is more spread out than work days, and it slowly starts after 8 am when the temperature is pleasant, and the demand peak appears around 13:00 to 17:00.
From the above graphs, you might find we only have dug out the bike demands, which is label as “count” in the dataset, together with temperature (mainly). If we wanna make a prediction of how many bikes we actually need for each day, and just imagine that any condition you don’t wanna ride a bike in DC. If I only speak for myself, I don’t wanna ride a bike: 1). When it’s too cold out there (Oops, topical people); 2) too humid; 3) too windy; 4) it’s rainy hard; 5) too many people out there riding bikes.
To make a prediction like a bikshare system manager, we need to know the correlation of each pair of variables, which the pair between each of humidity, weather, workingday, windspeed, hour, holiday and so on. Therefore, I produced a graph to pair out the correlation for each pair of variables. The blue colour represents positive correlation, and red colours mean negative correlation. For example, looking at the column of ‘count'(it’s the bike demand I mentioned above), it has positive correlation with temperature (‘temp’) in the graph, which means when temperature goes up people like to ride bike, but it has negative correlation with humidity (‘humidity in the graph’), which indicates people would not like to ride a bike during a high humidity time. Note: this is a linear regression, which means I just assume each pair of variables is linearly correlated, which could not actually reflect the reality sometimes. For example, I could not bike outside while it’s too hot, but the regression tells that people would love to bike even more while it’s actually hot (with positive correlation).
I ran the linear regression between bike demands and the variables above and had this blowing regression. It will be able to help us to predict how many bikes we actually need in The Capital Bikeshare system each day, according to the weather, holiday, and temperature, etc. As a tourist, you could also predict if you wanna go out today according to the weather prediction and rough prediction of how many bikes are going to be around in the city.
At this point we could make a prediction/assumption: Today, it’s fall now, and holiday; the weather is clear, few clouds; temp is 30, but air temp is about 34; humidity is about 70%; weed seed is about 2, and it’s close to 16:00 pm now. So we could predict how many bikes are needed for the particular hour, day and weather.
The answer is 781 bikes.
My R codes could be found here: http://rpubs.com/Geoyi/BikeshareDC_LM.
Statistics is quite useful, isn’t it?!