
[Kaggle] Bike Sharing Demand Data Analysis Practice

by 노아론 2018. 5. 24.
In [ ]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

%matplotlib inline

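plt.style.use('ggplot')
# Draw minus signs as ASCII hyphens so they still render with fonts lacking the Unicode minus glyph
mpl.rcParams['axes.unicode_minus'] = False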
In [10]:
train = pd.read_csv('train.csv', parse_dates=['datetime'])
train.shape
Out[10]:
(10886, 12)
In [11]:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime      10886 non-null datetime64[ns]
season        10886 non-null int64
holiday       10886 non-null int64
workingday    10886 non-null int64
weather       10886 non-null int64
temp          10886 non-null float64
atemp         10886 non-null float64
humidity      10886 non-null int64
windspeed     10886 non-null float64
casual        10886 non-null int64
registered    10886 non-null int64
count         10886 non-null int64
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.6 KB
In [14]:
train.head(20)
Out[14]:
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0000 3 13 16
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0000 8 32 40
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0000 5 27 32
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0000 3 10 13
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0000 0 1 1
5 2011-01-01 05:00:00 1 0 0 2 9.84 12.880 75 6.0032 0 1 1
6 2011-01-01 06:00:00 1 0 0 1 9.02 13.635 80 0.0000 2 0 2
7 2011-01-01 07:00:00 1 0 0 1 8.20 12.880 86 0.0000 1 2 3
8 2011-01-01 08:00:00 1 0 0 1 9.84 14.395 75 0.0000 1 7 8
9 2011-01-01 09:00:00 1 0 0 1 13.12 17.425 76 0.0000 8 6 14
10 2011-01-01 10:00:00 1 0 0 1 15.58 19.695 76 16.9979 12 24 36
11 2011-01-01 11:00:00 1 0 0 1 14.76 16.665 81 19.0012 26 30 56
12 2011-01-01 12:00:00 1 0 0 1 17.22 21.210 77 19.0012 29 55 84
13 2011-01-01 13:00:00 1 0 0 2 18.86 22.725 72 19.9995 47 47 94
14 2011-01-01 14:00:00 1 0 0 2 18.86 22.725 72 19.0012 35 71 106
15 2011-01-01 15:00:00 1 0 0 2 18.04 21.970 77 19.9995 40 70 110
16 2011-01-01 16:00:00 1 0 0 2 17.22 21.210 82 19.9995 41 52 93
17 2011-01-01 17:00:00 1 0 0 2 18.04 21.970 82 19.0012 15 52 67
18 2011-01-01 18:00:00 1 0 0 3 17.22 21.210 88 16.9979 9 26 35
19 2011-01-01 19:00:00 1 0 0 3 17.22 21.210 88 16.9979 6 31 37
In [13]:
train.temp.describe()
Out[13]:
count    10886.00000
mean        20.23086
std          7.79159
min          0.82000
25%         13.94000
50%         20.50000
75%         26.24000
max         41.00000
Name: temp, dtype: float64
In [17]:
train.isnull().sum()
Out[17]:
datetime      0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
casual        0
registered    0
count         0
dtype: int64
In [58]:
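# Pairwise Pearson correlations between the numeric features and the rental counts
corrMatt = train[["temp", "atemp", "casual", "registered", "humidity", "windspeed", "count"]]
corrMatt = corrMatt.corr()
print(corrMatt)

# Boolean mask for sns.heatmap: True hides a cell, so hide the cells above the
# diagonal and keep the diagonal and lower triangle visible
mask = np.triu(np.ones(corrMatt.shape, dtype=bool), k=1)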
                temp     atemp    casual  registered  humidity  windspeed     count
temp        1.000000  0.984948  0.467097    0.318571 -0.064949  -0.017852  0.394454
atemp       0.984948  1.000000  0.462067    0.314635 -0.043536  -0.057473  0.389784
casual      0.467097  0.462067  1.000000    0.497250 -0.348187   0.092276  0.690414
registered  0.318571  0.314635  0.497250    1.000000 -0.265458   0.091052  0.970948
humidity   -0.064949 -0.043536 -0.348187   -0.265458  1.000000  -0.318607 -0.317371
windspeed  -0.017852 -0.057473  0.092276    0.091052 -0.318607   1.000000  0.101369
count       0.394454  0.389784  0.690414    0.970948 -0.317371   0.101369  1.000000
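The 0.97 correlation between registered and count looks less surprising once you notice that, in the rows printed by train.head(20), count equals casual + registered. Below is a minimal sketch that checks this on the first five rows copied from that output. The name `sample` is hypothetical and used only for illustration; on the full data the same comparison would presumably be run on `train` itself.

```python
import pandas as pd

# First five rows of casual, registered and count, copied from the train.head(20) output.
# `sample` is an illustrative name, not part of the original notebook.
sample = pd.DataFrame({
    "casual":     [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "count":      [16, 40, 32, 13, 1],
})

# Does count equal the sum of the two rider types on every sampled row?
is_sum = bool((sample["casual"] + sample["registered"] == sample["count"]).all())
print(is_sum)
```

If the same relationship holds for the whole table, casual and registered are components of the target rather than independent predictors, which is worth keeping in mind when reading their correlations with count.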
In [64]:
fig, ax = plt.subplots()
fig.set_size_inches(20, 10)
# vmax caps the color scale at 0.8; annot=True prints each coefficient in its cell
sns.heatmap(corrMatt, mask=mask, vmax=.8, square=True, annot=True, ax=ax)
Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x241d8f77588>
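To see what the mask argument does, here is a small sketch on toy data (the frame `toy` and its columns a to d are made up for illustration, not taken from the Kaggle data). seaborn's heatmap hides every cell where the mask is True, so one way to show each pair only once is to mark the cells strictly above the diagonal. This uses np.triu, which is an alternative to the np.tril_indices_from approach and not necessarily how the notebook built its mask.

```python
import numpy as np
import pandas as pd

# Toy data: 50 random rows, 4 columns (hypothetical, for illustration only)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
corr = toy.corr()

# True marks the cells to hide: everything strictly above the diagonal (k=1)
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Preview which cells would remain: masked cells become NaN
shown = corr.mask(mask)
print(shown.round(2))
```

Passing this `mask` to sns.heatmap(corr, mask=mask, annot=True) should leave only the diagonal and the lower triangle drawn.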
