Industrial Data Science

IDS_Week 2: Data Visualization

엉덩이싸움 2024. 10. 10. 21:37

EDA, Exploratory Data Analysis

: The process of combining various dimensions and values to uncover anomalies or meaningful facts, working step by step toward the final goal of the analysis

 

1) Verify expected relationships

2) Find unexpected structure in the data

3) Deliver data-driven insights that lead to the right questions without biasing the investigation

4) Provide the context around the problem

Main topics in EDA: emphasis on resistance (robustness to outliers) / residual calculation / re-expression of variables / revelation through graphs
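
Resistance and residuals can be illustrated with a few numbers. The sketch below is only an illustration: the two lists are made-up data, not from the lecture, and the median is used as one possible resistant summary. It compares how far the mean and the median move when a single extreme value is introduced, then computes residuals as deviations from the median.

```python
import statistics

# Made-up illustrative data: the second list differs only in its last value
clean = [10, 11, 12, 13, 14]
contaminated = [10, 11, 12, 13, 140]

# Resistance: how much does each summary move when one value becomes extreme?
mean_shift = statistics.mean(contaminated) - statistics.mean(clean)
median_shift = statistics.median(contaminated) - statistics.median(clean)

# Residuals: what is left of each value after subtracting the fitted center
center = statistics.median(contaminated)
residuals = [v - center for v in contaminated]

print("mean shift:", mean_shift)
print("median shift:", median_shift)
print("residuals:", residuals)
```

The median does not move at all here while the mean shifts substantially, and the extreme value stands out clearly in the residuals.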

 

 

 Graphs

Graphs are used for data exploration (visualization).

 Basic Plots

  • Line graphs: used for time series
  • Bar charts: used for categorical variables
  • Scatterplots: display the relationship between two numerical variables
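
The three basic plots can be sketched with matplotlib as follows. This is a minimal example under assumptions: the lecture does not name a plotting library, and all data values and variable names below are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Made-up illustrative data (not from the lecture)
months = [1, 2, 3, 4, 5, 6]          # time index for the line graph
sales = [10, 12, 9, 14, 15, 13]
categories = ["A", "B", "C"]         # categorical variable for the bar chart
counts = [5, 8, 3]
x = [1.0, 2.0, 3.0, 4.0, 5.0]        # two numerical variables for the scatterplot
y = [2.1, 3.9, 6.2, 8.1, 9.8]

fig, (ax_line, ax_bar, ax_scatter) = plt.subplots(1, 3, figsize=(12, 3.5))

# Line graph: a value followed over time
ax_line.plot(months, sales)
ax_line.set_title("Line graph (time series)")

# Bar chart: one bar per category
ax_bar.bar(categories, counts)
ax_bar.set_title("Bar chart (categorical)")

# Scatterplot: relationship between two numerical variables
points = ax_scatter.scatter(x, y)
ax_scatter.set_title("Scatterplot (two numerical variables)")

fig.tight_layout()
# fig.savefig("basic_plots.png")  # uncomment to write the figure to disk
```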

 Distribution Plots

: Distribution plots display 'how many' of each value occur in a data set, or, for continuous data or data with many possible values, how many values fall into each of a series of ranges (bins).

  • Boxplots: 
    IQR, interquartile range = Q3 - Q1
    Top outliers: values above Q3 + 1.5(Q3 - Q1)
    Bottom outliers: values below Q1 - 1.5(Q3 - Q1)
    max = maximum of non-outliers
    min = minimum of non-outliers
    In short, a value more than 1.5 × IQR below Q1 or above Q3 is flagged as an outlier

    Side-by-side boxplots are useful for comparing subgroups
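
The boxplot quantities above can be computed directly. The following is a minimal sketch using only the Python standard library; the function name `boxplot_stats` and the sample data are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since different tools interpolate quartiles slightly differently.

```python
import statistics

def boxplot_stats(values):
    """Return quartiles, IQR, whisker ends and outliers using the 1.5 * IQR rule."""
    data = sorted(values)
    # Assumed quartile convention: linear interpolation over the sample itself
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    non_outliers = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "min": non_outliers[0],    # minimum of non-outliers
        "max": non_outliers[-1],   # maximum of non-outliers
        "bottom_outliers": [v for v in data if v < lower_fence],
        "top_outliers": [v for v in data if v > upper_fence],
    }

# Made-up sample with one unusually large value
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

For this sample the upper fence is 7.75 + 1.5 × 4.5 = 14.5, so 30 is a top outlier and the upper whisker ends at 9.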

  • Histogram: a histogram shows the distribution of a numerical variable, such as the outcome variable
  • Heat maps: color conveys information.
    In data mining, heat maps are used to visualize correlations and missing data.
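
The numbers behind a histogram and a heat map can be sketched without any plotting library. This is an illustrative sketch only: the helper names (`histogram_counts`, `pearson`, `correlation_matrix`) and all data are hypothetical, equal-width bins are an assumption, and a plotting tool would then map the resulting grids to bar heights or colors.

```python
import math

def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins.

    Assumes the values are not all identical (otherwise the bin width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value goes into the last bin rather than a bin of its own
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Grid of pairwise correlations; a heat map colors each cell by its value."""
    names = list(columns)
    return [[pearson(columns[a], columns[b]) for b in names] for a in names]

# Made-up illustrative data
counts = histogram_counts([1, 2, 2, 3, 3, 3, 4, 4, 5, 10], n_bins=3)
corr = correlation_matrix({"x": [1, 2, 3, 4, 5],
                           "y": [2, 4, 6, 8, 10],
                           "z": [5, 4, 3, 2, 1]})

# Missing-data grid: 1 where a value is missing (None), 0 otherwise
rows = [[1.0, None, 3.0], [None, 2.0, 1.5], [0.5, 0.7, None]]
missing = [[1 if v is None else 0 for v in row] for row in rows]

print(counts)
print(corr)
print(missing)
```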

 

Multidimensional Visualization

  • Scatterplot with color added
  • Matrix plot: a scatterplot matrix shows scatterplots for all variable pairs
  • Rescaling to log scale
  • Aggregation
  • Scatterplot with labels
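
Two of the operations above, rescaling to a log scale and aggregation, are data transformations applied before plotting. The sketch below shows one possible version of each using only the standard library; the function names and data are hypothetical, the base-10 logarithm and the group mean are assumed choices, and the log transform assumes strictly positive values.

```python
import math
from collections import defaultdict

def to_log10(values):
    """Rescale strictly positive values to a base-10 log scale."""
    return [math.log10(v) for v in values]

def aggregate_mean(records):
    """Aggregate (group, value) pairs into one mean value per group."""
    groups = defaultdict(list)
    for group, value in records:
        groups[group].append(value)
    return {group: sum(vals) / len(vals) for group, vals in groups.items()}

# Values spanning several orders of magnitude become evenly spaced on a log scale
logged = to_log10([1, 10, 100, 1000])

# Made-up daily observations aggregated to one point per month
daily = [("2024-01", 10), ("2024-01", 14),
         ("2024-02", 20), ("2024-02", 22), ("2024-02", 24)]
monthly = aggregate_mean(daily)

print(logged)
print(monthly)
```

Log rescaling spreads out points that are crowded near zero, and aggregation reduces many observations to a smaller number of summary points that are easier to read on a plot.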