Industrial Data Science
IDS_Week 2: Data Visualization
엉덩이싸움
2024. 10. 10. 21:37
EDA, Exploratory Data Analysis
: The process of combining various dimensions and values to uncover anomalies or meaningful facts, working toward the final goal of the analysis
1) Verify expected relationships
2) Find unexpected structure in the data
3) Deliver data-driven insights by asking the right questions, without biasing the investigation
4) Provide the context around the problem
Main topics in EDA: emphasis on resistance / calculation of residuals / re-expression of data variables / revelation through graphs
Graphs
Graphs are used for data exploration (visualization)
Basic Plots
- Line graphs: line graphs are used for time series
- Bar charts: bar charts are used for categorical variables
- Scatterplots: scatterplots display the relationship between two numerical variables
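The three basic plots above can be sketched side by side. This is a minimal illustration assuming matplotlib is installed; all data values and variable names (`ridership`, `avg_price`, `rooms`, `price`) are made up for the example and do not come from the lecture.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical toy data, invented for illustration only.
months = [1, 2, 3, 4, 5, 6]
ridership = [120, 135, 128, 150, 162, 158]      # time series
categories = ["A", "B", "C"]
avg_price = [21.5, 34.0, 27.2]                  # one value per category
rooms = [4.2, 5.1, 5.8, 6.3, 6.9, 7.4]
price = [14.0, 18.5, 21.0, 24.8, 30.1, 36.2]    # two numerical variables

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Line graph: values ordered in time
axes[0].plot(months, ridership)
axes[0].set_title("Line graph (time series)")

# Bar chart: one bar per category
axes[1].bar(categories, avg_price)
axes[1].set_title("Bar chart (categorical)")

# Scatterplot: relationship between two numerical variables
axes[2].scatter(rooms, price)
axes[2].set_title("Scatterplot (numerical vs numerical)")

fig.tight_layout()
```

To look at the result, call `fig.savefig("basic_plots.png")` or, in an interactive session without the `Agg` line, `plt.show()`.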
Distribution Plots
: Distribution plots display 'how many' of each value occur in a data set
or, for continuous data or data with many possible values, how many values fall into each of a series of ranges (bins)
- Boxplots:
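Top outliers are defined as those above Q3 + 1.5(Q3 - Q1)
Bottom outliers are defined as those below Q1 - 1.5(Q3 - Q1)
max = maximum of non-outliers
min = minimum of non-outliers
IQR, Interquartile Range = Q3 - Q1
A value is flagged as an outlier if it lies more than about 1.5 times the IQR below Q1 or above Q3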
Side-by-side boxplots are useful for comparing subgroups
- Histogram: Histogram shows the distribution of the outcome variable
- Heat Maps: Color conveys information.
Heat maps are used to visualize correlations and missing data in data mining.
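The numbers behind a boxplot and a histogram can be worked out by hand. The sketch below uses only the Python standard library; the function names `boxplot_summary` and `histogram_counts` and the sample values are hypothetical. It computes quartiles with the "inclusive" method, which is an assumption, since plotting tools may use slightly different quartile conventions.

```python
from statistics import quantiles

def boxplot_summary(values):
    """Return the quantities a boxplot draws, using the 1.5 * IQR rule."""
    # Assumption: 'inclusive' quartiles; other tools may differ slightly.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_min": min(inside),   # minimum of non-outliers
        "whisker_max": max(inside),   # maximum of non-outliers
        "outliers": outliers,
    }

def histogram_counts(values, bin_width, start=0):
    """Count how many values fall in each bin [left, left + bin_width)."""
    counts = {}
    for v in values:
        k = int((v - start) // bin_width)   # index of the bin containing v
        left = start + k * bin_width        # left edge of that bin
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical sample with one large value
sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]
summary = boxplot_summary(sample)
bins = histogram_counts(sample, bin_width=5)
```

For this sample Q1 = 4 and Q3 = 8, so IQR = 4 and the fences sit at -2 and 14. The value 30 is the only outlier, and the whiskers end at 2 and 9. With a bin width of 5, the histogram counts are 3 values in [0, 5), 5 values in [5, 10), and 1 value in [30, 35).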
Multidimensional Visualization
- Scatterplot with color added
- Matrix Plot: Matrix shows scatterplots for variable pairs
- Rescaling to log scale
- Aggregation
- Scatterplot with labels
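Two of the techniques in this list, rescaling to a log scale and aggregation, are data transformations applied before plotting, so they can be shown without any plotting library. The sketch below is a minimal illustration; the `records` tuples (region, rate, value) are hypothetical data invented for the example.

```python
import math
from collections import defaultdict
from statistics import mean

# Hypothetical records: (region, rate, value)
records = [
    ("north", 0.01, 30.0),
    ("north", 1.0, 24.0),
    ("south", 10.0, 15.0),
    ("south", 100.0, 9.0),
]

# Rescaling to log scale: rates spanning four orders of magnitude
# become evenly comparable values, so small rates are not crowded
# together at one end of the axis.
log_rates = [math.log10(rate) for _, rate, _ in records]

# Aggregation: collapse individual records to one summary per group
# (here, the mean value per region) before plotting.
grouped = defaultdict(list)
for region, _, value in records:
    grouped[region].append(value)
avg_value = {region: mean(vals) for region, vals in grouped.items()}
```

Here the raw rates range from 0.01 to 100, while their base-10 logs range from -2 to 2. The aggregated result has one mean per region (27.0 for north, 12.0 for south), which would typically be drawn as a bar chart.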