VIII. Dimensionality Reduction

Filter

A straightforward way to reduce the number of elements shown.

In an interactive vis context, filtering is often accomplished through dynamic queries.
Filtering can be applied to both items and attributes.
- Item Filtering: Eliminate items based on their values with respect to specific attributes. Fewer items are shown, but the number of attributes shown does not change.
- Attribute Filtering: The goal is to eliminate attributes rather than items; that is, to show the same number of items, but fewer attributes for each item.
According to what? Any function that partitions the dataset into two sets:
- Attribute value bigger/smaller than X
- Noise/Signal
Pro and Con
- Pro: straightforward and intuitive to understand and compute
- Con: "out of sight, out of mind": filtered-out elements are easily forgotten
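
The two kinds of filtering can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the toy table and the helper names `filter_items` and `filter_attributes` are hypothetical. The item filter returns the hidden items as well, so an interface could at least report how many were removed.

```python
# Toy table: each item is a dict of attribute -> value (made-up data).
items = [
    {"name": "a", "price": 12.0, "weight": 3.1, "rating": 4},
    {"name": "b", "price": 55.0, "weight": 0.4, "rating": 2},
    {"name": "c", "price": 30.0, "weight": 1.7, "rating": 5},
    {"name": "d", "price": 8.5, "weight": 9.9, "rating": 3},
]


def filter_items(table, predicate):
    """Item filtering: partition the items into (kept, hidden)."""
    kept = [row for row in table if predicate(row)]
    hidden = [row for row in table if not predicate(row)]
    return kept, hidden


def filter_attributes(table, attributes):
    """Attribute filtering: keep every item, but only the named attributes."""
    return [{key: row[key] for key in attributes} for row in table]


# Item filter: "attribute value smaller than X", here price < 20.
kept, hidden = filter_items(items, lambda row: row["price"] < 20)

# Attribute filter: the same four items, with two attributes each.
slim = filter_attributes(items, ["name", "price"])

print([row["name"] for row in kept])
print(len(hidden), "items hidden")
print(slim[0])
```

In a dynamic query, the threshold (20 here) would be bound to a slider and the filter re-run on every change.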

<aside> ⚠️

Problems:

Filtering can exclude valuable data and so narrow the analysis. It is intuitive and easy to implement, but users may struggle to choose meaningful filter ranges, especially when working with an unfamiliar dataset.
Once data is filtered out, it is "out of sight, out of mind": you lose visibility of data you might need later, which can affect the insights you draw.

→ Solution: A dynamic querying system can be implemented where the visual encoding and interaction are tightly coupled

</aside>

Statistical Aggregation

A group of elements is represented by a new derived element that stands in for the entire group. A very simple aggregate is the average; the four other basic aggregation operators are minimum, maximum, count, and sum.

As with filtering, aggregation can be used for both items and attributes.
Pro and Con
- Pro: inform about whole set
- Con: difficult to avoid losing signal
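
As a small sketch of item aggregation, the snippet below replaces each group of rows with one derived element holding the five basic aggregates. The rows are made-up data and `aggregate` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean

# (group key, value) pairs; the 95 in "south" is a deliberate outlier.
rows = [
    ("north", 4), ("north", 6), ("north", 5),
    ("south", 5), ("south", 95), ("south", 5),
]


def aggregate(pairs):
    """Replace each group of values with one derived summary element."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {
        key: {
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
        for key, values in groups.items()
    }


summary = aggregate(rows)
for key, stats in summary.items():
    print(key, stats)
```

The mean of the "south" group matches none of its items, a small instance of the lost signal noted in the Con above; showing min and max next to the mean is one way to hint at it.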

<aside> ⚠️

Problems:

Aggregation merges multiple elements into a new derived element, which can result in the loss of signal (important information) during summarization.
Specific patterns or outliers can be hidden, leading to potential misinterpretations of the data.

→ Solution: Interactive aggregation allows users to adjust the level of aggregation dynamically, helping to visualize data at different granularities while limiting the loss of signal.

</aside>

Most common aggregation strategies: statistical plotting is a way of reducing the amount of data to be mapped onto graphics primitives while trying to preserve the important information.
- Histograms
- Box Plots
- Violin Plots

Idiom: Histograms - static item aggregation

| Idiom | Histograms |
| --- | --- |
| What: Data | Table |
| What: Derived | New table: keys are bins, values are counts |
| Why: Task | Find distributions |

Bin size is crucial
- Patterns can change dramatically depending on the discretization (how the values are divided into groups)
- Opportunity for interaction: control bin size on the fly (in real time)
- Rules of thumb:
  - bins = $\sqrt{n}$
  - bins = $\log_2(n)+1$
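
The sketch below turns both rules of thumb into bin counts and builds the derived table of counts for equal-width bins. Rounding up to a whole number of bins is an assumption (the rules do not say how to round), and the sample values and function names are made up for illustration.

```python
import math


def bin_count_sqrt(n):
    """Square-root rule, rounded up to a whole number of bins."""
    return math.ceil(math.sqrt(n))


def bin_count_log2(n):
    """log2(n) + 1 rule, rounded up to a whole number of bins."""
    return math.ceil(math.log2(n)) + 1


def histogram(values, bins):
    """Counts per equal-width bin; the maximum value goes in the last bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against all-equal values
    counts = [0] * bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


# Made-up sample with two clumps (around 3 and around 11).
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9, 10, 10, 11, 11, 11, 12]

for bins in (2, bin_count_sqrt(len(values)), bin_count_log2(len(values))):
    print(bins, histogram(values, bins))
```

Running the loop with different bin counts shows how much the visible pattern depends on the discretization, which is why bin size is a natural control to expose interactively.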

Idiom: Box Plots - static item aggregation

| Idiom | Box Plots |
| --- | --- |
| What: Data | Table |
| What: Derived | 5 quantitative attributes: median (central line), lower and upper quartiles (box), lower and upper fences (whiskers); outliers beyond the fence cutoffs are shown explicitly |
| Why: Task | Find distributions |
| Scale | Unlimited number of items |

Good for normally distributed data
Bad for non-normal distributions
Really bad for bimodal or multimodal distributions
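
The derived attributes of a box plot can be computed as in the sketch below. Two details are assumptions, since the notes do not fix them: quartiles use linear interpolation, and the fence cutoffs sit 1.5 interquartile ranges beyond the quartiles (a common convention). Whiskers are drawn to the most extreme values inside the cutoffs. The data and function names are illustrative.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of already sorted data, 0 <= q <= 1."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (
        sorted_values[upper] - sorted_values[lower]
    )


def box_plot_summary(values, k=1.5):
    """Median, quartiles, whisker ends and outliers (k * IQR cutoffs)."""
    data = sorted(values)
    q1, median, q3 = (quantile(data, q) for q in (0.25, 0.5, 0.75))
    iqr = q3 - q1
    low_cut, high_cut = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in data if low_cut <= v <= high_cut]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "lower_whisker": inside[0],
        "upper_whisker": inside[-1],
        "outliers": [v for v in data if v < low_cut or v > high_cut],
    }


skewed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]            # one far-away value
bimodal = [1, 1, 2, 2, 2, 10, 10, 10, 11, 11]       # two clumps, empty middle

print(box_plot_summary(skewed))
print(box_plot_summary(bimodal))
```

For the bimodal sample the median falls in the empty gap between the two clumps, which illustrates why box plots summarize multimodal data poorly.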

Idiom: Violin Plots

Combine the features of box plots and probability density functions, showing both the summary statistics and the shape of the distribution.

Idiom: Density plots

aka kernel density plots, kernel density estimation (KDE)
- Smoothed, continuous version of a histogram, estimated from the data
- A continuous curve (the kernel, usually a Gaussian bell curve) is drawn at each data point
- The curves are added together to give a single smooth density estimate
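
The three steps above can be sketched directly: one Gaussian kernel per data point, summed and normalized. The `bandwidth` parameter (the width of each kernel) is an assumption not discussed in the notes; it plays the same role as bin size in a histogram. Sample values and names are illustrative.

```python
import math


def gaussian_kernel(u):
    """Standard normal density evaluated at u."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kde(samples, bandwidth):
    """Return a density function: the average of one kernel per sample."""
    n = len(samples)

    def density(x):
        total = sum(gaussian_kernel((x - s) / bandwidth) for s in samples)
        return total / (n * bandwidth)

    return density


# Made-up sample with two clumps.
samples = [1.0, 1.5, 2.0, 6.0, 6.5, 7.0]
density = kde(samples, bandwidth=0.5)

# Evaluate the smooth curve on a grid, as a plotting routine would.
grid = [i * 0.5 for i in range(0, 17)]
for x in grid:
    print(f"{x:4.1f} {'#' * round(density(x) * 60)}")
```

A larger bandwidth merges the two bumps into one; a smaller one produces a spike at every data point.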

Idiom: Continuous scatterplots

2D density can be represented with the continuous scatterplot idiom: a density map, derived from a scatterplot, in which density is color coded.
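
As a rough illustration of the idea of a color-coded 2D density, the sketch below counts scatterplot points per grid cell and maps each count to a character shade. This binned grid is only a coarse stand-in and should not be read as the continuous scatterplot construction itself; the points, grid size, and character ramp are all made up.

```python
def density_grid(points, nx, ny):
    """Count points per cell of an nx-by-ny grid over the data extent."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, y0 = min(xs), min(ys)
    x_span = (max(xs) - x0) or 1.0  # guard against zero extent
    y_span = (max(ys) - y0) or 1.0
    grid = [[0] * nx for _ in range(ny)]
    for x, y in points:
        col = min(int((x - x0) / x_span * nx), nx - 1)
        row = min(int((y - y0) / y_span * ny), ny - 1)
        grid[row][col] += 1
    return grid


def to_shades(grid, ramp=" .:*#"):
    """Map each count to a character; denser cells get darker characters."""
    peak = max(max(row) for row in grid)
    top = len(ramp) - 1
    return ["".join(ramp[(c * top) // peak] for c in row) for row in grid]


points = [(0, 0), (0.1, 0.1), (0.2, 0.1), (0.1, 0.2),
          (1, 1), (0.9, 0.9), (0.5, 0.5)]

grid = density_grid(points, 2, 2)
shades = to_shades(grid)
for line in shades:
    print(repr(line))
```

In a real plot the character ramp would be replaced by a sequential colormap.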

Clustering

Classification of items into similar bins, not using pre-existing categories
- Based on a similarity measure, e.g. Euclidean distance or Pearson correlation
Partitioning algorithms
- Divide data into set of bins
- Number of bins (k) set manually or automatically
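
As one concrete partitioning algorithm, the sketch below uses k-means with Euclidean distance. The notes do not name a specific algorithm, so k-means is chosen here purely for illustration; k is set manually through the number of starting centroids, and the points and starting centroids are made up.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its cluster, and repeat."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
start = [(1, 1), (8, 8)]  # k = 2, chosen by hand

centroids, clusters = kmeans(points, start)
print(centroids)
print(clusters)
```

Each resulting cluster can then be drawn as a single aggregate mark, for example at its centroid.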
Hierarchical algorithms
- Produce a "similarity tree" (dendrogram): the cluster hierarchy
- Agglomerative clustering: start with each node as its own cluster, then iteratively merge
Cluster hierarchy: derived data used with many dynamic aggregation idioms
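
A minimal sketch of agglomerative clustering follows. It assumes single linkage (the distance between two clusters is the distance between their closest members) and Euclidean distance; the notes do not specify a linkage rule. The list of merges is the dendrogram, and the hypothetical `cut` helper shows how one hierarchy yields different levels of aggregation.

```python
import math
from itertools import combinations


def single_linkage(a, b, points):
    """Distance between the closest members of clusters a and b."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)


def agglomerate(points):
    """Start with one cluster per point and merge the closest pair until
    one cluster remains. Returns the merges as (a, b, distance)."""
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(pair[0], pair[1], points),
        )
        merges.append((a, b, single_linkage(a, b, points)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


def cut(n_points, merges, max_distance):
    """Clusters obtained by applying only merges up to max_distance."""
    clusters = {(i,) for i in range(n_points)}
    for a, b, distance in merges:
        if distance <= max_distance:
            clusters -= {a, b}
            clusters.add(a + b)
    return sorted(clusters)


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
merges = agglomerate(points)

print(cut(len(points), merges, 2))    # fine-grained view
print(cut(len(points), merges, 10))   # coarser view
```

Binding `max_distance` to a slider is one way to drive dynamic aggregation from the derived hierarchy.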