Rather than pursuing a traditional unsupervised clustering optimization (such as variance maximization via the Elbow Method), this work evaluates the efficacy of density-based clustering algorithms in replicating a physically motivated halo-based group finder.
We aim to identify the optimal hyperparameter configurations that allow these algorithms to recover the underlying dark matter halo membership of galaxies, effectively using the halo-based catalog as a benchmark for physical validity.
## 'data.frame': 639359 obs. of 9 variables:
## $ GAL_ID : int 750 751 752 994 1010 1030 1038 1042 1049 1050 ...
## $ ra : num 38 38.4 38.4 54.5 54.5 ...
## $ dec : num 0.224 0.212 0.211 0.573 0.579 ...
## $ x : num 0.042 0.0417 0.0419 0.0779 0.0726 ...
## $ y : num 0.0329 0.033 0.0331 0.1091 0.1019 ...
## $ z : num 0.000209 0.000197 0.000196 0.00134 0.001264 ...
## $ redshift: num 0.054 0.0538 0.0541 0.1385 0.129 ...
## $ dist : num 0.0534 0.0532 0.0534 0.1341 0.1251 ...
## $ GROUP_ID: int 68171 23701 23701 68172 68173 68174 23702 23702 68175 68176 ...
To assess the outcomes we base our analysis on three basic concepts:
Purity (P): a measure of an output-cluster: the proportion of its members that come from a single true group (its majority group). A high purity gives confidence that the algorithm groups together galaxies that really belong together.
Completeness (C): a measure of an output-cluster: the proportion of the elements of its true group (majority group) that are included in the output-cluster. A cluster is "complete" if it contains all the points of the original true group.
Recovery (R): a much more restrictive measure: the proportion of output-clusters which are both pure and complete. For this study we consider an output-cluster to be pure if P >= 0.66 (at least 2/3 of its elements belong to a single group) and complete if C >= 0.5 (it contains at least half of the members of the original true group).
Some other stats are:
total_in_group: number of elements in a given group.
total_in_cluster: number of elements in a given output-cluster.
total_in_cluster_group: number of elements in a given output-cluster belonging to a majority-group.
undetected_groups: original groups not detected as majority-group in output-clusters.
detected_groups: original groups detected as majority-group in output-clusters.
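The definitions above can be sketched in a few lines of base R. This is only an illustration on made-up labels, not the code in assess.r or calculate_stats.r: the helper name `cluster_quality`, the convention that label 0 means noise, and the toy vectors are all assumptions. Recovery is computed here exactly as stated in the text (fraction of output-clusters that are both pure and complete); the included scripts may weight it differently.

```r
# Hypothetical helper (not the assess.r code): one row per output-cluster.
cluster_quality <- function(cluster_id, group_id,
                            purity_min = 0.66, completeness_min = 0.5) {
  # True-group sizes are counted over all points, noise included.
  tab <- table(group_id)
  group_size <- setNames(as.vector(tab), names(tab))

  keep <- cluster_id != 0                 # assumption: label 0 means noise
  rows <- lapply(split(group_id[keep], cluster_id[keep]), function(g) {
    counts <- table(g)
    majority <- names(counts)[which.max(counts)]   # ties: first label wins
    data.frame(
      group_id = majority,
      total_in_group = as.integer(group_size[majority]),
      total_in_cluster = length(g),
      total_in_cluster_group = as.integer(max(counts))
    )
  })
  out <- do.call(rbind, rows)
  out$cluster_id <- names(rows)
  out$purity <- out$total_in_cluster_group / out$total_in_cluster
  out$completeness <- out$total_in_cluster_group / out$total_in_group
  out$is_pure <- out$purity >= purity_min
  out$is_complete <- out$completeness >= completeness_min
  out
}

# Toy labels: 3 output-clusters (0 = noise) over 3 true groups.
cluster_id <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 0)
group_id   <- c(7, 7, 7, 8, 8, 8, 8, 9, 7, 9)

q <- cluster_quality(cluster_id, group_id)
recovery <- mean(q$is_pure & q$is_complete)
print(q)
print(recovery)
```

In this toy case the first two clusters pass both thresholds and the third passes neither, so two of the three output-clusters count towards recovery.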
The following code assesses the outcomes obtained from a given output-cluster:
sys.source(file.path(includes_folder, "assess.r"), envir = knitr::knit_global())
The following code calculates the stats:
sys.source(file.path(includes_folder, "calculate_stats.r"), envir = knitr::knit_global())
We wrote some functions to help us visualize the results graphically:
sys.source(file.path(includes_folder, "plotting_functions.r"), envir = knitr::knit_global())
The last file to include contains the function that computes the sLOS distance:
sys.source(file.path(includes_folder, "slos.r"), envir = knitr::knit_global())
The data file goes through an initial preprocessing step to obtain proper distances and the Cartesian coordinates x, y, z; that step is omitted here.
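Although the preprocessing is not shown, the values displayed in the structure listing are consistent with the usual spherical-to-Cartesian conversion, with ra and dec in degrees. A minimal sketch under that assumption follows; the helper name `to_cartesian` is hypothetical, and how dist is derived from redshift is not shown in this document and is not assumed here.

```r
# Hypothetical helper: ra and dec in degrees, dist in the catalog's own units.
to_cartesian <- function(ra, dec, dist) {
  ra_rad  <- ra  * pi / 180
  dec_rad <- dec * pi / 180
  data.frame(
    x = dist * cos(dec_rad) * cos(ra_rad),
    y = dist * cos(dec_rad) * sin(ra_rad),
    z = dist * sin(dec_rad)
  )
}

# First row of the listing, using the rounded values as displayed
# (ra = 38, dec = 0.224, dist = 0.0534).
xyz <- to_cartesian(ra = 38, dec = 0.224, dist = 0.0534)
print(xyz)
```

The result agrees with the displayed x, y, z of that row up to the rounding of the printed inputs.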
Take a sample bounded in RA, DEC and redshift, with a minimum group size min_members = 5 (also used later as minPts):
\[RA \in [90, 150]^{\circ},\, DEC \in [15, 25]^{\circ},\, \,and\,\, min\_redshift \lt z \lt max\_redshift\]
Then we set the initial values and take a look at the resulting redshift distribution:
SLOS <- 0.4 #Here the slos is fixed beforehand
min_members <- 5
ra_lim_inf <- 90
ra_lim_sup <- 150
dec_lim_inf <- 15
dec_lim_sup <- 25
max_redshift <- 0.08
min_redshift <- 0.025
# Take a sample using boundaries
mm <- dt[ dt$ra<= ra_lim_sup & dt$ra>=ra_lim_inf &
dt$dec>= dec_lim_inf & dt$dec<=dec_lim_sup &
dt$redshift< max_redshift & dt$redshift> min_redshift,]
# mm is an object containing both groups and galaxy identification
ggplot(mm, aes(x = "", y = redshift)) + geom_violin()
dim(mm)
## [1] 7229 9
Select the groups with at least min_members members using SQL queries:
h<-sqldf("select
count(GAL_ID) as members,
GROUP_ID
from
mm
group by
GROUP_ID
order by
members desc")
mm5<-sqldf(sprintf("
SELECT
mm.GAL_ID,
mm.x,
mm.y,
mm.z,
mm.GROUP_ID,
mm.redshift,
mm.dist
FROM
mm as mm, h
where
mm.GROUP_ID=h.GROUP_ID and
h.members >= %s"
, min_members))
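The same count-then-filter logic can be written in base R without sqldf. The sketch below runs on a made-up toy table rather than on mm; only the column names GAL_ID and GROUP_ID come from the catalog, and `min_members_toy` is a hypothetical stand-in for min_members.

```r
# Toy stand-in for mm: the values are made up for illustration.
toy <- data.frame(
  GAL_ID   = 1:8,
  GROUP_ID = c(10, 10, 10, 20, 20, 30, 10, 20)
)
min_members_toy <- 3

# Size of each galaxy's group, aligned row by row with toy
# (the role played by the join against h in the SQL version).
members <- ave(toy$GAL_ID, toy$GROUP_ID, FUN = length)

# Keep only galaxies whose group has at least min_members_toy members.
toy_big <- toy[members >= min_members_toy, ]

print(table(toy$GROUP_ID))
print(toy_big)
```

Note that the filter uses `>=`, matching the `h.members >= %s` condition of the query.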
Then use it to find the target data:
true_groups <- length(unique(mm5$GROUP_ID))
number_non_isolated_galaxies <- dim(mm5)[1]
number_isolated_galaxies <- dim(mm)[1] - dim(mm5)[1]
print(
sprintf('Number of galaxies in groups with more than %s elements %s out of %s, aprox %s percent',
min_members,
number_non_isolated_galaxies,
dim(mm)[1], format(number_non_isolated_galaxies * 100
/dim(mm)[1], digits=4)))
## [1] "Number of galaxies in groups with more than 5 elements 1237 out of 7229, aprox 17.11 percent"
print(sprintf("Number of groups with more than %s members: %s",
min_members,
true_groups))
## [1] "Number of groups with more than 5 members: 95"
We take a look at the groups with at least min_members members:
hhh<- h[h$members>=min_members, ]
ttable <- table(hhh$GROUP_ID, hhh$members)
barplot(ttable, col=('red'),
main=sprintf("Group distribution (%s) with at least %s members ",
true_groups,
min_members))
boxplot(hhh$members, main="Boxplot of group sizes")
Let's take a look at the complete target sample:
plot3d(mm$x, mm$y, mm$z, col = 'black',
size = 1, xlab = "X", ylab = "Y", zlab = "Z")
aa <- sqldf("select
GAL_ID,
x,
y,
z,
case
when group_id IN (Select GROUP_ID from mm5) then group_id
else 0
end as group_id,
redshift,
dist
from
mm")
plot3d(aa$x, aa$y, aa$z, col = aa$group_id+1,
size = 2, xlab = "X", ylab = "Y", zlab = "Z")
And at the groups with at least min_members members:
plot3d(mm5$x, mm5$y, mm5$z, col = mm5$GROUP_ID,
size = 2, xlab = "X", ylab = "Y", zlab = "Z")
We first process the data without any kind of scaling or normalization.
OPTICS clustering:
points<- mm[,c('x', 'y', 'z')]
#clustering
res <- optics(points, minPts = min_members)
plot(res)
In the previous plot we can see how OPTICS models the clusters as valleys and the separations between clusters as peaks.
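The valley-and-peak reading can be made concrete with a threshold cut of the reachability sequence, which is the idea behind a DBSCAN-style extraction. The sketch below follows the ExtractDBSCAN procedure from the OPTICS literature as we understand it; it is not the dbscan package's implementation, and the function name and the toy reachability and core-distance vectors are made up.

```r
# Hypothetical re-implementation for illustration only.
# reachdist and coredist are given in OPTICS processing order.
extract_dbscan_sketch <- function(reachdist, coredist, eps_cl) {
  cluster <- integer(length(reachdist))
  current <- 0L
  for (i in seq_along(reachdist)) {
    if (reachdist[i] > eps_cl) {
      # A peak: the point is not reachable from the previous ones.
      if (coredist[i] <= eps_cl) {
        current <- current + 1L      # dense enough: it opens a new cluster
        cluster[i] <- current
      } else {
        cluster[i] <- 0L             # otherwise it is noise
      }
    } else {
      cluster[i] <- current          # inside a valley: same cluster
    }
  }
  cluster
}

# Toy reachability plot: two valleys separated by a peak, then an outlier.
reach <- c(Inf, 0.10, 0.10, 0.90, 0.20, 0.15, 1.20)
core  <- c(0.10, 0.10, 0.12, 0.20, 0.15, 0.20, 1.50)

cut_low  <- extract_dbscan_sketch(reach, core, eps_cl = 0.5)
cut_high <- extract_dbscan_sketch(reach, core, eps_cl = 1.0)
print(cut_low)
print(cut_high)
```

With the lower threshold the peak at 0.90 separates two clusters; with the higher one the two valleys merge into a single cluster. This is the mechanism by which a larger eps tends to raise completeness and lower purity.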
Execute with \(\xi=0.15\):
optics <- extractXi(res, xi=0.15)
plot(optics)
Plot the clustering obtained:
plot3D_cluster(optics, mm)
Stats for this clustering:
mm$cluster_id <- as.numeric(optics$cluster)
mm5 <- get_elements_in_m5_groups(mm)
all <- execute_stats(mm, optics)
print_stats(all)
## [1] "Mean purity 0.424601883018884"
## [1] "Mean completness 0.851048303520381"
## [1] "Sum. recovery 0.0800323362974939"
## [1] "Undetected groups 53 out of 95"
## [1] "Detected pure and complete groups 13 out of 95"
## [1] "Detected real groups 51 out of 95"
This algorithm does not offer good results in cluster detection.
We can directly apply extractDBSCAN on the OPTICS model.
blo_scan <- extractDBSCAN(res, eps_cl = 0.00075)
mm$cluster_id <- blo_scan$cluster
mm5 <- get_elements_in_m5_groups(mm)
plot(blo_scan)
We have, on the one hand, all groups with at least min_members members (made up of a small fraction of the galaxies in the catalog) and, on the other hand, the output-clusters from DBSCAN:
plot3D_cluster(blo_scan, points)
all <- execute_stats(mm, blo_scan)
head(all, 5)
## cluster_id group_id total_in_group total_in_cluster total_in_cluster_group
## 1 1 229735 1 7 1
## 2 3 6818 5 27 5
## 3 9 36542 2 5 2
## 4 8 667 4 4 3
## 5 184 442 30 107 29
## purity completn spurious bad_class is_pur is_comp recovery is_real
## 1 0.1428571 1.0000000 6 0 0 1 0.00000000 0
## 2 0.1851852 1.0000000 22 0 0 1 0.00000000 1
## 3 0.4000000 1.0000000 3 0 0 1 0.00000000 0
## 4 0.7500000 0.7500000 1 1 1 1 0.00323363 0
## 5 0.2710280 0.9666667 78 1 0 1 0.00000000 1
print_stats(all)
## [1] "Mean purity 0.331716556063967"
## [1] "Mean completness 0.987335200247408"
## [1] "Sum. recovery 0.0695230396119644"
## [1] "Undetected groups 48 out of 95"
## [1] "Detected pure and complete groups 13 out of 95"
## [1] "Detected real groups 47 out of 95"
We can now test different values in order to find the optimal eps_cl hyper-parameter:
# This is easy to turn into a function that takes a sequence of eps values and an OPTICS result.
eps_sequence_test <- seq(0.0001, 0.0007, 0.0002)
x_stats <- extract_stats_dbscan(eps_sequence_test, res)
## [1] "Extracting DBSCAN stats for epsI=1e-04"
## [1] "Extracting DBSCAN stats for epsI=3e-04"
## [1] "Extracting DBSCAN stats for epsI=5e-04"
## [1] "Extracting DBSCAN stats for epsI=7e-04"
Show the results obtained:
print_global_stats(x_stats, eps_sequence_test)
## [1] "############### DATA FOR eps values ############"
## [1] "#############################################"
## [1] ""
## [1] ""
## [1] "Completeness 0.104655712050078 0.587461969390989 0.900311864752456 0.976725345423947"
## [1] "Purity 1 0.780112362242708 0.487567823777937 0.363354522087879"
## [1] "Groups 2 89 242 256"
## [1] "Clusters 2 117 245 256"
## [1] "EPS 1e-04 3e-04 5e-04 7e-04"
## [1] "True Groups 95"
## [1] "Und. Groups 93 45 32 43"
## [1] "Complete gr.: 0 67 233 255"
## [1] "Pure gr.: 2 83 59 30"
## [1] "P.+ C. gr.: 0 38 47 30"
## [1] "Real groups: 2 78 66 52"
## [1] "Fr: 0 0.322033898305085 0.191056910569106 0.116731517509728"
## [1] "Fp: 0.666666666666667 0.703389830508475 0.239837398373984 0.116731517509728"
## [1] "FC: 0 0.567796610169492 0.947154471544715 0.992217898832685"
## [1] "Spurious: 0 1.53846153846154 6.43673469387755 13.1328125"
## [1] "Bad class: 48.5 12.6324786324786 1.11836734693878 0.3828125"
## [1] "Recovery: 0 0.234438156831043 0.265157639450283 0.0994341147938561"
Let's plot completeness, purity and recovery:
The value marked in the plots is \(\epsilon = 0.00037\), where purity and completeness are both about 68% and recovery is about 25%.
seleced_eps<- 0.00037
plot_purity_completeness(
eps_sequence_test,
x_stats$purity_list,
x_stats$completeness_list,
x_stats$recovery,
c('Purity', 'Completeness', 'Recovery'),
"Purity/completeness on raw data",
seleced_eps, 'eps', 'Percentage'
)
This chart shows that the optimal point is around \(0.00037\). It is at this value that completeness and purity balance each other while purity is still high.
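One way to make this choice reproducible is to locate where the purity and completeness curves cross by linear interpolation between the tested eps values. The sketch below uses the purity and completeness figures printed above (rounded to four decimals); the helper name `crossing_eps` is hypothetical and linear interpolation between grid points is an assumption, so the result is only indicative on such a coarse grid.

```r
# Hypothetical helper: eps at which purity and completeness cross,
# by linear interpolation between the two surrounding grid points.
crossing_eps <- function(eps, purity, completeness) {
  d <- purity - completeness
  # First interval where the difference changes from positive to non-positive.
  k <- which(d[-length(d)] > 0 & d[-1] <= 0)[1]
  if (is.na(k)) return(NA_real_)
  t <- d[k] / (d[k] - d[k + 1])
  eps[k] + t * (eps[k + 1] - eps[k])
}

# Values taken from the printed stats for the raw (non-scaled) data.
eps_grid     <- c(1e-4, 3e-4, 5e-4, 7e-4)
purity       <- c(1.0000, 0.7801, 0.4876, 0.3634)
completeness <- c(0.1047, 0.5875, 0.9003, 0.9767)

eps_cross <- crossing_eps(eps_grid, purity, completeness)
print(eps_cross)
```

The interpolated crossing falls between 3e-04 and 5e-04, close to the value of 0.00037 used in the plots.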
plot_purity_completeness(
eps_sequence_test,
x_stats$real_list/true_groups,
x_stats$und_gr/true_groups,
x_stats$pure_complet/true_groups,
c('Detected', 'Undetected gr.', 'Pure + Complet.'),
"Group global % detection stats on raw data",
seleced_eps , 'eps', 'Percentage'
)
According to the previous chart, the optimal value is around \(\epsilon = 6 \times 10^{-4}\), since near this value the number of pure and complete groups reaches its maximum and the number of undetected groups remains at its minimum.
plot_purity_completeness(
eps_sequence_test,
x_stats$und_gr,
x_stats$real_list,
true_groups - 0,
c('Undetected', 'Groups', 'True groups'),
"Total detection on raw data",
seleced_eps , 'eps', 'Groups number'
)
Once again the optimal point is at \(\epsilon = 3.7 \times 10^{-4}\).
As theory suggests, HDBSCAN does not produce a good model here because of its ability to detect clusters in sparse areas, which causes it to label noise as clusters.
cl <- hdbscan(points, minPts = 5)
length(unique( cl$cluster))
## [1] 376
plot3D_cluster(cl, mm)
HDBSCAN does not work well because it detects clusters in sparser areas, which results in clusters being detected in noise regions.
Density Peak Clustering (DPC): Alex Rodriguez and Alessandro Laio (2014).
https://github.com/thomasp85/densityClust
Proceeding this way, some clusters appear:
galaxyDens <- densityClust(points)
galaxyClusters <- findClusters(galaxyDens, rho=0.997, delta=0.001)
plot(galaxyClusters)
# Do not run this: it takes a long time
#plotMDS(galaxyClusters)
mm$cluster_id <- galaxyClusters$cluster
all <- execute_stats(mm, galaxyClusters)
print_stats(all)
## [1] "Mean purity 0.5961860903516"
## [1] "Mean completness 0.989302040312048"
## [1] "Sum. recovery 0.325788197251415"
## [1] "Undetected groups 36 out of 95"
## [1] "Detected pure and complete groups 13 out of 95"
## [1] "Detected real groups 65 out of 95"
DPC is analogous to HDBSCAN: the model does not fit well for the same reason, namely detecting clusters in noise regions.
DPC may not be good at finding clusters, but it can be useful for finding their centers by locating the density peaks, as we show below:
galaxyClusters <- findClusters(galaxyDens, rho=0.9985, delta=0.00084)
peaks <- mm[galaxyClusters$peaks,]
print(sprintf("DPC detected %s out of %s clusters", length(unique(peaks$GROUP_ID)), true_groups))
## [1] "DPC detected 113 out of 95 clusters"
In fact we can create a heat-map for \(\rho\) and \(\delta\) hyper-parameters:
rhos <- seq(0.9980, 0.999, 0.0001)
deltas <- seq(0.00075, 0.00085, 0.00001)
ll <- length(deltas)
j<-1
i<-ll
matrix_data=matrix(ncol=length(rhos), nrow = length(deltas))
for (rho in rhos){
for (delta in deltas){
galaxyClusters <- findClusters(galaxyDens, rho=rho, delta=delta)
peaks <- mm[galaxyClusters$peaks,]
l <- length(unique(peaks$GROUP_ID))
#print(min(l/true_groups, true_groups/l))
matrix_data[i, j] <- min(l/true_groups, true_groups/l)
i <- i-1
}
j <- j+1
i<-ll
}
custom_heatmap(matrix_data, deltas, rhos, xTitle = "rho", yTitle = "delta", numColors = 11)
As we can see, \(\delta=0.00080\) and \(\rho=0.9985\) produce the best results, with 99% group-center detection. However, we cannot conclude that every detected center represents an original cluster, so the center detection did not work.
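The caveat follows from what the heat-map score measures. Each cell is `min(l/true_groups, true_groups/l)`, which compares only the number of distinct GROUP_ID values among the peaks with the number of true groups, not which groups they are. A small sketch with a hypothetical helper name and made-up identifiers:

```r
# Symmetric ratio used in the heat-map: 1 when the two counts agree.
count_ratio_score <- function(found, expected) {
  min(found / expected, expected / found)
}

# The score for the run printed earlier (113 distinct IDs vs 95 true groups).
score_113 <- count_ratio_score(113, 95)
print(score_113)

# Made-up example: as many peaks as true groups, but none of them
# falls in a true group. The count-based score is still perfect.
true_ids <- 1:5
peak_ids <- 101:105
score_disjoint   <- count_ratio_score(length(unique(peak_ids)), length(true_ids))
matched_fraction <- length(intersect(peak_ids, true_ids)) / length(true_ids)
print(c(score = score_disjoint, matched = matched_fraction))
```

A stricter check, under the same assumptions, would report something like `matched_fraction`, computed against the GROUP_ID values of the groups with at least min_members members.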
galaxyClusters <- findClusters(galaxyDens, rho=0.9985, delta=0.0008)
peaks <- mm[galaxyClusters$peaks,]
plot3d(peaks$x, peaks$y, peaks$z, col = peaks$GROUP_ID+1,
size = 3, xlab = "X", ylab = "Y", zlab = "Z")
We now scale the data:
points_scaled <- scale(points)
ress <- optics(points_scaled, minPts = min_members)
#optimal value obtained
blo_scans <- extractDBSCAN(ress, eps_cl = 0.025)
mm$cluster_id <- blo_scans$cluster
mm5 <- get_elements_in_m5_groups(mm)
Again we can do the same for scaled data:
eps_sequence_test <- seq(0.025, 0.04, 0.005)
x_stats <- extract_stats_dbscan(eps_sequence_test, ress)
## [1] "Extracting DBSCAN stats for epsI=0.025"
## [1] "Extracting DBSCAN stats for epsI=0.03"
## [1] "Extracting DBSCAN stats for epsI=0.035"
## [1] "Extracting DBSCAN stats for epsI=0.04"
print_global_stats(x_stats, eps_sequence_test)
## [1] "############### DATA FOR eps values ############"
## [1] "#############################################"
## [1] ""
## [1] ""
## [1] "Completeness 0.473561897732598 0.616829208904767 0.692938249329233 0.797341370519604"
## [1] "Purity 0.850963117342428 0.763104516529589 0.692646481028888 0.606210510933657"
## [1] "Groups 61 99 123 160"
## [1] "Clusters 87 122 143 174"
## [1] "EPS 0.025 0.03 0.035 0.04"
## [1] "True Groups 95"
## [1] "Und. Groups 56 43 33 31"
## [1] "Complete gr.: 40 75 104 145"
## [1] "Pure gr.: 69 84 83 73"
## [1] "P.+ C. gr.: 24 39 49 51"
## [1] "Real groups: 65 75 82 78"
## [1] "Fr: 0.272727272727273 0.317073170731707 0.340277777777778 0.291428571428571"
## [1] "Fp: 0.784090909090909 0.682926829268293 0.576388888888889 0.417142857142857"
## [1] "FC: 0.454545454545455 0.609756097560976 0.722222222222222 0.828571428571429"
## [1] "Spurious: 0.735632183908046 1.40983606557377 2.46853146853147 3.82183908045977"
## [1] "Bad class: 20.4022988505747 10.7131147540984 8.09090909090909 4.49425287356322"
## [1] "Recovery: 0.144704931285368 0.398544866612773 0.576394502829426 0.473726758286176"
Results look slightly better when all variables are scaled to mean = 0 and sd = 1. The optimal value of eps gives more than 69% for both purity and completeness.
The same plots as before:
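As a reminder of what `scale()` does, the sketch below standardizes a made-up coordinate matrix; the ranges are hypothetical and are not taken from the catalog. Each column is centered and divided by its own standard deviation, so distances are no longer in the original units. That is why the eps grid moves from the 1e-4 range to the 0.025 to 0.04 range, and an axis with a small spread gains relative weight.

```r
set.seed(1)
# Made-up coordinates: z spans a range ten times smaller than x and y.
toy <- cbind(
  x = runif(200, 0, 0.10),
  y = runif(200, 0, 0.10),
  z = runif(200, 0, 0.01)
)
toy_scaled <- scale(toy)      # center each column, then divide by its sd

col_means <- colMeans(toy_scaled)
col_sds   <- apply(toy_scaled, 2, sd)
print(round(col_means, 10))
print(col_sds)
print(attr(toy_scaled, "scaled:scale"))   # the per-column sd that was divided out

# The same pair of points before and after scaling.
d_raw    <- sqrt(sum((toy[1, ] - toy[2, ])^2))
d_scaled <- sqrt(sum((toy_scaled[1, ] - toy_scaled[2, ])^2))
print(c(raw = d_raw, scaled = d_scaled))
```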
seleced_eps <- 0.035
plot_purity_completeness(
eps_sequence_test,
x_stats$purity_list,
x_stats$completeness_list,
x_stats$recovery,
c('Purity', 'Completeness', 'Recovery'),
"Purity/completeness on scaled data",
seleced_eps, 'eps', 'Percentage'
)
plot_purity_completeness(
eps_sequence_test,
x_stats$und_gr,
x_stats$real_list,
true_groups - 0,
c('Undetected', 'Groups', 'True groups'),
"Total detection on scaled data",
seleced_eps , 'eps', 'Groups number'
)
We have set minPts = min_members, which is a reasonable value for what counts as a group (cluster).
Following DBSCAN theory, we can look for an elbow in the k-nearest-neighbor distance plot:
kNNdistplot(x = points_scaled, k = min_members)
abline(h = seleced_eps, lty = 3)
As the reader can see, the "elbow method" does not apply here. The reason is that the elbow method tries to optimize the variances within clusters, whereas we are trying to detect virialized galaxy groups, and the algorithms are not intrinsically designed to account for the specific density profiles of dark matter halos.
The elements whose groups were not detected are:
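For reference, the curve inspected in such a plot is, as far as we understand, the sorted distance from each point to its k-th nearest neighbor. A base R sketch on made-up points follows; the helper name `knn_distance` is hypothetical, and the full distance matrix it builds is only practical for small samples.

```r
# Hypothetical helper: distance from each point to its k-th nearest neighbor.
knn_distance <- function(points, k) {
  d <- as.matrix(dist(points))
  # After sorting a row, position 1 is the point itself (distance 0).
  apply(d, 1, function(row) sort(row)[k + 1])
}

# Made-up points on a line: a tight run of four, then two increasingly isolated ones.
toy <- cbind(x = c(0, 1, 2, 3, 10, 25), y = 0)
sorted_knn <- sort(knn_distance(toy, k = 2))
print(unname(sorted_knn))
```

In this toy case the jump in the sorted distances marks the isolated points. In a sample where the density changes smoothly, no such jump needs to exist, which is one way to read the absence of a clear elbow.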
blo_scans <- extractDBSCAN(ress, eps_cl = seleced_eps)
mm$cluster_id <- blo_scans$cluster
all <- execute_stats(mm, blo_scans)
undetected <- get_elements_not_in_groups(mm5, all)
detected <- get_elements_in_groups(mm, all)
print(sprintf('Undetected groups: %s', length(unique(undetected$GROUP_ID))))
## [1] "Undetected groups: 33"
print(sprintf('Detected groups: %s', length(unique(detected$GROUP_ID))))
## [1] "Detected groups: 123"
Plot detected groups
plot3d(detected$x, detected$y, detected$z,
col = detected$cluster_id +1, size = 2,
xlab = "X", ylab = "Y", zlab = "Z")
Plot undetected groups
plot3d(undetected$x, undetected$y, undetected$z,
col = undetected$GROUP_ID, size = 2,
xlab = "X", ylab = "Y", zlab = "Z")
We will use the code that calculates the s-distance, which consists of an elongation along the line of sight.
We can compute a distance matrix and apply OPTICS with this new metric. Executing the following code may take a long time:
sDistances matrix calculation:
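The contents of slos.r are not shown here, so the following is only a generic sketch of what a line-of-sight-aware distance can look like, not the s-distance actually used. It assumes the observer is at the origin, takes the line of sight of a pair as the direction of the pair's midpoint, and rescales the separation along that direction by a hypothetical factor `los_factor`; the function name is made up as well.

```r
# Hypothetical metric (not the slos.r code): Euclidean distance with the
# line-of-sight component of the separation rescaled by los_factor.
los_scaled_distance <- function(a, b, los_factor) {
  mid <- a + b
  u <- mid / sqrt(sum(mid^2))              # unit vector towards the pair's midpoint
  s <- b - a                               # separation vector
  los <- sum(s * u)                        # component along the line of sight
  perp2 <- max(0, sum(s^2) - los^2)        # squared transverse component
  sqrt(perp2 + (los_factor * los)^2)
}

# A pair aligned with the line of sight: the separation is rescaled.
d_radial <- los_scaled_distance(c(1, 0, 0), c(2, 0, 0), los_factor = 0.5)
# A pair separated across the line of sight: the separation is unchanged.
d_transverse <- los_scaled_distance(c(1, 1, 0), c(1, -1, 0), los_factor = 0.5)
# With los_factor = 1 the metric reduces to the Euclidean distance.
d_euclid <- los_scaled_distance(c(1, 2, 3), c(2, 0, 1), los_factor = 1)
print(c(radial = d_radial, transverse = d_transverse, euclid = d_euclid))
```

Under these assumptions, a factor below 1 brings together galaxies that are stretched apart along the line of sight while leaving transverse separations untouched.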
SLOS <- 0
a<- mm[,c('x', 'y', 'z', 'dist', 'redshift')]
distance_matrix <- get_matrix_of_distances(a)
# The following code takes a long time to execute
dist_object <- as.dist(distance_matrix)
Apply OPTICS algorithm and then obtain a reachability plot:
sres <- optics(dist_object, minPts = min_members)
plot(sres)
sblo_scan <- extractDBSCAN(sres, eps_cl = 0.00025)
mm$cluster_id <- sblo_scan$cluster
#mm5 <- get_elements_in_groups(mm)
plot(sblo_scan)
We obtain a notable improvement:
sall <- execute_stats(mm, sblo_scan)
print_stats(sall)
## [1] "Mean purity 0.431980396663884"
## [1] "Mean completness 0.99406555643669"
## [1] "Sum. recovery 0.29749393694422"
## [1] "Undetected groups 30 out of 95"
## [1] "Detected pure and complete groups 13 out of 95"
## [1] "Detected real groups 65 out of 95"
Test several hyper-parameters:
#optimal value found at 0.00015
#eps_sequence_test <- seq(0.000085, 0.00020, 0.00005)
eps_sequence_test <- seq(0.00005, 0.00025, 0.00005)
sx_stats <- extract_stats_dbscan(eps_sequence_test, sres)
## [1] "Extracting DBSCAN stats for epsI=5e-05"
## [1] "Extracting DBSCAN stats for epsI=1e-04"
## [1] "Extracting DBSCAN stats for epsI=0.00015"
## [1] "Extracting DBSCAN stats for epsI=2e-04"
## [1] "Extracting DBSCAN stats for epsI=0.00025"
print_global_stats(sx_stats, eps_sequence_test)
## [1] "############### DATA FOR eps values ############"
## [1] "#############################################"
## [1] ""
## [1] ""
## [1] "Completeness 0.568991669563859 0.820834474427593 0.953146668561144 0.975637190444883 0.99406555643669"
## [1] "Purity 0.86246825950707 0.719006683502973 0.597271129268692 0.513046960218581 0.431980396663884"
## [1] "Groups 56 137 191 260 291"
## [1] "Clusters 64 144 191 260 291"
## [1] "EPS 5e-05 1e-04 0.00015 2e-04 0.00025"
## [1] "True Groups 95"
## [1] "Und. Groups 51 16 13 18 30"
## [1] "Complete gr.: 40 131 189 259 291"
## [1] "Pure gr.: 53 84 79 77 55"
## [1] "P.+ C. gr.: 30 74 77 77 55"
## [1] "Real groups: 52 86 82 77 65"
## [1] "Fr: 0.461538461538462 0.510344827586207 0.401041666666667 0.295019157088123 0.188356164383562"
## [1] "Fp: 0.815384615384615 0.579310344827586 0.411458333333333 0.295019157088123 0.188356164383562"
## [1] "FC: 0.615384615384615 0.903448275862069 0.984375 0.992337164750958 0.996575342465753"
## [1] "Spurious: 0.875 2.8125 4.95811518324607 6.29615384615385 9.47079037800687"
## [1] "Bad class: 14.15625 4.31944444444444 0.340314136125654 0.130769230769231 0.0240549828178694"
## [1] "Recovery: 0.285367825383994 0.68714632174616 0.638641875505255 0.455133387227163 0.29749393694422"
seleced_eps <- 0.0001
plot_purity_completeness(
eps_sequence_test,
sx_stats$purity_list,
sx_stats$completeness_list,
sx_stats$recovery,
c('Purity', 'Completeness', 'Recovery'),
"Purity/completeness on sLOS data",
seleced_eps, 'eps', 'Percentage'
)
The results improve notably with the sLOS distances.
Finally, for groups:
plot_purity_completeness(
eps_sequence_test,
sx_stats$real_list/true_groups,
sx_stats$und_gr/true_groups,
sx_stats$pure_complet/true_groups,
c('Detected', 'Undetected gr.', 'Pure + Complet.'),
"Group global % detection stats on sLOS data",
seleced_eps , 'eps', 'Percentage'
)
plot_purity_completeness(
eps_sequence_test,
sx_stats$und_gr,
sx_stats$real_list,
true_groups - 0,
c('Undetected', 'Groups', 'True groups'),
"Total detection on sLOS data",
seleced_eps , 'eps', 'Groups number'
)
Detected and undetected groups:
sblo_scan <- extractDBSCAN(sres, eps_cl = seleced_eps)
mm$cluster_id <- sblo_scan$cluster
all <- execute_stats(mm, sblo_scan)
undetected <- get_elements_not_in_groups(mm5, all)
detected <- get_elements_in_groups(mm, all)
print(sprintf('Undetected groups: %s', length(unique(undetected$GROUP_ID))))
## [1] "Undetected groups: 16"
print(sprintf('Detected groups: %s', length(unique(detected$GROUP_ID))))
## [1] "Detected groups: 137"
Plot detected groups
plot3d(detected$x, detected$y, detected$z,
col = detected$GROUP_ID +1, size = 2,
xlab = "X", ylab = "Y", zlab = "Z")
Plot not detected groups
plot3d(undetected$x, undetected$y, undetected$z,
col = undetected$GROUP_ID +1, size = 2,
xlab = "X", ylab = "Y", zlab = "Z")
The best results obtained with each method:
| Method | Data Sample | Outcomes | Conclusion |
|---|---|---|---|
| OPTICS | Non-scaled | - | Good in cluster reachability plot |
| OPTICS Xi hierarchical method | Non-scaled | Not applicable. | No valuable results were found. |
| DBSCAN | Non-scaled | P: 0.68 C: 0.68 R: 0.25 | Good in cluster detection but low recovery, which indicates the groups are not globally recovered. |
| HDBSCAN | Non-scaled | - | Not good in cluster detection. |
| DPC | Non-scaled | 99% group-center detection. | Good in group center detection, but detected centers do not match original groups. \(\delta=0.0008\) and \(\rho=0.9985\) |
| sOPTICS | Scaled | P: 0.72 C: 0.82 R: 0.69 U: 16 | eps = 0.0001 |
| OPTICS | Scaled | - | Good in cluster reachability plot |
| DBSCAN | Scaled | P: 0.69 C: 0.69 R: 0.57 U: 33 | Good in cluster detection with eps = 0.035 but low recovery, which indicates the groups are not globally recovered. |
In conclusion, in this study we took a sample from the SDSS catalog and selected 7229 objects within an area between 90° and 150° in RA and between 15° and 25° in DEC. We tried several group-detection methods; selecting groups with a minimum of 5 elements (galaxies) gave 95 groups.
We can highlight the following points:
The sOPTICS method with DBSCAN extraction reached the highest values in completeness, purity and recovery, and in group detection (73%).
The DPC algorithm worked almost perfectly in group-center detection (it reached 99%), but the detected centers do not match the original groups.
Normalizing or scaling the data only slightly improves the results. This is because all the data are already on approximately the same scale: the coordinates x, y, z and dist are all expressed in the same units and in the same range of values.
Finally, we compare the original groups with the best clustering obtained in this study:
| Actual groups distribution | Best cluster detection with sOPTICS. |
|---|---|