1 The clustering problem on galaxy dataset

Rather than pursuing a traditional unsupervised clustering optimization—such as variance maximization via the Elbow Method—this work evaluates the efficacy of density-based clustering algorithms in replicating a physically motivated halo-based group finder.

We aim to identify the optimal hyperparameter configurations that allow these algorithms to recover the underlying dark matter halo membership of galaxies, effectively using the halo-based catalog as a benchmark for physical validity.

2 Loading data files

2.1 Galaxy catalog for 2dFGRS

## 'data.frame':    104912 obs. of  9 variables:
##  $ GAL_ID  : int  2 3 5 6 7 8 11 12 13 14 ...
##  $ ra      : num  3.63 3.59 3.61 3.61 3.61 ...
##  $ dec     : num  -33 -32.4 -32.7 -32.9 -33 ...
##  $ x       : num  0.1 0.0854 0.0849 0.1063 0.0896 ...
##  $ y       : num  0.00634 0.00535 0.00535 0.00671 0.00566 ...
##  $ z       : num  -0.065 -0.0543 -0.0547 -0.0688 -0.0584 ...
##  $ redshift: num  0.123 0.104 0.104 0.131 0.11 ...
##  $ dist    : num  0.119 0.101 0.101 0.127 0.107 ...
##  $ GROUP_ID: int  12097 542 4846 4847 1462 1462 542 4846 4847 1462 ...

3 Generic functions for assessments

3.1 Functions to asses outcomes

To assess the outcomes we will based our analysis on tree basic concepts:

Purity(P): measure of output-cluster: proportion of members coming exclusively from a single true group, providing confidence that the algorithm correctly groups members together. A high purity rate indicates the algorithm’s effectiveness in identifying true groups.

Completeness (C): measure of an output-cluster: proportion of data true-group elements included in an output-cluster. A cluster is “complete” if contains all points of the original true group.

Recovery (R): measure of an output-cluster (much more restrictive): proportion of output-clusters which are both pure and complete. For this study we consider an output-cluster to be pure if P>=0.66 (at least 2/3 of elements of an output-cluster belong to a single group). An original cluster is complete if C>=0.5 (at least half data belong to an original true group).

Some other stats are:

total_in_group: number of elements in a given group.

total_in_cluster: number of elements in a given output-cluster.

total_in_cluster_group: number of elements in a given output-cluster belonging to a majority-group.

undetected_groups: original groups not detected as majority-group in output-clusters.

detected_groups: original groups detected as majority-group in output-clusters.

Following code is aimed to asses the outcomes obtained with from an given output-cluster:

sys.source(sprintf("%s\\assess.r", includes_folder), envir = knitr::knit_global())

Following code is intended to calculate stats

sys.source(sprintf("%s\\calculate_stats.r", includes_folder), envir = knitr::knit_global())

We generated some functions to help us visualize the results graphically:

sys.source(sprintf("%s\\plotting_functions.r", includes_folder),  envir = knitr::knit_global())

Last file to include content function to compute slos distance:

sys.source(sprintf("%s\\slos.r", includes_folder),  envir = knitr::knit_global())

4 Data Preprocessing

4.1 Descriptive analysis and visualization

There is an initial pre-processing of data-file in order to obtain proper distances and Cartesian coordinates x,y,z, by now it is omitted here.

Take a sample bounded by minPts = 5 and in RA and DEC:

\[RA \in [0, 2]h,\, DEC \in [-36, -25]º,\, \,and\,\, z \lt max\_redshift\]

Then we take the initial values and take a look at the remaining distribution:

SLOS <- 0 #Here the slos is not fixed beforehand
min_members <- 5
max_redshift <- 0.1
min_redshift <- 0.0
ra_lim_inf <-  0
ra_lim_sup <- 30
dec_lim_inf <- -36
dec_lim_sup <-  -25

# Take a sample using boundaries
mm <- dt[ dt$ra<= ra_lim_sup & dt$ra>=ra_lim_inf & 
            dt$dec>= dec_lim_inf & dt$dec<=dec_lim_sup &
            dt$redshift< max_redshift & dt$redshift> min_redshift,]


# mm is an object containing both groups and galaxy identification
ggplot(mm, aes(x=redshift, y=redshift))+geom_violin()

dim(mm)
## [1] 6173    9

Select groups with more than min_members members queries:

h<-sqldf("select 
            count(GAL_ID) as members, 
            GROUP_ID 
          from 
            mm 
          group by 
            GROUP_ID 
          order by  
            members desc")
mm5<-sqldf(sprintf("
    SELECT 
        mm.GAL_ID,
        mm.x, 
        mm.y, 
        mm.z, 
        mm.GROUP_ID, 
        mm.redshift, 
        mm.dist
      FROM 
        mm as mm, h 
      where 
          mm.GROUP_ID=h.GROUP_ID and 
          h.members >= %s"
      , min_members))

Then use it to find the target data:

true_groups <- length(unique(mm5$GROUP_ID))
number_non_isolated_galaxies <- dim(mm5)[1]
number_isolated_galaxies <- dim(mm)[1] - dim(mm5)[1]
print(sprintf('Number of galaxies in groups with 
              more than %s elements %s out of %s, aprox %s percent', 
          min_members, 
          number_non_isolated_galaxies, 
          dim(mm)[1], format(number_non_isolated_galaxies * 100
                                /dim(mm)[1], digits=4)))
## [1] "Number of galaxies in groups with \n              more than 5 elements 857 out of 6173, aprox 13.88 percent"
print(sprintf("Number of groups with more than %s members: %s", 
              min_members, 
              true_groups))
## [1] "Number of groups with more than 5 members: 92"

Do not work:

tt<-sqldf(c("
       UPDATE mm
      set group_id=0
       WHERE group_id NOT IN (select distinct(group_id) from mm5)
       ", "select * from main.mm"))
## Warning in result_fetch(res@ptr, n = n): `dbGetQuery()`, `dbSendQuery()` and
## `dbFetch()` should only be used with `SELECT` queries. Did you mean
## `dbExecute()`, `dbSendStatement()` or `dbGetRowsAffected()`?

We take a look at the groups with more than min_members:

hhh<- h[h$members>=min_members, ]
ttable <- table(hhh$GROUP_ID, hhh$members)
barplot(ttable, col=('red'), 
        main=sprintf("Group distribution (%s) with at least %s members ", 
                     true_groups, 
                     min_members))

boxplot(hhh$members, main="Boxplot of num of members in groups")

Lets take a look at the complete target sample:

plot3d(mm$x, mm$y, mm$z, col = 'black', 
       size = 1, xlab = "X", ylab = "Y", zlab = "Z")

aa <- sqldf("select 
          GAL_ID,
          x, 
          y, 
          z, 
            case 
              when group_id IN (Select GROUP_ID from mm5) then group_id
          else 0
            end as group_id, 
          redshift, 
          dist
      from 
            mm")

plot3d(aa$x, aa$y, aa$z, col = aa$group_id+1, 
       size = 2, xlab = "X", ylab = "Y", zlab = "Z")

And the groups with more than min_members:

plot3d(mm5$x, mm5$y, mm5$z, col = mm5$GROUP_ID, 
       size = 2, xlab = "X", ylab = "Y", zlab = "Z")

5 Raw data processing

We will process the data without any kind of scale or normalization.

5.1 OPTICS

Optics clustering:

points<- mm[,c('x', 'y', 'z')]

#clustering
res <- optics(points, minPts = min_members)
plot(res)

In the previous plot we can see how OPTICS modeling valleys (clusters) and the peaks (cluster-separation).

Execute extractXI to obtain a hierarchical clustering with \(\xi=0.15\):

optics <- extractXi(res, xi=0.15)
plot(optics)

5.1.1 Output-clusters visualization

Take a plot of the clustering obtained:

plot3D_cluster(optics, mm)

As shown, this method do not work in clustering detection.

Stats with different values:

mm$cluster_id <- as.numeric(optics$cluster)
mm5 <- get_elements_in_m5_groups(mm)
all <- execute_stats(mm, optics)
print_stats(all)
## [1] "Mean purity 0.504594006834947"
## [1] "Mean completness 0.911911226028873"
## [1] "Sum. recovery  0.170361726954492"
## [1] "Undetected groups 53 out of 92"
## [1] "Detected pure and complete groups 13  out of 92"
## [1] "Detected real groups  41  out of 92"

This algorithm do not offer good results in cluster detection and will not be used in this study.

5.2 DBSCAN

We can directly apply extractDBSCAN on the OPTICS model.

blo_scan <- extractDBSCAN(res, eps_cl = 0.00075)
mm$cluster_id <- blo_scan$cluster
mm5 <- get_elements_in_m5_groups(mm)
plot(blo_scan)

5.2.1 Output-clusters visualization

We have, on one hand all groups with more than min_members members (made up by a reduced amount of galaxies from catalog)

in the other hand the output-clusters from DBSCAN:

plot3D_cluster(blo_scan, points)

all <- execute_stats(mm, blo_scan)
head(all, 5)
##   cluster_id group_id total_in_group total_in_cluster total_in_cluster_group
## 1          3     2535              3               12                      3
## 2        119     1649              4               28                      4
## 3        114      774              6               22                      6
## 4        165     1002              5               11                      4
## 5        145    12501              1                1                      1
##      purity completn spurious bad_class is_pur is_comp    recovery is_real
## 1 0.2500000      1.0        9         0      0       1 0.000000000       0
## 2 0.1428571      1.0       24         0      0       1 0.000000000       0
## 3 0.2727273      1.0       16         0      0       1 0.000000000       1
## 4 0.3636364      0.8        7         1      0       1 0.000000000       1
## 5 1.0000000      1.0        0         0      1       1 0.001166861       0
print_stats(all)
## [1] "Mean purity 0.527975866212904"
## [1] "Mean completness 0.923621835740949"
## [1] "Sum. recovery  0.325554259043174"
## [1] "Undetected groups 29 out of 92"
## [1] "Detected pure and complete groups 13  out of 92"
## [1] "Detected real groups  65  out of 92"

We can now test over different values in order to obtain optimal eps_cl hyper-parameter:

#It is easy to transform onto a function which admits a sequence and a res set.
eps_sequence_test <- seq(0.0004, 0.001, 0.0002)
x_stats <- extract_stats_dbscan(eps_sequence_test, res)
## [1] "Extracting DBSCAN stats for epsI=4e-04"
## [1] "Extracting DBSCAN stats for epsI=6e-04"
## [1] "Extracting DBSCAN stats for epsI=8e-04"
## [1] "Extracting DBSCAN stats for epsI=0.001"

Show the results obtained:

print_global_stats(x_stats, eps_sequence_test)
## [1] "############### DATA FOR eps values ############"
## [1] "#############################################"
## [1] ""
## [1] ""
## [1] "Completeness 0.658614803430983 0.873841178467001 0.927884965991974 0.973643216080402"
## [1] "Purity 0.753499621447735 0.653741237702159 0.496846448200032 0.3599267698837"
## [1] "Groups 73 126 179 199"
## [1] "Clusters 80 128 181 199"
## [1] "EPS 4e-04 6e-04 8e-04 0.001"
## [1] "True Groups 92"
## [1] "Und. Groups 37 21 36 51"
## [1] "Complete gr.: 59 119 178 198"
## [1] "Pure gr.: 51 63 41 17"
## [1] "P.+ C. gr.: 36 55 40 17"
## [1] "Real groups: 62 73 58 41"
## [1] "Fr: 0.444444444444444 0.426356589147287 0.21978021978022 0.085"
## [1] "Fp: 0.62962962962963 0.488372093023256 0.225274725274725 0.085"
## [1] "FC: 0.728395061728395 0.922480620155039 0.978021978021978 0.99"
## [1] "Spurious: 2.8875 4.9765625 8.1878453038674 13.8291457286432"
## [1] "Bad class: 4.625 1.109375 0.458563535911602 0.206030150753769"
## [1] "Recovery: 0.358226371061844 0.421236872812135 0.228704784130688 0.0851808634772462"

Lets get a plots for completeness and purity:

We have the optimal point at \(\epsilon = 6.10^{-4}\), $ where recovery =50% and completeness= 87% purity=65%

seleced_eps <- 0.0006
plot_purity_completeness(
  eps_sequence_test, 
  x_stats$purity_list, 
  x_stats$completeness_list, 
  x_stats$recovery,
  c('Purity', 'Completeness', 'Recovery'),
  "Purity/completeness on raw data",
  seleced_eps, 'eps', 'Percentage'
)

This chart shows that optimal point is around \(0.0006\). Is at this value where completeness is maximum and purity is still high.

plot_purity_completeness(
  eps_sequence_test, 
  x_stats$real_list/true_groups, 
  x_stats$und_gr/true_groups, 
  x_stats$pure_complet/true_groups,
  c('Detected', 'Undetected gr.', 'Pure + Complet.'),
  "Group global % detection stats on raw data",
  seleced_eps , 'eps', 'Percentage'
)

For groups detection, we have:

plot_purity_completeness(
  eps_sequence_test, 
  x_stats$und_gr,
  x_stats$real_list, 
  true_groups - 0,
  c('Undetected', 'Groups', 'True groups'),
  "Total detection on raw data",
   seleced_eps , 'eps', 'Groups number'
)

6 HDBSCAN data processing

As said from theory, HDBSCAN does not generate a great model because it ability to detect clusters in sparse areas. It cause detect noise as clusters.

cl <- hdbscan(points, minPts = 5)
length(unique( cl$cluster))
## [1] 278

6.1 Output-clusters visualization

plot3D_cluster(cl, mm)

HDBSCAN do not work pretty well because it detects cluster in sparser areas which gives as a result cluster detection on noise regions.

7 Density Peaks Clustering(DPC )

Alex Rodriguez and Alessandro Laio (2014).

https://github.com/thomasp85/densityClust

By making this way it appears some clusters:

galaxyDens <- densityClust(points)
galaxyClusters <- findClusters(galaxyDens, rho=0.997, delta=0.00086)
plot(galaxyClusters)

# do not use this takes a lot!!:
#plotMDS(galaxyClusters)

mm$cluster_id <- galaxyClusters$cluster
all <- execute_stats(mm, galaxyClusters)
print_stats(all)
## [1] "Mean purity 0.709011309334738"
## [1] "Mean completness 0.984867626208164"
## [1] "Sum. recovery  1.41540256709452"
## [1] "Undetected groups 23 out of 92"
## [1] "Detected pure and complete groups 13  out of 92"
## [1] "Detected real groups  78  out of 92"

DPC is analogous to HDBCAN: the model do not fit well for the same reason: detecting clusters in noise regions.

May be DPC is not good in finding cluster but it can be useful in finding the centers by finding the peaks of density, as we show bellow:

galaxyClusters <- findClusters(galaxyDens, rho=0.9985, delta=0.00084)
peaks <- mm[galaxyClusters$peaks,]
print(sprintf("DPC detected %s out of %s clusters", length(unique(peaks$GROUP_ID)), true_groups))
## [1] "DPC detected 87 out of 92 clusters"

In fact we can create a heat-map for \(\rho\) and \(\delta\) hyper-parameters:

rhos <- seq(0.9980, 0.999, 0.0001)
deltas <- seq(0.0008, 0.0009, 0.00001)

ll <-  length(deltas)
j<-1
i<-ll

matrix_data=matrix(ncol=length(rhos), nrow = length(deltas)) 
for (rho in rhos){
    for (delta in deltas){
        galaxyClusters <- findClusters(galaxyDens, rho=rho, delta=delta)
        peaks <- mm[galaxyClusters$peaks,]
        l <- length(unique(peaks$GROUP_ID))
        #print(min(l/true_groups, true_groups/l))
        matrix_data[i, j] <- min(l/true_groups, true_groups/l)
        #matrix_data[j, i] <- min(l/true_groups, true_groups/l)
        i <- i-1
    }
    j <- j+1
    i<-ll
}


custom_heatmap(matrix_data, deltas, rhos, xTitle = "rho", yTitle = "delta", numColors = 11)

In which case, we can see that $\delta=0.00083$$ and \(\rho=0.9985\) produce the best results with a 97% of group-centers detection, however we can not conclude that all centers represent any original cluster, so the center detection did not work.

In fact if we check the real groups with less than min_members:

rhos <- seq(0.9980, 0.999, 0.0001)
deltas <- seq(0.0008, 0.0009, 0.00001)

ll <-  length(deltas)
j<-1
i<-ll

matrix_data=matrix(ncol=length(rhos), nrow = length(deltas)) 
for (rho in rhos){
    for (delta in deltas){
        galaxyClusters <- findClusters(galaxyDens, rho=rho, delta=delta)
        peaks <- mm5[galaxyClusters$peaks,]
        l <- length(unique(peaks$GROUP_ID))
        #print(min(l/true_groups, true_groups/l))
        matrix_data[i, j] <- min(l/true_groups, true_groups/l)
        #matrix_data[j, i] <- min(l/true_groups, true_groups/l)
        i <- i-1
    }
    j <- j+1
    i<-ll
}


custom_heatmap(matrix_data, deltas, rhos, xTitle = "rho", yTitle = "delta", numColors = 11)

As we can see, the optimal hyper-parameter value set only get a 13% of original halos.

7.1 Output-clusters visualization

galaxyClusters <- findClusters(galaxyDens, rho=0.9985, delta=0.00083)
peaks <- mm[galaxyClusters$peaks,]
plot3d(peaks$x, peaks$y, peaks$z, col = peaks$GROUP_ID+1, 
         size = 3, xlab = "X", ylab = "Y", zlab = "Z")

And decision graph remains finally:

plot(galaxyClusters)
abline(h =0.00084, lty = 5)
abline(v=0.9985, col="blue")

Note that DPC is good in obtaining groups centers but not in separate the groups over the global datase as we can see

plot3d(mm$x, mm$y, mm$z, col = mm$cluster_id+1, 
       size = 3, xlab = "X", ylab = "Y", zlab = "Z")

8 Normalized data processing

We will perform a scale of data:

points_scaled <- scale(points)
ress <- optics(points_scaled, minPts = min_members)
#optimal value obtained
blo_scans <- extractDBSCAN(ress, eps_cl = 0.025)
mm$cluster_id <- blo_scans$cluster
mm5 <- get_elements_in_m5_groups(mm)

Again we can do the same for scaled data:

eps_sequence_test <- seq(0.035, 0.06, 0.005)
x_stats <- extract_stats_dbscan(eps_sequence_test, ress)
## [1] "Extracting DBSCAN stats for epsI=0.035"
## [1] "Extracting DBSCAN stats for epsI=0.04"
## [1] "Extracting DBSCAN stats for epsI=0.045"
## [1] "Extracting DBSCAN stats for epsI=0.05"
## [1] "Extracting DBSCAN stats for epsI=0.055"
## [1] "Extracting DBSCAN stats for epsI=0.06"
print_global_stats(x_stats, eps_sequence_test)
## [1] "############### DATA FOR eps values ############"
## [1] "#############################################"
## [1] ""
## [1] ""
## [1] "Completeness 0.707317251366059 0.809833422563758 0.869658392011775 0.882637162432709 0.914770226414963 0.936867074202601"
## [1] "Purity 0.759706171280357 0.721851866461037 0.688244775361701 0.65155829182502 0.606613692706431 0.538626956746148"
## [1] "Groups 77 92 103 128 146 166"
## [1] "Clusters 84 94 105 130 148 168"
## [1] "EPS 0.035 0.04 0.045 0.05 0.055 0.06"
## [1] "True Groups 92"
## [1] "Und. Groups 34 29 22 22 23 25"
## [1] "Complete gr.: 66 82 99 122 141 164"
## [1] "Pure gr.: 56 56 57 64 60 44"
## [1] "P.+ C. gr.: 42 47 51 56 54 43"
## [1] "Real groups: 65 65 72 72 71 69"
## [1] "Fr: 0.494117647058824 0.494736842105263 0.481132075471698 0.427480916030534 0.36241610738255 0.254437869822485"
## [1] "Fp: 0.658823529411765 0.589473684210526 0.537735849056604 0.488549618320611 0.402684563758389 0.260355029585799"
## [1] "FC: 0.776470588235294 0.863157894736842 0.933962264150943 0.931297709923664 0.946308724832215 0.970414201183432"
## [1] "Spurious: 3.41666666666667 4.23404255319149 4.79047619047619 5.03846153846154 5.63513513513514 6.49404761904762"
## [1] "Bad class: 3.52380952380952 1.68085106382979 1.20952380952381 0.884615384615385 0.635135135135135 0.482142857142857"
## [1] "Recovery: 0.450408401400233 0.41656942823804 0.420070011668611 0.411901983663944 0.337222870478413 0.283547257876313"

Results look sligthly better when all variables are scaled to a mean=0, sd=1. The optimal value of eps gives more than 81% and 72% for completeness and purity respectively .

The same plots before

seleced_eps <- 0.04
plot_purity_completeness(
  eps_sequence_test, 
  x_stats$purity_list, 
  x_stats$completeness_list, 
  x_stats$recovery,
  c('Purity', 'Completeness', 'Recovery'),
  "Purity/completeness on scaled data",
  seleced_eps, 'eps', 'Percentage'
)

plot_purity_completeness(
  eps_sequence_test, 
  x_stats$und_gr,
  x_stats$real_list, 
  true_groups - 0,
  c('Undetected', 'Groups', 'True groups'),
  "Total detection on scaled data",
   seleced_eps , 'eps', 'Groups number'
)

8.1 Elbow method

We have selected that minPts = min_members, which is a reasonable value for interpreting a group / clustering.

From the DBSCAN theory we can use the elbow on:

kNNdistplot(x = points_scaled, k = min_members)
abline(h =seleced_eps, lty = 3) 

How reader can see, the “elbow method” does not apply here, the reason for this is that elbow method try to optimize variances in clusters. Rather, we are trying to detect virialized galaxy groups, but algorithms are not intrinsically designed to account for the specific density profiles of dark matter halos.

8.2 Detected/Undetected original groups

The element whose groups were not detected are:

blo_scans <- extractDBSCAN(ress, eps_cl = seleced_eps)
mm$cluster_id <- blo_scans$cluster

all <- execute_stats(mm, blo_scans)

undetected <- get_elements_not_in_groups(mm5, all)
detected <- get_elements_in_groups(mm, all)

print(sprintf('Undetected groups: %s', length(unique(undetected$GROUP_ID))))
## [1] "Undetected groups: 29"
print(sprintf('Detected groups: %s', length(unique(detected$GROUP_ID))))
## [1] "Detected groups: 92"

Plot detected groups

plot3d(detected$x, detected$y, detected$z, 
       col = detected$GROUP_ID +1, size = 2, 
       xlab = "X", ylab = "Y", zlab = "Z")

Plot undetected groups

plot3d(undetected$x, undetected$y, undetected$z, 
       col = undetected$GROUP_ID, size = 2, 
       xlab = "X", ylab = "Y", zlab = "Z")

9 s-Distances (SLos application)

We will use the code to calculate the s-distance, which consists in make a elongation along the line of sight.

9.1 sDistances Matrix Calculation

We can calculate distance matrix to apply OPTICS to this new metric, executing following code may take a long time:

a<- mm[,c('x', 'y', 'z', 'dist', 'redshift')]
distance_matrix <- get_matrix_of_distances(a)
#Following code last a long time to execute
dist_object <- as.dist(distance_matrix)

9.2 OPTICS

Apply OPTICS algoritim and then obtain a reachability plot:

sres <- optics(dist_object, minPts = min_members)
plot(sres)

9.3 DBSCAN:

Extract DBSCAN:

sblo_scan <- extractDBSCAN(sres, eps_cl = 0.00025)
mm$cluster_id <- sblo_scan$cluster
#mm5 <- get_elements_in_groups(mm)
plot(sblo_scan)

We obtain a notable improvement comparing previous results:

sall <- execute_stats(mm, sblo_scan)
print_stats(sall)
## [1] "Mean purity 0.525621889527292"
## [1] "Mean completness 0.988224637681159"
## [1] "Sum. recovery  0.422403733955659"
## [1] "Undetected groups 21 out of 92"
## [1] "Detected pure and complete groups 13  out of 92"
## [1] "Detected real groups  71  out of 92"

Test several hyper-parameters:

#optimal value found at 0.00015
eps_sequence_test <- seq(0.000095, 0.00020, 0.00005)
sx_stats <- extract_stats_dbscan(eps_sequence_test, sres)
## [1] "Extracting DBSCAN stats for epsI=9.5e-05"
## [1] "Extracting DBSCAN stats for epsI=0.000145"
## [1] "Extracting DBSCAN stats for epsI=0.000195"
print_global_stats(sx_stats, eps_sequence_test)
## [1] "############### DATA FOR eps values ############"
## [1] "#############################################"
## [1] ""
## [1] ""
## [1] "Completeness 0.817313736056153 0.941412251447601 0.983205128205128"
## [1] "Purity 0.885230601527815 0.74739358371155 0.626339135997776"
## [1] "Groups 96 133 160"
## [1] "Clusters 100 134 160"
## [1] "EPS 9.5e-05 0.000145 0.000195"
## [1] "True Groups 92"
## [1] "Und. Groups 11 9 12"
## [1] "Complete gr.: 92 132 160"
## [1] "Pure gr.: 91 91 72"
## [1] "P.+ C. gr.: 83 90 72"
## [1] "Real groups: 85 84 80"
## [1] "Fr: 0.821782178217822 0.666666666666667 0.447204968944099"
## [1] "Fp: 0.900990099009901 0.674074074074074 0.447204968944099"
## [1] "FC: 0.910891089108911 0.977777777777778 0.993788819875776"
## [1] "Spurious: 1.35 2.91044776119403 4.5"
## [1] "Bad class: 2.33 0.417910447761194 0.08125"
## [1] "Recovery: 0.856476079346558 0.851808634772462 0.619603267211202"

We get to the best value:

seleced_eps <- 0.000108
plot_purity_completeness(
  eps_sequence_test, 
  sx_stats$purity_list, 
  sx_stats$completeness_list, 
  sx_stats$recovery,
  c('Purity', 'Completeness', 'Recovery'),
  "Purity/completeness on sLOS data",
  seleced_eps, 'eps', 'Percentage'
)

There is a notable improvement of the results with the sLos distances.

Finally, for groups detection:

plot_purity_completeness(
  eps_sequence_test, 
  sx_stats$real_list/true_groups, 
  sx_stats$und_gr/true_groups, 
  sx_stats$pure_complet/true_groups,
  c('Detected', 'Undetected gr.', 'Pure + Complet.'),
  "Group global % detection stats on sLOS data",
  seleced_eps , 'eps', 'Percentage'
)

plot_purity_completeness(
  eps_sequence_test, 
  sx_stats$und_gr,
  sx_stats$real_list, 
  true_groups - 0,
  c('Undetected', 'Groups', 'True groups'),
  "Total detection on sLOS data",
   seleced_eps , 'eps', 'Groups number'
)

Detected and undetected groups:

sblo_scan <- extractDBSCAN(sres, eps_cl = seleced_eps)
mm$cluster_id <- sblo_scan$cluster

all <- execute_stats(mm, sblo_scan)

undetected <- get_elements_not_in_groups(mm5, all)
detected <- get_elements_in_groups(mm, all)

print(sprintf('Undetected groups: %s', length(unique(undetected$GROUP_ID))))
## [1] "Undetected groups: 12"
print(sprintf('Detected groups: %s', length(unique(detected$GROUP_ID))))
## [1] "Detected groups: 105"

Plot for detected groups:

plot3d(detected$x, detected$y, detected$z, 
       col = detected$GROUP_ID +1, size = 2, 
       xlab = "X", ylab = "Y", zlab = "Z")

Plot for not detected groups:

plot3d(undetected$x, undetected$y, undetected$z, 
       col = undetected$GROUP_ID +1, size = 2, 
       xlab = "X", ylab = "Y", zlab = "Z")

10 Summary and conclusions

The bests results obtained with each method we summarize in next table:

Method Data Sample Outcomes Conclusion
OPTICS Non-scaled - Good in cluster reachability plot
OPTICS Xi hierarchical method Non-scaled Not applicable. Not valuable result were found.
DBSCAN Non-scaled

P: 0.65

C: 0.87

R: 0.42

U: 21

Good in cluster detection but low recovery. Which indicates the groups are not globally recovered.
HDBSCAN Non-scaled - Not good in cluster detection.
DPC Non-scaled 97% group-center detection.

Good in group center detection, but detected centers do not match original groups.

\(\delta=0.00084\) and \(\rho=0.9985\)

sOPTICS Non-scaled

P: 0.84

C: 0.84

R: 0.86

U: 12

Good on group detection with

eps=0.000108

OPTICS Scaled - Good in cluster reachability plot
DBSCAN Scaled

P: 0.72

C: 0.81

R: 0.41

U: 22

Good in cluster detection with

eps = 0.04 but low recovery. Which indicates the groups are not globally recovered.

11 Executive summary

In conclusion. We took for this study a sample from the 2dFGRS catalog and selected 6173 objects within an area between 0 and 2h in RA and -25 and -36º in DEC. We tried several methods in groups detecting selecting groups with a minimum of 5 elements (galaxies), we got 95 groups.

We can stand out the following points:

Finally we compare the original with the best clustering obtained in this study:

Actual groups distribution Best cluster detection with sOPTICS.