1 The clustering problem on galaxy dataset
- 1.1 Galaxy catalog for Real-SDSS
2 Generic functions for assessments
- 2.1 Functions to assess outcomes
3 Data Preprocessing
- 3.1 Descriptive analysis and visualization
4 Raw data processing
- 4.1 OPTICS
- 4.2 DBSCAN
5 HDBSCAN data processing
- 5.1 Output-clusters visualization
6 Density Peaks Clustering (DPC)
- 6.1 Output-clusters visualization
7 Normalized data processing
- 7.1 Elbow method
- 7.2 Detected/Undetected original groups
9 Summary and conclusions
10 Executive summary

1 The clustering problem on galaxy dataset

Rather than pursuing a traditional unsupervised clustering optimization—such as variance maximization via the Elbow Method—this work evaluates the efficacy of density-based clustering algorithms in replicating a physically motivated halo-based group finder.

We aim to identify the optimal hyperparameter configurations that allow these algorithms to recover the underlying dark matter halo membership of galaxies, effectively using the halo-based catalog as a benchmark for physical validity.

Loading data files

1.1 Galaxy catalog for Real-SDSS

## [1] "C:\\desarrollo\\astro\\tfm\\data\\sdss\\"

## 'data.frame':    396068 obs. of  9 variables:
##  $ GAL_ID  : int  35940 35943 35951 35953 35955 35956 35959 35962 35978 35979 ...
##  $ ra      : num  241 242 242 243 244 ...
##  $ dec     : num  -0.41692 -0.41969 0.00057 0.00268 0.00241 ...
##  $ x       : num  -0.0146 -0.0226 -0.027 -0.0356 -0.0283 ...
##  $ y       : num  -0.0264 -0.0432 -0.0501 -0.0714 -0.0579 ...
##  $ z       : num  -2.19e-04 -3.57e-04 5.67e-07 3.73e-06 2.71e-06 ...
##  $ redshift: num  0.0304 0.0493 0.0577 0.0813 0.0654 ...
##  $ dist    : num  0.0302 0.0487 0.0569 0.0798 0.0644 ...
##  $ GROUP_ID: int  68415 68416 2231 23723 68418 68419 68420 68422 68423 68424 ...

2 Generic functions for assessments

2.1 Functions to assess outcomes

To assess the outcomes, we base our analysis on three basic concepts:

Purity (P): measure of an output-cluster: the proportion of its members that come from a single true group (the majority group). A high purity indicates that the algorithm groups together members that really belong together.

Completeness (C): measure of an output-cluster: the proportion of the elements of the true group that are included in the output-cluster. A cluster is “complete” if it contains all the points of the original true group.

Recovery (R): a much more restrictive measure: the proportion of output-clusters that are both pure and complete. For this study we consider an output-cluster to be pure if P>=0.66 (at least 2/3 of the elements of the output-cluster belong to a single group) and complete if C>=0.5 (the output-cluster contains at least half of the elements of the original true group).
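The three definitions above can be illustrated on a toy labeling. This is a minimal sketch under one reading of those definitions, not the contents of assess.r; the function name `assess_toy`, its arguments and the toy vectors are hypothetical.

```r
# Toy labels (hypothetical): true group and output-cluster of eight galaxies.
true_group <- c(1, 1, 1, 1, 2, 2, 2, 3)
cluster    <- c(1, 1, 1, 2, 2, 2, 2, 2)

# For each output-cluster: find the majority true group, then compute
# purity, completeness and the pure-and-complete flag.
assess_toy <- function(true_group, cluster, p_min = 0.66, c_min = 0.5) {
  rows <- lapply(sort(unique(cluster)), function(cl) {
    members  <- true_group[cluster == cl]
    counts   <- table(members)
    majority <- as.numeric(names(counts)[which.max(counts)])
    in_both  <- max(counts)                       # members from the majority group
    purity   <- in_both / length(members)
    complet  <- in_both / sum(true_group == majority)
    data.frame(cluster_id = cl, group_id = majority,
               purity = purity, completeness = complet,
               recovered = purity >= p_min && complet >= c_min)
  })
  do.call(rbind, rows)
}

toy_stats <- assess_toy(true_group, cluster)
print(toy_stats)
# Share of output-clusters that are both pure and complete
mean(toy_stats$recovered)
```

In this toy case the first cluster is pure (P = 1) and complete enough (C = 0.75), while the second one covers its whole majority group (C = 1) but is contaminated (P = 0.6), so only one of the two clusters counts as recovered.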

Some other stats are:

total_in_group: number of elements in a given group.

total_in_cluster: number of elements in a given output-cluster.

total_in_cluster_group: number of elements in a given output-cluster belonging to a majority-group.

undetected_groups: original groups not detected as majority-group in output-clusters.

detected_groups: original groups detected as majority-group in output-clusters.

The following code assesses the outcomes obtained from a given output-cluster:

sys.source(sprintf("%s\\assess.r", includes_folder), envir = knitr::knit_global())

The following code calculates the stats:

sys.source(sprintf("%s\\calculate_stats.r", includes_folder), envir = knitr::knit_global())

We generated some functions to help us visualize the results graphically:

sys.source(sprintf("%s\\plotting_functions.r", includes_folder),  envir = knitr::knit_global())

The last file to include contains the function that computes the slos distance:

sys.source(sprintf("%s\\slos.r", includes_folder),  envir = knitr::knit_global())

3 Data Preprocessing

3.1 Descriptive analysis and visualization

The data file goes through an initial preprocessing step to obtain proper distances and Cartesian coordinates x, y, z; that step is omitted here.
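Although the preprocessing itself is not shown, the catalog columns printed earlier are consistent with the standard spherical-to-Cartesian conversion from (ra, dec, dist). The sketch below is an assumption based on that consistency, not the author's preprocessing code; the function name `sph_to_cart` is hypothetical.

```r
# Standard spherical-to-Cartesian conversion; angles are given in degrees.
sph_to_cart <- function(ra_deg, dec_deg, dist) {
  ra  <- ra_deg  * pi / 180
  dec <- dec_deg * pi / 180
  data.frame(x = dist * cos(dec) * cos(ra),
             y = dist * cos(dec) * sin(ra),
             z = dist * sin(dec))
}

# First catalog row as printed in section 1.1 (values there are rounded,
# so the result only approximately matches the x, y, z columns).
first_row <- sph_to_cart(ra_deg = 241, dec_deg = -0.41692, dist = 0.0302)
print(first_row)
```

With these rounded inputs the result lies close to the x, y, z values shown for the first galaxy in the catalog structure, which supports (but does not prove) that this is the conversion used.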

We take a sample bounded in RA, DEC and redshift, and use minPts = 5 as the minimum group size:

\[RA \in [90, 150],\; DEC \in [15, 25],\; \text{and}\; min\_redshift \lt z \lt max\_redshift\]

We then set these boundary values and look at the resulting distribution:

min_members <- 5

ra_lim_inf <- 90
ra_lim_sup <- 150

dec_lim_inf <- 15
dec_lim_sup <-  25

max_redshift <- 0.08
min_redshift <- 0.025

# Take a sample using boundaries
mm <- dt[ dt$ra<= ra_lim_sup & dt$ra>=ra_lim_inf & 
            dt$dec>= dec_lim_inf & dt$dec<=dec_lim_sup &
            dt$redshift< max_redshift & dt$redshift> min_redshift,]


# mm is an object containing both groups and galaxy identification
# Single violin of the redshift distribution (a constant x avoids a continuous x aesthetic)
ggplot(mm, aes(x = "", y = redshift)) + geom_violin()

dim(mm)

## [1] 7328    9

The following queries select the groups with at least min_members members:

h<-sqldf("select 
            count(GAL_ID) as members, 
            GROUP_ID 
          from mm 
          group by GROUP_ID 
          order by  members desc")

mm5<-sqldf(sprintf("
    SELECT 
        mm.GAL_ID,
        mm.x, 
        mm.y, 
        mm.z, 
        mm.GROUP_ID, 
        mm.redshift, 
        mm.dist
      FROM 
        mm as mm, h 
      where 
          mm.GROUP_ID=h.GROUP_ID and 
          h.members >= %s"
      , min_members))

get_elements_in_m5_groups <- function(mm){
    groups_in_mm5<-sqldf(sprintf("
          select
              mm.GAL_ID,
              mm.x, 
              mm.y, 
              mm.z, 
              mm.GROUP_ID, 
              mm.redshift, 
              mm.dist,  
              mm.cluster_id 
          from 
              mm as mm, h 
        where 
            mm.GROUP_ID=h.GROUP_ID and 
              h.members >= %s", min_members))
    groups_in_mm5
}
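For readers unfamiliar with sqldf, the same "groups with at least min_members members" filter can be sketched in base R. The toy catalog and the names `toy`, `toy5` and `min_members_toy` are hypothetical and only serve as an illustration.

```r
# Toy catalog (hypothetical values): 12 galaxies in three groups of sizes 6, 5 and 1.
toy <- data.frame(GAL_ID   = 1:12,
                  GROUP_ID = c(rep(10, 6), rep(20, 5), 30))
min_members_toy <- 5

# Count members per group (the role of table `h` in the SQL above)
sizes <- table(toy$GROUP_ID)

# Keep only galaxies whose group reaches the size threshold (the role of `mm5`)
keep <- as.numeric(names(sizes)[sizes >= min_members_toy])
toy5 <- toy[toy$GROUP_ID %in% keep, ]

print(nrow(toy5))
print(unique(toy5$GROUP_ID))
```

The isolated galaxy in group 30 is dropped, which mirrors how mm5 keeps only galaxies in sufficiently populated groups.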

#tt<-sqldf(c("
#       UPDATE mm
#       set group_id=0
#       WHERE group_id NOT IN (select distinct(group_id) from mm5)
#       ", "select * from main.mm"))

Then use it to find the target data:

true_groups <- length(unique(mm5$GROUP_ID))
number_non_isolated_galaxies <- dim(mm5)[1]
number_isolated_galaxies <- dim(mm)[1] - dim(mm5)[1]
print(
  sprintf('Number of galaxies in groups with more than %s elements %s out of %s, aprox %s percent', 
          min_members, 
          number_non_isolated_galaxies, 
          dim(mm)[1], format(number_non_isolated_galaxies * 100
                                /dim(mm)[1], digits=4)))

## [1] "Number of galaxies in groups with more than 5 elements 1265 out of 7328, aprox 17.26 percent"

print(sprintf("Number of groups with more than %s members: %s", 
              min_members, 
              true_groups))

## [1] "Number of groups with more than 5 members: 95"

We take a look at the groups with at least min_members members:

hhh<- h[h$members>=min_members, ]
ttable <- table(hhh$GROUP_ID, hhh$members)
barplot(ttable, col=('red'), 
        main=sprintf("Group distribution (%s) with at least %s members ", 
                     true_groups, 
                     min_members))

And the distribution of the number of members per group:

boxplot(hhh$members, main="Boxplot of num of members in groups")

We finally update the groups in order to get the final target dataset:

#temp <- suppressWarnings(sqldf(c("
#      update mm set GROUP_ID=0
# where 
#       GROUP_ID not in (select distinct(GROUP_ID) from mm5)", "select * from main.mm")))
#temp

#mm_real <- temp

Let's take a look at the spatial distribution of the complete target sample:

plot3d(mm$x, mm$y, mm$z, col = 'black', 
       size = 1, xlab = "X", ylab = "Y", zlab = "Z")

The groups colored in this space:

aa <- sqldf("select 
          GAL_ID,
          x, 
          y, 
          z, 
            case 
              when group_id IN (Select GROUP_ID from mm5) then group_id
          else 0
            end as group_id, 
          redshift, 
          dist
      from 
            mm")

plot3d(aa$x, aa$y, aa$z, col = aa$group_id+1, 
       size = 2, xlab = "X", ylab = "Y", zlab = "Z")

And the groups with at least min_members members:

plot3d(mm5$x, mm5$y, mm5$z, col = mm5$GROUP_ID, 
       size = 2, xlab = "X", ylab = "Y", zlab = "Z")

4 Raw data processing

We will process the data without any kind of scaling or normalization.

4.1 OPTICS

OPTICS clustering:

points<- mm[,c('x', 'y', 'z')]

#clustering
res <- optics(points, minPts = min_members)
plot(res)

In the previous plot we can see how OPTICS models the valleys (clusters) and the peaks (cluster separations).

Next, we try the extractXi function, which extracts hierarchical clusters, executed with \(\xi=0.15\):

optics <- extractXi(res, xi=0.15)
plot(optics)

4.1.1 Output-clusters visualization

Plot the clustering obtained:

plot3D_cluster(optics, mm)

The grouping obtained is not very accurate.

Stats for this clustering:

mm$cluster_id <- as.numeric(optics$cluster)
mm5 <- get_elements_in_m5_groups(mm)
all <- execute_stats(mm, optics)
print_stats(all)

## [1] "Mean purity 0.491136774752794"
## [1] "Mean completness 0.944621753671659"
## [1] "Sum. recovery  0.735177865612648"
## [1] "Undetected groups 16 out of 95"
## [1] "Detected pure and complete groups 13  out of 95"
## [1] "Detected real groups  86  out of 95"

4.2 DBSCAN

We can directly apply extractDBSCAN on the OPTICS model:

blo_scan <- extractDBSCAN(res, eps_cl = 0.00075)
mm$cluster_id <- blo_scan$cluster
mm5 <- get_elements_in_m5_groups(mm)
plot(blo_scan)

4.2.1 Output-clusters visualization

On the one hand, we have all groups with at least min_members members (made up of a small fraction of the galaxies in the catalog); on the other hand, we have the output-clusters from DBSCAN:

plot3D_cluster(blo_scan, points)

all <- execute_stats(mm, blo_scan)
head(all, 5)

##   cluster_id group_id total_in_group total_in_cluster total_in_cluster_group
## 1          1   229735              1                7                      1
## 2          2    17087              3               20                      3
## 3          9    36542              2                5                      2
## 4        187      442             30               93                     30
## 5          6    16267              3               11                      3
##      purity completn spurious bad_class is_pur is_comp recovery is_real
## 1 0.1428571        1        6         0      0       1        0       0
## 2 0.1500000        1       17         0      0       1        0       0
## 3 0.4000000        1        3         0      0       1        0       0
## 4 0.3225806        1       63         0      0       1        0       1
## 5 0.2727273        1        8         0      0       1        0       0

print_stats(all)

## [1] "Mean purity 0.360264347276887"
## [1] "Mean completness 0.994218055128319"
## [1] "Sum. recovery  0.203162055335968"
## [1] "Undetected groups 35 out of 95"
## [1] "Detected pure and complete groups 13  out of 95"
## [1] "Detected real groups  60  out of 95"

We can now test a range of values in order to find the optimal eps_cl hyper-parameter:

# This is wrapped in a function that takes an eps sequence and an OPTICS result.
eps_sequence_test <- seq(0.0001, 0.0007, 0.0002)
x_stats <- extract_stats_dbscan(eps_sequence_test, res)

## [1] "Extracting DBSCAN stats for epsI=1e-04"
## [1] "Extracting DBSCAN stats for epsI=3e-04"
## [1] "Extracting DBSCAN stats for epsI=5e-04"
## [1] "Extracting DBSCAN stats for epsI=7e-04"

Show the results obtained:

print_global_stats(x_stats, eps_sequence_test)

## [1] "############### DATA FOR eps values ############"
## [1] "#############################################"
## [1] ""
## [1] ""
## [1] "Completeness 0.647568937957248 0.923104626389339 0.959515671074412 0.993282761388133"
## [1] "Purity 0.986190476190476 0.834008194550621 0.533189938946596 0.392136895064743"
## [1] "Groups 70 141 278 292"
## [1] "Clusters 75 145 282 293"
## [1] "EPS 1e-04 3e-04 5e-04 7e-04"
## [1] "True Groups 95"
## [1] "Und. Groups 27 6 11 28"
## [1] "Complete gr.: 55 139 274 292"
## [1] "Pure gr.: 74 115 85 44"
## [1] "P.+ C. gr.: 54 110 80 44"
## [1] "Real groups: 73 93 88 68"
## [1] "Fr: 0.710526315789474 0.753424657534247 0.282685512367491 0.149659863945578"
## [1] "Fp: 0.973684210526316 0.787671232876712 0.300353356890459 0.149659863945578"
## [1] "FC: 0.723684210526316 0.952054794520548 0.968197879858657 0.993197278911565"
## [1] "Spurious: 0.0533333333333333 1.2551724137931 4.72695035460993 10.2354948805461"
## [1] "Bad class: 8.52 1.40689655172414 0.904255319148936 0.0955631399317406"
## [1] "Recovery: 0.439525691699605 0.987351778656126 0.619762845849802 0.256916996047431"

Let's plot completeness and purity:

The optimal point is at \(\epsilon = 0.0003\), where recovery = 98.7%, completeness = 92% and purity = 83%:

seleced_eps<- 0.0003
plot_purity_completeness(
  eps_sequence_test, 
  x_stats$purity_list, 
  x_stats$completeness_list, 
  x_stats$recovery,
  c('Purity', 'Completeness', 'Recovery'),
  "Purity/completeness on raw data",
  seleced_eps, 'eps', 'Percentage'
)

This chart shows that the optimal point is around \(0.0003\). It is at this value that recovery is maximal while completeness and purity both remain high.
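The same choice can be read directly from the sweep output. The sketch below copies the numbers printed by print_global_stats for the raw data into a data frame (named `sweep_raw` here, a hypothetical name) and picks the eps value with the highest recovery; it is only one possible selection rule.

```r
# Values copied from the print_global_stats output above (raw data).
sweep_raw <- data.frame(
  eps          = c(1e-04, 3e-04, 5e-04, 7e-04),
  purity       = c(0.986190476190476, 0.834008194550621,
                   0.533189938946596, 0.392136895064743),
  completeness = c(0.647568937957248, 0.923104626389339,
                   0.959515671074412, 0.993282761388133),
  recovery     = c(0.439525691699605, 0.987351778656126,
                   0.619762845849802, 0.256916996047431),
  und_groups   = c(27, 6, 11, 28)
)

# Selection rule assumed here: maximize recovery.
best_raw <- sweep_raw[which.max(sweep_raw$recovery), ]
print(best_raw)
```

The row selected this way is also the one with the fewest undetected groups, so both criteria agree on the same eps for this sweep.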

plot_purity_completeness(
  eps_sequence_test, 
  x_stats$real_list/true_groups, 
  x_stats$und_gr/true_groups, 
  x_stats$pure_complet/true_groups,
  c('Detected', 'Undetected gr.', 'Pure + Complet.'),
  "Group global % detection stats on raw data",
  seleced_eps , 'eps', 'Percentage'
)

According to the previous chart, the optimal value is \(\epsilon = 3 \times 10^{-4}\), where the number of undetected groups is at its minimum.

plot_purity_completeness(
  eps_sequence_test, 
  x_stats$und_gr,
  x_stats$real_list, 
  true_groups - 0,
  c('Undetected', 'Groups', 'True groups'),
  "Total detection on raw data",
   seleced_eps , 'eps', 'Groups number'
)

Once again the optimal point is at \(\epsilon = 3 \times 10^{-4}\).

5 HDBSCAN data processing

As theory suggests, HDBSCAN does not produce a good model here because of its ability to detect clusters in sparse areas, which causes it to label noise as clusters.

cl <- hdbscan(points, minPts = 5)
length(unique( cl$cluster))

## [1] 415

5.1 Output-clusters visualization

plot3D_cluster(cl, mm)

HDBSCAN does not work well here because it detects clusters in sparser areas, which results in clusters being detected in noise regions.

6 Density Peaks Clustering (DPC)

Alex Rodriguez and Alessandro Laio (2014).

https://github.com/thomasp85/densityClust

Proceeding this way, some clusters appear:

points<- mm[,c('x', 'y', 'z')]
galaxyDens <- densityClust(points)
galaxyClusters <- findClusters(galaxyDens, rho=0.997, delta=0.00086)
mm$cluster_id <- galaxyClusters$cluster
plot(galaxyClusters)
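abline(v = 0.997, lty = 3)    # rho threshold (x axis of the decision plot)
abline(h = 0.00086, lty = 3)  # delta threshold (y axis of the decision plot)

# do not run this: it takes a long time
#plotMDS(galaxyClusters)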

all <- execute_stats(mm, galaxyClusters)
print_stats(all)

## [1] "Mean purity 0.644637534882506"
## [1] "Mean completness 0.991608628109481"
## [1] "Sum. recovery  0.766798418972332"
## [1] "Undetected groups 25 out of 95"
## [1] "Detected pure and complete groups 13  out of 95"
## [1] "Detected real groups  79  out of 95"

DPC is analogous to HDBSCAN: the model does not fit well for the same reason, namely that it detects clusters in noise regions.

DPC may not be good at finding clusters, but it can be useful for finding their centers by locating the density peaks, as we show below:

galaxyClusters <- findClusters(galaxyDens, rho=0.9985, delta=0.00084)
peaks <- mm5[galaxyClusters$peaks,]
print(sprintf("DPC detected %s out of %s clusters", length(unique(peaks$GROUP_ID)), true_groups))

## [1] "DPC detected 20 out of 95 clusters"

6.1 Output-clusters visualization

The R DPC library generates a cluster distribution that we can take a look at:

plot3D_cluster(galaxyClusters, mm5)

DPC is not good at building a clustering model, but it can be good at detecting the peak of each group, for example:

rho = 0.9985
delta = 0.002
galaxyClusters <- findClusters(galaxyDens, rho=rho, delta=delta)
plot(galaxyClusters)
abline(h = delta, lty = 3) 
abline(v = rho, lty = 3)

We can then rebuild mm5 with the new cluster assignments:

update_mm5 <- function() {
mm5<-sqldf(sprintf("
    SELECT 
        mm.GAL_ID,
        mm.x, 
        mm.y, 
        mm.z, 
        mm.GROUP_ID, 
        mm.redshift, 
        mm.dist, mm.cluster_id
      FROM 
        mm as mm, h 
      where 
          mm.GROUP_ID=h.GROUP_ID and 
          h.members >= %s"
                   , min_members))
  return (mm5)
}
mm$cluster_id <- galaxyClusters$cluster
mm5 <- update_mm5()

print(sprintf("Groups %s whereas clusters %s" , length(unique(mm5$GROUP_ID)), length(unique(mm5$cluster_id))))

## [1] "Groups 95 whereas clusters 40"

galaxyClusters <- findClusters(galaxyDens, rho=rho, delta=delta)
mm$cluster_id <- galaxyClusters$cluster

We can keep adjusting rho and delta in order to maximize the number of centers detected:

rho = 0.99912
delta = 0.00035
galaxyClusters <- findClusters(galaxyDens, rho=rho, delta=delta)
plot(galaxyClusters)
abline(h = delta, lty = 3) 
abline(v = rho, lty = 3)

mm$cluster_id <- galaxyClusters$cluster
mm5 <- update_mm5()
print(sprintf("Groups %s whereas clusters %s out of %s" , length(unique(mm5$GROUP_ID)), length(unique(mm5$cluster_id)), length(unique(mm$cluster_id))))

## [1] "Groups 95 whereas clusters 66 out of 143"

The best result so far was obtained with:

rho = 0.9975
delta = 0.0009
"Groups 95 whereas clusters 79" => 0.8315

rhos <- seq(0.9980, 0.999, 0.0001)
deltas <- seq(0.0008, 0.0009, 0.00001)

ll <-  length(deltas)
j<-1
i<-ll

matrix_data=matrix(ncol=length(rhos), nrow = length(deltas)) 
for (rho in rhos){
    for (delta in deltas){
        galaxyClusters <- findClusters(galaxyDens, rho=rho, delta=delta)
        peaks <- mm[galaxyClusters$peaks,]
        l <- length(unique(peaks$GROUP_ID))
        #print(min(l/true_groups, true_groups/l))
        matrix_data[i, j] <- min(l/true_groups, true_groups/l)
        i <- i-1
    }
    j <- j+1
    i<-ll
}


custom_heatmap(matrix_data, deltas, rhos, xTitle = "rho", yTitle = "delta", numColors = 11)

As we can see, all group centers are detected with DPC for \(\rho=0.9986\) and \(\delta=0.00085\). We can obtain them as follows:

rho=0.9986
delta=0.00085
galaxyClusters <- findClusters(galaxyDens, rho=rho, delta=delta)
peaks <- mm[galaxyClusters$peaks,]
l <- length(peaks[peaks$GROUP_ID>0 &  !is.na(peaks$GROUP_ID), c('GROUP_ID')])
min(l/true_groups, true_groups/l)

## [1] 0.9793814

peaks <- na.omit(peaks)

plot3d(peaks$x, peaks$y, peaks$z, col = peaks$GROUP_ID+1, size = 3, xlab = "X", ylab = "Y", zlab = "Z")

As we can see, delta=0.00085 and rho=0.9986 produce the best result, with a score of about 98% in group-center detection. However, we cannot conclude that each detected center represents an original group, so the center detection did not work.
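The score used in the heatmap and printed above is a symmetric ratio between the number of detected peaks and the number of true groups. The helper name `detection_score` below is hypothetical, and the count of 97 peaks is inferred from the printed value (95/97), not reported directly in the output.

```r
# Symmetric ratio: 1 when the counts agree, below 1 for both
# under-detection and over-detection.
detection_score <- function(n_detected, n_true) {
  min(n_detected / n_true, n_true / n_detected)
}

score_97 <- detection_score(97, 95)   # consistent with the printed 0.9793814
score_79 <- detection_score(79, 95)   # the "95 groups, 79 clusters" case
score_eq <- detection_score(95, 95)   # equal counts

print(c(score_97, score_79, score_eq))
```

Because the score only compares counts, a value close to 1 does not guarantee that each detected peak sits in a distinct true group, which is why a high score alone does not validate the center detection.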

7 Normalized data processing

We now scale the data:

points_scaled <- scale(points)
ress <- optics(points_scaled, minPts = min_members)
#optimal value obtained
blo_scans <- extractDBSCAN(ress, eps_cl = 0.025)
mm$cluster_id <- blo_scans$cluster
mm5 <- get_elements_in_m5_groups(mm)

Again, we run the same eps sweep on the scaled data:

eps_sequence_test <- seq(0.02, 0.035, 0.005)
x_stats <- extract_stats_dbscan(eps_sequence_test, ress)

## [1] "Extracting DBSCAN stats for epsI=0.02"
## [1] "Extracting DBSCAN stats for epsI=0.025"
## [1] "Extracting DBSCAN stats for epsI=0.03"
## [1] "Extracting DBSCAN stats for epsI=0.035"

print_global_stats(x_stats, eps_sequence_test)

## [1] "############### DATA FOR eps values ############"
## [1] "#############################################"
## [1] ""
## [1] ""
## [1] "Completeness 0.81009992021196 0.882137319917698 0.922845099254538 0.946536585242532"
## [1] "Purity 0.950413529297024 0.905999087377366 0.821432134640155 0.735867161290582"
## [1] "Groups 97 115 134 166"
## [1] "Clusters 103 118 136 168"
## [1] "EPS 0.02 0.025 0.03 0.035"
## [1] "True Groups 95"
## [1] "Und. Groups 9 7 8 7"
## [1] "Complete gr.: 91 112 131 164"
## [1] "Pure gr.: 97 104 101 105"
## [1] "P.+ C. gr.: 85 98 96 101"
## [1] "Real groups: 92 91 89 90"
## [1] "Fr: 0.817307692307692 0.823529411764706 0.700729927007299 0.597633136094675"
## [1] "Fp: 0.932692307692308 0.873949579831933 0.737226277372263 0.621301775147929"
## [1] "FC: 0.875 0.941176470588235 0.956204379562044 0.970414201183432"
## [1] "Spurious: 0.300970873786408 0.627118644067797 1.30147058823529 1.91666666666667"
## [1] "Bad class: 5.59223300970874 1.98305084745763 1.22794117647059 0.880952380952381"
## [1] "Recovery: 0.904347826086957 0.962845849802372 0.954940711462451 0.961264822134387"

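Results look somewhat better when all variables are scaled to mean=0, sd=1: at the selected eps, purity rises from 0.83 to 0.91, although completeness and recovery are slightly lower than on the raw data.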
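As a reminder of what scale() does to each coordinate, the sketch below applies it to synthetic points (hypothetical values chosen only to have different spreads per axis) and checks that every column ends up with mean 0 and standard deviation 1. This also explains why the eps values explored on scaled data are on a different numeric scale than on raw data.

```r
set.seed(1)
# Synthetic coordinates (hypothetical) with a different spread on each axis
toy_points <- data.frame(x = rnorm(100, mean = -0.03, sd = 0.010),
                         y = rnorm(100, mean = -0.05, sd = 0.020),
                         z = rnorm(100, mean =  0.00, sd = 0.001))

# Center each column to mean 0 and divide by its standard deviation
toy_scaled <- scale(toy_points)

col_means <- colMeans(toy_scaled)
col_sds   <- apply(toy_scaled, 2, sd)
print(round(col_means, 10))
print(col_sds)

# The original per-column standard deviations are kept as an attribute
print(attr(toy_scaled, "scaled:scale"))
```

Note that scaling each axis independently changes relative distances between axes, so distances in the scaled space are no longer expressed in the catalog's units.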

The same plots as before:

seleced_eps <- 0.025
plot_purity_completeness(
  eps_sequence_test, 
  x_stats$purity_list, 
  x_stats$completeness_list, 
  x_stats$recovery,
  c('Purity', 'Completeness', 'Recovery'),
  "Purity/completeness on scaled data",
  seleced_eps, 'eps', 'Percentage'
)

plot_purity_completeness(
  eps_sequence_test, 
  x_stats$und_gr,
  x_stats$real_list, 
  true_groups - 0,
  c('Undetected', 'Groups', 'True groups'),
  "Total detection on scaled data",
   seleced_eps , 'eps', 'Groups number'
)

7.1 Elbow method

We have set minPts = min_members, which is a reasonable value for what we interpret as a group or cluster.

Following DBSCAN theory, we can look for the elbow in the k-nearest-neighbor distance plot:

kNNdistplot(x = points_scaled, k = min_members)
abline(h = seleced_eps, lty = 3)

As the reader can see, the “elbow method” does not apply here. The reason is that the elbow method tries to optimize the variance within clusters, whereas we are trying to detect virialized galaxy groups, and the algorithms are not intrinsically designed to account for the specific density profiles of dark matter halos.
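To make the curve behind the elbow heuristic concrete, the following base R sketch computes the sorted distance to the k-th nearest neighbor for a small synthetic point set. It is an illustration of the idea only, not the implementation used by kNNdistplot; the function name `knn_dist_sorted` and the toy data are hypothetical.

```r
# Sorted distance from each point to its k-th nearest neighbor.
knn_dist_sorted <- function(pts, k) {
  d <- as.matrix(dist(pts))     # full pairwise Euclidean distances
  diag(d) <- Inf                # exclude each point from its own neighbors
  kth <- apply(d, 1, function(row) sort(row)[k])
  sort(kth)
}

set.seed(42)
# One compact blob (20 points) plus uniform background noise (10 points)
toy_pts <- rbind(matrix(rnorm(60, mean = 0, sd = 0.05), ncol = 3),
                 matrix(runif(30, min = -1, max = 1), ncol = 3))

knn_curve <- knn_dist_sorted(toy_pts, k = 5)
print(round(knn_curve, 3))
# plot(knn_curve, type = "l") would draw the curve in which an elbow is sought
```

The full distance matrix makes this approach quadratic in memory, so it is only suitable for small samples; a sharp elbow appears when dense clusters and background are well separated, which need not be the case for galaxy groups.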

7.2 Detected/Undetected original groups

The elements whose groups were not detected are:

blo_scans <- extractDBSCAN(ress, eps_cl = seleced_eps)
mm$cluster_id <- blo_scans$cluster

all <- execute_stats(mm, blo_scans)

undetected <- get_elements_not_in_groups(mm5, all)
detected <- get_elements_in_groups(mm, all)

print(sprintf('Undetected groups: %s out of %s', 
              length(unique(undetected$GROUP_ID)), true_groups))

## [1] "Undetected groups: 7 out of 95"

print(sprintf('Detected groups: %s out of %s', 
              length(unique(detected$GROUP_ID)), true_groups))

## [1] "Detected groups: 115 out of 95"

Plot detected groups

plot3d(detected$x, detected$y, detected$z, 
       col = detected$cluster_id +1, size = 2, 
       xlab = "X", ylab = "Y", zlab = "Z")

Plot undetected groups

plot3d(undetected$x, undetected$y, undetected$z, 
       col = undetected$GROUP_ID, size = 2, 
       xlab = "X", ylab = "Y", zlab = "Z")

9 Summary and conclusions

The best results obtained with each method (P: purity, C: completeness, R: recovery, U: undetected groups; values in parentheses are the Fp, FC and Fr fractions reported above):

Method	Data Sample	Outcomes	Conclusion
OPTICS	Non-scaled	-	Good cluster reachability plot
OPTICS Xi hierarchical method	Non-scaled	Not applicable	No valuable results were found.
DBSCAN	Non-scaled	P: 0.83 (79%) C: 0.92 (95%) R: 0.99 (75%) U: 6	Optimal point for eps=0.0003
HDBSCAN	Non-scaled	No valuable results.	Not good at cluster detection.
DPC	Non-scaled	100% in group-center detection	Good at group-center detection, but detected centers do not match original groups. \(\delta=0.00085\) and \(\rho=0.9986\)
OPTICS	Scaled	-	Good cluster reachability plot
DBSCAN	Scaled	P: 0.91 (87%) C: 0.88 (94%) R: 0.96 (82%) U: 7	eps=0.025

10 Executive summary

In conclusion, for this study we took a sample from the SDSS-DR7 Re-real space catalog and selected 7328 objects within an area between 90° and 150° in RA and between 15° and 25° in DEC. We tried several group-detection methods, selecting groups with a minimum of 5 elements (galaxies), which gave 95 groups.

We highlight the following points:

The sOPTICS method with DBSCAN reached the highest values in completeness, purity and recovery, and also in detecting groups.
The DPC algorithm worked perfectly in group-center detection (it reached 100%), but the detected centers do not match the original groups.
Normalizing or scaling the data did not substantially improve the results. This is because all the data are in fact on approximately the same scale: the coordinates x, y, z and dist are all expressed in the same units and span a similar range of values.

Finally, we compare the original groups with the best clustering obtained in this study:

Actual groups distribution in Re-Real Space	Best cluster detection with DBSCAN.

Estudios de Informática, Multimedia y Telecomunicaciones

TFM-Density algorithms applied to SDSS Re-real space

Carlos Toro Peñas

2025-12-15