r - Clustering with PAM not isolating clusters, but t-SNE shows well-formed clusters


I am building a clustering algorithm for use on data I have not yet seen, so I'm using pseudo data in the meantime. The results from PAM show that I do not have isolated clusters, but a ggplot of the t-SNE embedding shows that I have well-formed clusters. I suspect this is due to the fake data. Does anyone have thoughts on why this would be?

Here is the data. Please note that age and howold represent different things:

library(dplyr)
library(cluster)
library(Rtsne)
library(ggplot2)

set.seed(1987)
n <- 350

# Simulated data; every column is sampled independently of the others
clust_dat <- data.frame(
  personid    = 1:n,
  networkpref = sample(c("topic", "jobtitle", "orgtype"),
                       size = n, replace = TRUE,
                       prob = c(0.56, 0.20, 0.24)),
  age         = sample(23:65, size = n, replace = TRUE),
  familyimp   = sample(c(1, 2, 3, 4, 5), size = n, replace = TRUE,
                       prob = c(0.02, 0.01, 0.10, 0.4, 0.83)),
  howold      = sample(25:30, size = n, replace = TRUE,
                       prob = c(.40, .30, .20, .05, .03, .02)),
  horror      = sample(c("yes", "no"), size = n, replace = TRUE,
                       prob = c(0.27, 0.73)),
  sailboat    = sample(c("yes", "no"), size = n, replace = TRUE,
                       prob = c(0.58, 0.42)),
  # daisy() expects factors, not character columns (R >= 4.0 no longer converts by default)
  stringsAsFactors = TRUE
)

Here is the model I build, after first defining the levels of the ordinal variable:

# familyimp is ordinal, so declare its levels in order
clust_dat$familyimp <- factor(clust_dat$familyimp,
                              levels = c("1", "2", "3", "4", "5"),
                              ordered = TRUE)

# Gower dissimilarity handles mixed variable types; drop the ID column
gower_dist <- daisy(clust_dat[, -1], metric = "gower")
gower_matrix <- as.matrix(gower_dist)

# Find the average silhouette width for many PAM models
sil_width <- c(NA)
for (i in 2:ceiling(nrow(clust_dat) / 9)) {
  pam_fit <- pam(gower_dist,
                 diss = TRUE,
                 k = i)
  sil_width[i] <- pam_fit$silinfo$avg.width
}

# Build the PAM model with the best silhouette width
pam_fit <- pam(gower_dist, diss = TRUE, k = which.max(sil_width))

When I get the isolation info from PAM, I get:

pam_fit$isolation
 1  2  3  4  5  6  7  8  9 10 11 12
no no no no no no no no no no no no
Levels: no L L*
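For context on what that output encodes: per the `pam.object` documentation in the `cluster` package, a cluster is an L*-cluster if and only if its diameter is smaller than its separation, and both quantities are reported in `clusinfo`. The sketch below checks that criterion by hand on hypothetical toy data (two tight blobs placed far apart, not the data from this question); the names `toy`, `toy_fit`, `info` and `is_lstar` are illustrative only.

```r
library(cluster)

set.seed(1)
# Hypothetical toy data: two tight 2-D blobs, centred near 0 and near 100
toy <- rbind(matrix(rnorm(40, mean = 0),   ncol = 2),
             matrix(rnorm(40, mean = 100), ncol = 2))
toy_fit <- pam(toy, k = 2)

# clusinfo has one row per cluster, including "diameter" and "separation"
info <- toy_fit$clusinfo

# L*-cluster criterion: diameter < separation
is_lstar <- info[, "diameter"] < info[, "separation"]

print(info[, c("diameter", "separation")])
print(toy_fit$isolation)
```

Running the same comparison on `pam_fit$clusinfo` would show how far each of the twelve clusters is from meeting the criterion.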

But plotting shows some well-formed clusters:

# Embed the Gower dissimilarities in two dimensions
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)

tsne_data <-
  tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("x", "y")) %>%
  mutate(cluster = factor(pam_fit$clustering),
         name = clust_dat$personid)

# Colour each point by its PAM cluster
ggplot(tsne_data, aes(x = x, y = y)) +
  geom_point(aes(color = cluster))

Any ideas? If I remove the continuous variables, the non-defined clusters are recognized as isolated...

The way you generate your data, it should not have clusters beyond an artifact of the categorical labels used. Based on the frequencies used, I'd have to expect 8 "clusters" corresponding to trivial combinations of the attributes.
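A rough way to check whether the PAM clusters are just label combinations is to cross-tabulate the cluster assignment against the combination of the nominal columns. This is only a sketch: it rebuilds the data with the recipe from the question, fixes `k = 12` to match the twelve clusters in the `isolation` output above, and the names `dat`, `combo`, `tab` and `purity` are hypothetical. The exact numbers depend on the R version's sampling, so no result is claimed here.

```r
library(cluster)

set.seed(1987)
n <- 350
yn <- c("yes", "no")

# Same sampling recipe as in the question, with factors built explicitly
dat <- data.frame(
  networkpref = factor(sample(c("topic", "jobtitle", "orgtype"), n,
                              replace = TRUE, prob = c(0.56, 0.20, 0.24))),
  age         = sample(23:65, n, replace = TRUE),
  familyimp   = factor(sample(1:5, n, replace = TRUE,
                              prob = c(0.02, 0.01, 0.10, 0.4, 0.83)),
                       levels = 1:5, ordered = TRUE),
  howold      = sample(25:30, n, replace = TRUE,
                       prob = c(.40, .30, .20, .05, .03, .02)),
  horror      = factor(sample(yn, n, replace = TRUE, prob = c(0.27, 0.73))),
  sailboat    = factor(sample(yn, n, replace = TRUE, prob = c(0.58, 0.42)))
)

# One label per observed combination of the three nominal columns
# (up to 3 * 2 * 2 = 12 combinations are possible)
combo <- interaction(dat$networkpref, dat$horror, dat$sailboat, drop = TRUE)

# Assumption: k = 12, as in the isolation output of the question
fit <- pam(daisy(dat, metric = "gower"), diss = TRUE, k = 12)

# Rows are PAM clusters, columns are label combinations
tab <- table(cluster = fit$clustering, combo = combo)

# Share of points that fall in their cluster's dominant combination
purity <- sum(apply(tab, 1, max)) / n

print(tab)
print(purity)
```

A `purity` close to 1 would suggest that the clusters mostly reproduce the categorical labels rather than any structure in the continuous columns.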

If you generate i.i.d. data, it is not supposed to cluster!
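As a small illustration of that point, the sketch below runs PAM on purely i.i.d. Gaussian noise (hypothetical data, not the data from the question) and records the average silhouette width for several values of `k`. PAM always returns `k` clusters, but the silhouette widths should stay low, which is commonly read as weak or absent structure. The names `iid`, `ks` and `avg_sil` are illustrative only.

```r
library(cluster)

set.seed(42)
n <- 300

# Hypothetical i.i.d. data: 5 independent standard-normal columns
iid <- matrix(rnorm(n * 5), ncol = 5)
d <- dist(iid)

# Average silhouette width of the PAM solution for each k
ks <- 2:8
avg_sil <- sapply(ks, function(k) pam(d, diss = TRUE, k = k)$silinfo$avg.width)
names(avg_sil) <- ks

print(round(avg_sil, 3))
```

Picking `which.max(avg_sil)` here still yields some "best" `k`, even though there is nothing to find, so a maximum on its own does not indicate real clusters.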

So I'd rather assume it is a visualization problem.

See, e.g., this answer on the problems of "seeing" clusters in t-SNE.
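One way to probe this is to embed data that is known to have no clusters and vary the perplexity. The sketch below uses hypothetical i.i.d. uniform data and three arbitrarily chosen perplexity values; `iid`, `perplexities` and `embeddings` are illustrative names. If apparent groups come and go as the perplexity changes, that is consistent with them being artifacts of the embedding rather than structure in the data.

```r
library(Rtsne)

set.seed(42)
n <- 300

# Hypothetical structureless data: 5 independent uniform columns
iid <- matrix(runif(n * 5), ncol = 5)

# Rtsne requires 3 * perplexity < nrow(X) - 1, so 90 is valid for n = 300
perplexities <- c(5, 30, 90)
embeddings <- lapply(perplexities, function(p) {
  Rtsne(iid, dims = 2, perplexity = p)$Y
})

# Plot the three embeddings side by side for a visual comparison
op <- par(mfrow = c(1, 3))
for (j in seq_along(perplexities)) {
  plot(embeddings[[j]],
       main = paste("perplexity =", perplexities[j]),
       xlab = "x", ylab = "y")
}
par(op)
```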

