Noisy, Greedy and Not so Greedy k-Means++

Media type: Text; E-Article; Electronic Conference Proceeding

Title: Noisy, Greedy and Not so Greedy k-Means++

Contributor: Bhattacharya, Anup [Author]; Eube, Jan [Author]; Röglin, Heiko [Author]; Schmidt, Melanie [Author]

Published: Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020

Language: English

DOI: https://doi.org/10.4230/LIPIcs.ESA.2020.18

Keywords: k-means++ ; adaptive sampling ; greedy

Origination:

Footnote: Diese Datenquelle enthält auch Bestandsnachweise, die nicht zu einem Volltext führen.

Description: The k-means++ algorithm due to Arthur and Vassilvitskii [David Arthur and Sergei Vassilvitskii, 2007] has become the most popular seeding method for Lloyd’s algorithm. It samples the first center uniformly at random from the data set and the other k-1 centers iteratively according to D²-sampling, i.e., the probability that a data point becomes the next center is proportional to its squared distance to the closest center chosen so far. k-means++ is known to achieve an approximation factor of 𝒪(log k) in expectation. Already in the original paper on k-means++, Arthur and Vassilvitskii suggested a variation called greedy k-means++ algorithm in which in each iteration multiple possible centers are sampled according to D²-sampling and only the one that decreases the objective the most is chosen as a center for that iteration. It is stated as an open question whether this also leads to an 𝒪(log k)-approximation (or even better). We show that this is not the case by presenting a family of instances on which greedy k-means++ yields only an Ω(𝓁⋅log k)-approximation in expectation where 𝓁 is the number of possible centers that are sampled in each iteration. Inspired by the negative results, we study a variation of greedy k-means++ which we call noisy k-means++ algorithm. In this variation only one center is sampled in every iteration but not exactly by D²-sampling. Instead in each iteration an adversary is allowed to change the probabilities arising from D²-sampling individually for each point by a factor between 1-ε₁ and 1+ε₂ for parameters ε₁ ∈ [0,1) and ε₂ ≥ 0. We prove that noisy k-means++ computes an 𝒪(log² k)-approximation in expectation. We use the analysis of noisy k-means++ to design a moderately greedy k-means++ algorithm.

Access State: Open Access

Search in field:

Recently searched for: