CloudGC: A Scalable Algorithm for Complex Graph Clustering


In this project, we study the problem of complex graph clustering. Partitioning a large graph is particularly challenging since an adjacency matrix of a graph may have trillions of entries, which exceeds the memory and storage size of a conventional PC. The cloud computing paradigm provides a natural solution to such a large scale problem. To address the difficulty of complex graph clustering, we propose a scalable MapReduce clustering algorithm, which is capable of handling a large graph; we implement our algorithm on a cloud. Experimental results demonstrate that our algorithm can cluster a Twitter social graph, which contains 41.7 million nodes, 1.47 billion links in just 8 hours. Our clustering results reveal interesting properties of the Twitter social network, such as small world property and heavy tail distribution of cluster sizes. Hence, our proposed scalable MapReduce clustering algorithm provides a valuable tool for social network analysis, biological network analysis, etc., and may have a profound impact on research in network science.


  • Clustering tweets streams according to topics
  • Clustering searched tweets according to topics
  • Recommend tweets to read
  • Current trends, content prefetching

System Architecture

Clustering Algorithm

Label Propagation

  • Refer to: Usha Nandini Raghavan, Réka Albert, and Soundar Kumara, Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. (2007)
  • Uses only the network structure to guide the community detection.
  • Does not optimize any global measure or function.
  • Each node finds a neighboring community of largest size to join until the system reaches equilibrium.

Evaluate the Correctness

Real Applications

Performing large scale community detection for Twitter users

  • Using Label Propagation to determine communities.
  • Twitter user network containing 40 million users and 1.4 billion connections.

Clustering tweets streams or searches tweets according to topics

  • Using Label Propagation to determine number of topics.
  • Using Spectrum Clustering to perform clustering.