Comparison of Different Clustering Algorithms for Diagnosing Memory-Related Performance Issues Using a Distributed Computing System

Postgraduate Thesis uoadl:1337687

Unit:
Track / specialization: Information and Data Management (ΔΕΔ)
Informatics
Deposit date:
2017-03-29
Year:
2017
Author:
Poulakakis Stylianos
Supervisors info:
Efstathios Hadjiefthymiades (Ευστάθιος Χατζηευθυμιάδης), Assistant Professor, Department of Informatics and Telecommunications, National and Kapodistrian University of Athens (NKUA)
Original Title:
Comparison of Different Clustering Algorithms for Diagnosing Memory-Related Performance Issues Using a Distributed Computing System
Languages:
Greek
Translated title:
Comparison of Different Clustering Algorithms for Diagnosing Memory-Related Performance Issues Using a Distributed Computing System
Summary:
Failures in widely used systems operated by technology giants illustrate that load testing is a necessary step in assuring the quality of software systems. However, diagnosing memory-related issues remains a major challenge for developers. To address such issues, developers often apply automated analysis techniques that still require considerable manual effort and a high degree of system knowledge. One solution to this problem is to apply machine learning techniques to diagnose abnormal system behavior. Mark D. Syer et al. propose an automated approach that combines performance counters and execution logs and uses hierarchical clustering to group the data. This grouping, however, fails on large data sets because of the greater complexity it incurs. We take a different approach to Syer's algorithm by using the Spark framework, which offers parallel processing. Based on a previous corporate implementation of the algorithm, we apply the k-means algorithm in the clustering phase instead of hierarchical clustering. The goal is to evaluate the behavior of the two algorithms on large data sets and to validate k-means as part of the overall Syer approach. Our case studies use performance counters and execution logs from two systems. For the evaluation, we use synthetic data from a program created by Software Competitiveness International and real data from a deployment of Apache Tomcat into which a memory spike was injected. Our approach identifies memory spikes and their corresponding log lines with a high degree of precision. It also detects, with fair accuracy, the number of individual memory spikes or the clusters containing them. Finally, on large data sets, the k-means algorithm performs better than hierarchical clustering in terms of execution time and performance.
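The clustering step described in the summary can be illustrated with a minimal, single-machine sketch. It is not the thesis's Spark implementation: the function name `kmeans_1d`, the sample memory readings, and the choice of two clusters with min/max initialisation are all assumptions made for illustration. It groups memory-counter samples with k-means and treats the cluster with the highest centroid as the candidate memory spike.

```python
from typing import List, Tuple


def kmeans_1d(values: List[float], k: int = 2, iterations: int = 20) -> Tuple[List[float], List[int]]:
    """Plain k-means on one-dimensional data (hypothetical helper, not from the thesis)."""
    if k < 2 or len(values) < k:
        raise ValueError("need k >= 2 and at least k values")
    # Deterministic initialisation (an assumption): spread centroids evenly between min and max.
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


# Made-up memory readings in MB, one per sampling interval (illustrative data only).
memory_mb = [510.0, 512.0, 515.0, 508.0, 1490.0, 1510.0, 511.0, 509.0]
centroids, labels = kmeans_1d(memory_mb, k=2)

# The cluster with the largest centroid is treated as the candidate spike cluster.
spike_cluster = max(range(len(centroids)), key=lambda c: centroids[c])
spike_indices = [i for i, label in enumerate(labels) if label == spike_cluster]
print(spike_indices)
```

In a pipeline like the one the summary describes, the flagged sampling intervals would then be matched against the execution-log lines recorded in the same intervals; that matching step is not shown here.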
Main subject category:
Technology - Computer science
Keywords:
Load Testing, Performance Counters, Execution Logs, Distributed Computing Systems, Cluster Analysis
Index:
Yes
Number of index pages:
6
Contains images:
Yes
Number of references:
97
Number of pages:
127
Diplomatiki_Final.pdf (2 MB)