Open Access System for Information Sharing

Department of Computer Science & Engineering (컴퓨터공학과) 3. Theses_Ph.D.

Thesis

Scalable High-dimensional Index Design for Code Search Systems

Title: Scalable High-dimensional Index Design for Code Search Systems

Authors: Mu-Woong Lee

Date Issued: 2012

Publisher: Pohang University of Science and Technology (포항공과대학교)

Abstract: This research addresses the problem of supporting scalable code similarity search systems for large-scale software repositories. Commercial code search engines are available, but they treat software as text and often fail to find semantically related code. Existing tools for semantic code clone search, meanwhile, take a “post-mortem” approach: they detect clones only “after” code development is completed and hence fail to return results instantly. In clear contrast, the goal of this research is to combine the strengths of these two lines of existing research.

To achieve this goal, an indexing structure on vector abstractions of code is proposed. This index uses dimension reduction techniques to deal efficiently with the vector abstractions, which are naturally high-dimensional. The search system is then integrated into real-world development sessions. Such integration suggests that, by posing every code segment as a query to the software code corpus, developers can instantly reference relevant code segments at the time of generation to enhance productivity.

This integration scenario creates the need for efficient similarity searches that meet two requirements. First, a developer session translates into a sequence of evolving queries, which must be supported efficiently. Second, the quality of the results must be controlled; for example, dealing with licenses requires that there be no false negatives. To satisfy these requirements, a workload-aware striping framework for high-dimensional evolving queries is proposed. This framework can be used to boost most existing high-dimensional indexes.

In addition, to further enhance the scalability of code search systems, a workload-balancing distributed indexing structure is proposed. Existing efforts in distributed indexing have aimed to localize each query to data residing at a small number of nodes (i.e., locality-preserving indexing) in order to minimize communication cost. However, because workloads often correlate with data locality, such indexing often generates hotspots. Hence, workload balancing is proposed as an optimization goal, and a distributed index that evenly distributes the workload is presented.
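The abstract does not say which dimension reduction technique the index uses, so the following is only a minimal sketch of the general idea, not the thesis's method. It assumes a hypothetical reduction that keeps the first `k` coordinates of each vector. Dropping coordinates can only shrink a Euclidean distance, so filtering in the reduced space never discards a true match (no false negatives); a refinement step then removes the false positives. All names (`truncate`, `range_search`, `brute_force`) and the random corpus are illustrative assumptions.

```python
import math
import random


def truncate(vec, k):
    """Hypothetical reduction: keep only the first k coordinates."""
    return vec[:k]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def range_search(corpus, query, radius, k):
    """Filter-and-refine range search over a list of vectors.

    Filter: compare k-dimensional prefixes only. The reduced distance is
    never larger than the full distance, so every true match survives.
    Refine: check the surviving candidates against the full vectors.
    A real index would store the reduced vectors instead of recomputing them.
    """
    reduced_query = truncate(query, k)
    candidates = [
        i for i, vec in enumerate(corpus)
        if euclidean(truncate(vec, k), reduced_query) <= radius
    ]
    return [i for i in candidates if euclidean(corpus[i], query) <= radius]


def brute_force(corpus, query, radius):
    """Reference answer: scan every full-dimensional vector."""
    return [i for i, vec in enumerate(corpus) if euclidean(vec, query) <= radius]


# Synthetic stand-in for vector abstractions of code segments (assumption).
rng = random.Random(0)
corpus = [[rng.randint(0, 5) for _ in range(32)] for _ in range(500)]
query = corpus[0]

matches = range_search(corpus, query, radius=12.0, k=8)
```

Because the filter only ever over-approximates the answer set, `range_search` returns exactly what `brute_force` returns for any radius and any `k`; the choice of `k` affects only how many candidates reach the refinement step.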

URI: http://postech.dcollection.net/jsp/common/DcLoOrgPer.jsp?sItemId=000001385016
https://oasis.postech.ac.kr/handle/2014.oak/1609

Article Type: Thesis

Files in This Item: There are no files associated with this item.
