Open Access System for Information Sharing

Department of Computer Science & Engineering (컴퓨터공학과) 3. Theses_Ph.D.

Thesis


Scalable Execution of Massive Number of Small Queries in Spark

Title: Scalable Execution of Massive Number of Small Queries in Spark

Authors: 박연수

Date Issued: 2024

Publisher: Pohang University of Science and Technology (포항공과대학교)

Abstract: The Spark big data processing platform is heavily used in today's IT services for critical applications such as machine learning for service recommendation and the analysis of massive volumes of raw sales data. Spark is designed to deliver high performance by enabling a high degree of parallelism when processing heavyweight queries that apply homogeneous operations to large data. However, workloads made up of small, short-running queries coming from various sources have been observed to be becoming dominant in practice. The current Spark architecture is ill-suited to processing such workloads optimally, because each small query incurs excessive I/O relative to the small amount of computation it performs. To improve the handling of a massive number of small queries in Spark, we hypothesize that the most effective fix is to "merge a massive number of small queries into several big queries", so that the workload conforms to what Spark was designed for and exploits its strong point: distributed parallel processing of large datasets. To verify this hypothesis, this dissertation proposes two techniques, QaaD and QBatcher. First, we propose QaaD, which executes a merged query in Spark and addresses the problem fundamentally by applying i) transparent conversion of a workload of small queries into one large query as a batch and ii) dynamic partition size adjustment to minimize runtime overhead. To support this query-merging design, we introduce a new abstraction (microRDD), the embedding of queries as part of the data, and opportunistic sharing of common input data among queries. Second, we propose QBatcher, which constructs cost-efficient batches of queries for QaaD. By grouping queries into batches by key attributes and reordering the batches, it reduces the wait time of queries that complete earlier within a batch. A comprehensive evaluation using real-world data shows that QaaD coupled with QBatcher delivers speed-ups of tens of times over standard Spark execution for small-query workloads.
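
The query-merging idea in the abstract can be illustrated with a minimal sketch in plain Python. This is not the thesis's QaaD implementation and does not use Spark or microRDD; the names `SmallQuery`, `run_separately`, and `run_merged` and the sample data are hypothetical, chosen only to show how many small filter-and-sum queries over a shared input can be answered in one pass, with each result tagged by its query ID.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SmallQuery:
    """Hypothetical small query: filter rows of one dataset, then sum a value."""
    query_id: str
    dataset: str                        # name of the input dataset the query reads
    predicate: Callable[[dict], bool]   # row filter
    value: Callable[[dict], float]      # value summed over matching rows


def run_separately(queries: List[SmallQuery],
                   datasets: Dict[str, List[dict]]) -> Dict[str, float]:
    """Baseline: every query scans its input dataset on its own."""
    return {
        q.query_id: sum((q.value(r) for r in datasets[q.dataset] if q.predicate(r)), 0.0)
        for q in queries
    }


def run_merged(queries: List[SmallQuery],
               datasets: Dict[str, List[dict]]) -> Dict[str, float]:
    """Merged: queries reading the same dataset share a single scan of it."""
    by_dataset = defaultdict(list)
    for q in queries:
        by_dataset[q.dataset].append(q)

    totals = {q.query_id: 0.0 for q in queries}
    for name, group in by_dataset.items():
        for row in datasets[name]:      # one shared pass over the common input
            for q in group:             # results stay separated by query ID
                if q.predicate(row):
                    totals[q.query_id] += q.value(row)
    return totals


# Hypothetical sample data and queries.
datasets = {"sales": [
    {"store": "A", "amount": 10.0},
    {"store": "B", "amount": 5.5},
    {"store": "A", "amount": 2.5},
]}
queries = [
    SmallQuery("q1", "sales", lambda r: r["store"] == "A", lambda r: r["amount"]),
    SmallQuery("q2", "sales", lambda r: r["amount"] > 5, lambda r: r["amount"]),
]
merged_results = run_merged(queries, datasets)
```

Both functions return the same per-query totals; the merged version reads each dataset once regardless of how many queries reference it, which is the kind of shared-input saving the abstract describes at a much larger scale.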

URI: http://postech.dcollection.net/common/orgView/200000732986
https://oasis.postech.ac.kr/handle/2014.oak/123325

Article Type: Thesis

Files in This Item: There are no files associated with this item.
