Available via |
http://dbpubs.stanford.edu/pub/2008-25 |
|
Submitted on |
5th of July 2008 |
|
Author |
Agrawal, Parag; Kifer, Daniel; Olston, Christopher |
|
Title |
Scheduling Shared Scans of Large Data Files |
|
Date of publication |
2008 |
|
Published in |
VLDB 2008 |
|
Citation |
Agrawal, Parag; Kifer, Daniel; Olston, Christopher. Scheduling Shared Scans of Large Data Files, VLDB 2008 |
|
Number of pages |
12 |
|
Language |
English |
|
Project |
Miscellaneous |
|
Type |
Conference or Journal Paper |
|
Subject group |
Miscellaneous |
|
Abstract |
We study how best to schedule scans of large data files, in the presence of many simultaneous requests to a common set of files. The objective is to maximize the overall rate of processing these files, by sharing scans of the same file as aggressively as possible, without imposing undue wait time on individual jobs.
This scheduling problem arises in batch data processing environments such as Map-Reduce systems, some of which handle tens of thousands of processing requests daily, over a shared set of files.
As we demonstrate, conventional scheduling techniques such as shortest-job-first do not perform well in the presence of cross-job sharing opportunities. We derive a new family of scheduling policies specifically targeted to sharable workloads.
Our scheduling policies revolve around the notion that, all else being equal, it is good to schedule nonsharable scans ahead of ones that can share IO work with future jobs, if the arrival rate of sharable future jobs is expected to be high.
We evaluate our policies via simulation over varied synthetic and real workloads, and demonstrate significant performance gains compared with conventional scheduling approaches. |
|
Contact address |
paraga@cs.stanford.edu |
|
Fulltext source |
PDF (pdf, pdf.gz, pdf.zip) |
|
Management of the document by |
siroker@db.stanford.edu |