SIGMOD 2011: A memory efficient reachability data structure through bit vector compression (PWAHStackTC)
11 Oct 2011Back in 2010, I worked on a research project at the University of Oxford with Prof. Oege de Moor. The project’s aim was to efficiently compute, store, and answer reachability queries on large directed graphs. This work eventually led to a paper at ACM SIGMOD 2011, one of the largest international conferences on management of data. In this post, I will endeavour to convey the intuition behind our approach: bit vector compression using a specially designed scheme called PWAH.
The SIGMOD paper [VanSchaik2011] can be downloaded from the ACM digital library (provided that you have access to that). The DOI is 10.1145/1989323.1989419. The MSc. thesis on which this paper is based can be downloaded from the references [VanSchaik2010]. A copy of the source code of our PWAH implementation and the implementation of Nuutila’s algorithm is available on GitHub: github.com/sj/pwahstacktc.
Some background: Tarjan’s algorithm
In 1972, Robert Tarjan showed [Tarjan1972] how to use a depth first search to find the strongly connected components in a directed graph. This algorithm was extended by Esko Nuutila in the 1990’s [Nuutila1995] to store reachability information of strongly connected components at the same time.
The algorithm introduced by Nuutila depends on an underlying data structure to store and (very important!) merge reachability information. Examples of such data structures include, but are not limited to:
- Array of integers (indices of strongly connected components)
- Bit vectors (each 1-bit indicates a reachable component)
- Interval lists
Adding bit vector compression
Introduction
In our experiments, bit vectors provided (by far) the best performance when merging reachability information of two strongly connected components. This can be done by computing the bitwise OR of the two bit vectors. However, storing a O(n) size bit vector for all n strongly connected components (worst case: n = |V|, the number of vertices) is equivalent to constructing (little over half) an n by n reachability matrix. This is considered infeasible for large graphs, i.e. storing reachability information for a graph with 1,000,000 vertices (and an equal number of strongly connected components) requires (n * n + n) / 2 = 500,000,500,000 bits = over 58 gigabytes.
Introducing PWAH compression
In order to reduce the memory usage of the bit vectors, whilst maintaining its main advantage (efficiency of merge operations), we decided to design a bit vector compression technique called Partitioned Word-Aligned Hybrid Compression (PWAH), based on Word-Aligned Hybrid Compression (WAH) by Wu et al. [Wu2004].
PWAH compresses bits by using 64-bit words which are split into 2 (PWAH-2), 4 (PWAH-4) or 8 (PWAH-8) partitions. The idea is most easily explained by an example PWAH-4 compressed 64-bits word:
1010 | 10000000001011 | 111010000100101 | 000000000010010 | 000000010000011 |
header | partition 1: fill of 11 blocks of 1-bits | partition 2: literal of 15 bits | partition 3: fill of 18 blocks of 0-bits | partition 4: literal of 15 bits |
The header and the four different partitions should be interpreted as follows:
- Header: 4 bits that indicate the type of each of the four partitions. A 1-bit in the header indicates a fill partition (see description of partition 1), a 0-bit indicates a literalpartition (see description of partition 0).
- Partition 1: a fill partition (as indicated by the header): a long sequence of consecutive identical bits. The preceding 1-bit indicates that this partition represents a series of 1-bits, the other bits (0...01011) indicate the length of the consecutive series: 11 blocks = 165 bits (the block size of PWAH-4 equals 15 bits, see [VanSchaik2010, VanSchaik2011] for details).
- Partition 2: a literalpartition (as indicated by the header): 15 bits which are not consecutive identical bits. These bits are included uncompressed.
- Partition 3: a fillpartition representing 18 blocks of 0-bits (= 270 0-bits).
- Partition 4: another literal partition with 15 uncompressed bits.
The 64-bits PWAH compressed word represents a total of 465 uncompressed bits. Actively using some of the properties of a depth-first search, the compression scheme yields compressions of over a factor 6000 when applied to reachability information in strongly corrected components.
For a much more detailed description of PWAH in general, the different PWAH schemes, a theoretical and experimental analysis: see [VanSchaik2010, VanSchaik2011].
Paper errata
Errata for [VanSchaik2011]:
- In Figure 2 (“WAH compression example”) on page 4, the last word of the WAH compressed bit vector is not a 0-fill, but a 1-fill
References
In order of appeareance:
- VanSchaik2010: Sebastiaan J. van Schaik. Answering reachability queries on large directed graphs, introducing a new data structure using bit vector compression. MSc. thesis, Utrecht University, 2010. Download from here.
- VanSchaik2011: Sebastiaan J. van Schaik and Oege de Moor. A memory efficient reachability data structure through bit vector compression. In SIGMOD '11: Proceedings of the 37th SIGMOD international conference on Management of data, pages 913-924, New York, NY, USA, 2011. ACM. DOI: 10.1145/1989323.1989419.
- Tarjan1972: Robert E. Tarjan. Depth-first search and linear graph algorithms. SIAM Journal on Computing, 1(2):146{160, 1972.
- Nuutila1995: Esko Nuutila. Efficient Transitive Closure Computation in Large Digraphs. PhD thesis, Finnish Academy of Technology, 1995. http://www.cs.hut.fi/~enu/tc.html.
- Wu2004: Kesheng Wu, Ekow J. Otoo, and Arie Shoshani. An efficient compression scheme for bitmap indices. Technical report, ACM Transactions on Database Systems, 2004.
- Jin2008: R. Jin, Y. Xiang, N. Ruan and H. Wang. Efficiently answering reachability queries on very large directed graphs. In SIGMOD '08: Proceedings of the ACM SIGMOD international conference on Management of data, pages 595{608, New York, NY, USA, 2008.
- Jin2009: R. Jin, Y. Xiang, N. Ruan and D. Fuhry. 3-HOP: a high-compression indexing scheme for reachability query. In SIGMOD '09: Proceedings of the ACM SIGMOD international conference on Management of data, pages 813{826, New York, NY, USA, 2009.
Source code
The implementation is licensed under the GPL:
The following guide will help you download, compile and run the PWAH code on a 64-bit Linux machine. If you are a Windows user, your best bet is to install Eclipse (with the CDT plugin for C++ development) and import the PWAH Eclipse project into your workspace.
The C++ source code (including a Makefile and an Eclipse CDT project specification) of the implementation is available through a publicly accessible Git repository. If you have Git (see the Git website) installed, you can run the following command to download the source code (and datasets) into a directory called “PWAHStackTC”:
git clone https://github.com/sj/PWAHStackTC
If you don’t like using Git and would rather have a ZIP-file with the most recent implementation, visit the public GitHub repository and click on "Download ZIP".
Compilation instructions can be found in the provided README.txt file, which has been tested with the GNU C++ compiler (versions 4.5, 4.6, 4.7). Should you encounter any problems compiling or running PWAHStackTC, feel free to open a GitHub issue. After having successfully compiled the source code in the "Release" directory, invoke PWAHStackTC with the command line parameter “--help
” (that’s “help” preceded by two dashes) for usage information. Simple usage example on an extremely small graph (10 vertices, included in the source code):
$ ./PWAHStackTC --filename=../../datasets/nuutila.graph Parsing graph file: /tmp/nuutila.graph... done, that took 0.1780 msecs Number of vertices: 10, number of edges: 17 Computing transitive closure using PWAHStackTC -- not indexed> (ALWAYS using multi-OR)... done, that took 0.0690 msecs Number of components (vertices in condensation graph): 4 Counting number of edges in condensed transitive closure... 7 edges Counting number of edges in transitive closure... 62 edges Memory used by the reachability data structures of the PWAHStackTC -- not indexed> algorithm: 192 bits Memory used by equivalent interval lists: 192 bits Total memory used by the PWAHStackTC -- not indexed> algorithm: 512 bits Generating 1000000 random sources and destinations... done, that took 53.3270 msecs Performing 1000000 random queries... done, that took 22.7400 msecs 530326 pairs turned out to be reachable. Number of bits required to store WAH compressed bitsets: 512 Average construction time over 1 runs: 0.0690 msecs Average query time over 1 runs: 22.7400 msecs for 1000000 queries AVG_CONSTR_TIME=0.0690 AVG_QUERY_TIME=22.7400 MEM_USAGE=512
PWAHStackTC expects its input graphs in the following format:
- line 1: the number of vertices n and the number of edges m, separated by a space
- following n lines: adjacency lists for vertex n, listing the indices of adjacent vertices. Note that indices start at 1. A vertex without any adjacent vertices is encoded as an empty line.
Data sets
Data sets from multiple sources have been used for this paper, most noticeably from [Jin2008] and [Jin2009] and data sets kindly provided by Semmle Ltd. The graph files (graph format described above) are included in the Git repository.
Common compile-time and run-time errors
Compiler error: “invalid suffix (...) on integer constant”
(...) Building file: ../src/datastructures/bitsets/wah/WAHBitSet.cpp Invoking: GCC C++ Compiler g++ -O3 -Wall -c -fmessage-length=0 -MMD -MP -MF"src/datastructures/bitsets/wah/WAHBitSet.d" -MT"src/datastructures/bitsets/wah/WAHBitSet.d" -o "src/datastructures/bitsets/wah/WAHBitSet.o" "../src/datastructures/bitsets/wah/WAHBitSet.cpp" In file included from ../src/datastructures/bitsets/wah/WAHBitSet.cpp:26: ../src/datastructures/bitsets/wah/WAHBitSet.h:53:32: error: invalid suffix "b01111111111111111111111111111111" on integer constant ../src/datastructures/bitsets/wah/WAHBitSet.h:56:34: error: invalid suffix "b00000000000000000000000000000000" on integer constant ../src/datastructures/bitsets/wah/WAHBitSet.h:59:35: error: invalid suffix "b11000000000000000000000000000000" on integer constant ../src/datastructures/bitsets/wah/WAHBitSet.h:62:36: error: invalid suffix "b10000000000000000000000000000000" on integer constant (...)
This means you’re using a GCC version older than 4.3. Older GCC versions don’t support integer constants written in binary format, see: http://gcc.gnu.org/gcc-4.3/changes.html. At the time of writing, GCC 4.3 has been available for more than five years. Please upgrade your compiler.
Compiler error: “this implementation depends on 64-bit longs, your platform does not seem to offer those. Aborting.”
This message is fairly self-explanatory: you’re trying to compile the source code on a 32-bit operating system, but both the theory behind PWAH requires 64-bit data types. If you do have a 64-bit processor, you will have to upgrade your operating system to its 64-bit equivalent. If you have an old 32-bit hardware architecture, the only solution is buying a new machine. Sorry.
Contact me
If you have any problems downloading, compiling, or using the implementation, please open a GitHub issue in the repository.