<h1>Implementing the count-min sketch in Owl</h1>
<p>Pratap Singh, 2019-08-20 (<a href="https://pratap.dev/ocaml/owl/count-min-sketch/sublinear-algorithms/countmin-sketch">https://pratap.dev/ocaml/owl/count-min-sketch/sublinear-algorithms/countmin-sketch</a>)</p>
<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<p>This post describes my project implementing the <em>count-min sketch</em> in OCaml as part of the <a href="https://ocaml.xyz">Owl library</a> for scientific computing. This work was conducted under the supervision of <a href="https://kcsrk.info">Professor KC Sivaramakrishnan</a>, with guidance from <a href="https://yaduvasudev.github.io/">Professor Yadu Vasudev</a>, at the Indian Institute of Technology Madras during summer 2019.</p>
<h2 id="the-count-min-sketch">The count-min sketch</h2>
<p>The <a href="http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf">count-min sketch</a> is a probabilistic data structure that provides an approximate frequency table over some stream of data. At the cost of returning possibly inaccurate results, the sketch requires memory usage and query times that are independent of the contained data. Variants of this sketch have been proposed for a variety of <a href="https://dl.acm.org/citation.cfm?id=1863207">networking</a> <a href="https://ieeexplore.ieee.org/abstract/document/6006023">applications</a> and in <a href="https://www.hindawi.com/journals/mpe/2013/516298/">MapReduce/Hadoop implementations</a>, and existing implementations include Twitter’s <a href="https://github.com/twitter/algebird"><code class="language-plaintext highlighter-rouge">algebird</code> library</a>. The data structure supports the following two operations:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">incr(v)</code>: increment the count of value <code class="language-plaintext highlighter-rouge">v</code></li>
<li><code class="language-plaintext highlighter-rouge">count(v)</code>: get the current frequency count of value <code class="language-plaintext highlighter-rouge">v</code></li>
</ul>
<p>The count-min sketch consists of essentially the same structure as a Bloom filter. It maintains a two-dimensional table <code class="language-plaintext highlighter-rouge">t</code> of counters, containing <code class="language-plaintext highlighter-rouge">l</code> rows each of length <code class="language-plaintext highlighter-rouge">w</code>. Associated with row <code class="language-plaintext highlighter-rouge">j</code> is a hash function <span><script type="math/tex">h_j : U \to \{0,1,\dots,w-1\}</script></span>, where <script type="math/tex">U</script> is the set of all possible input values. Each counter is initialized to <code class="language-plaintext highlighter-rouge">0</code>. <code class="language-plaintext highlighter-rouge">incr(v)</code> increments the values at <code class="language-plaintext highlighter-rouge">t[j][h_j(v)]</code> for <code class="language-plaintext highlighter-rouge">j</code> in <code class="language-plaintext highlighter-rouge">{0,1,...,l-1}</code>. Thus, the data structure maintains <code class="language-plaintext highlighter-rouge">l</code> separate counts of each element put in. However, due to hash collisions these counters may differ from the true count. Note that the error in each counter can never be negative, since counters are only ever increased rather than decreased. Thus, the best estimator for the true count of an element is the minimum of all the corresponding counter values. So <code class="language-plaintext highlighter-rouge">count(v)</code> returns <code class="language-plaintext highlighter-rouge">min(t[j][h_j(v)] over j in {0,1,...,l-1})</code>.</p>
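<p>The two operations described above can be sketched in a few lines of OCaml. This is a minimal illustration, not the Owl implementation: the record type and the function <code class="language-plaintext highlighter-rouge">slot</code> are hypothetical names, and hashing the pair <code class="language-plaintext highlighter-rouge">(j, v)</code> is only a stand-in for a properly chosen family of hash functions.</p>

```ocaml
(* Illustrative count-min sketch; all names here are hypothetical. *)
type sketch = {
  l : int;                  (* number of rows *)
  w : int;                  (* counters per row *)
  table : int array array;  (* l rows of w counters, all initially 0 *)
}

let create ~l ~w = { l; w; table = Array.make_matrix l w 0 }

(* Stand-in for h_j(v): hash the value together with the row index.
   This is an assumption made for brevity, not a pairwise independent family. *)
let slot s j v = Hashtbl.hash (j, v) mod s.w

(* incr(v): add one to the counter that v maps to in every row. *)
let incr s v =
  for j = 0 to s.l - 1 do
    let i = slot s j v in
    s.table.(j).(i) <- s.table.(j).(i) + 1
  done

(* count(v): the minimum of the l counters that v maps to. *)
let count s v =
  let m = ref max_int in
  for j = 0 to s.l - 1 do
    m := min !m s.table.(j).(slot s j v)
  done;
  !m
```

<p>Because counters are only ever incremented, <code class="language-plaintext highlighter-rouge">count s v</code> can overestimate the true frequency of <code class="language-plaintext highlighter-rouge">v</code> but never underestimate it.</p>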
<p>The count-min sketch is typically parameterized in terms of <script type="math/tex">\epsilon</script>, the approximation ratio, and <script type="math/tex">\delta</script>, the failure probability. Given these parameters, the sketch provides the following guarantee:</p>
<p>For all values <code class="language-plaintext highlighter-rouge">v</code> in input stream <code class="language-plaintext highlighter-rouge">S</code>, letting <code class="language-plaintext highlighter-rouge">f(v)</code> be the true frequency of <code class="language-plaintext highlighter-rouge">v</code> in <code class="language-plaintext highlighter-rouge">S</code>:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">count(v)</code> <script type="math/tex">\geq</script> <code class="language-plaintext highlighter-rouge">f(v)</code></li>
<li>with probability at least <script type="math/tex">1 - \delta</script>:
<ul>
<li><code class="language-plaintext highlighter-rouge">count(v)</code> <script type="math/tex">-</script> <code class="language-plaintext highlighter-rouge">f(v)</code> <span><script type="math/tex">% <![CDATA[
< \epsilon \times ||S||_1 %]]></script></span></li>
</ul>
</li>
</ul>
<p>where <span><script type="math/tex">||S||_1</script></span> is the <span><script type="math/tex">L_1</script></span> norm of <code class="language-plaintext highlighter-rouge">S</code>, equal to the total number of elements in <code class="language-plaintext highlighter-rouge">S</code>.</p>
<p>It can be shown that this guarantee is achieved by setting <script type="math/tex">w = \Theta(\frac{1}{\epsilon})</script> and <script type="math/tex">l = \Theta(\log(\frac{1}{\delta}))</script>. Thus, the memory size of the sketch is <script type="math/tex">\Theta(\frac{1}{\epsilon} \log(\frac{1}{\delta}))</script>, and the time to execute each of <code class="language-plaintext highlighter-rouge">count</code> and <code class="language-plaintext highlighter-rouge">incr</code> is <script type="math/tex">\Theta(\log(\frac{1}{\delta}))</script>. (See the proof <a href="http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf">here</a> or <a href="https://www.sketchingbigdata.org/fall17/lec/lec7.pdf">here</a>.)</p>
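<p>To make the sizing concrete, the following sketch computes table dimensions from <script type="math/tex">\epsilon</script> and <script type="math/tex">\delta</script>. It assumes the constants from the standard analysis of the sketch, <script type="math/tex">w = \lceil e/\epsilon \rceil</script> and <script type="math/tex">l = \lceil \ln(1/\delta) \rceil</script>; the function name <code class="language-plaintext highlighter-rouge">dimensions</code> is hypothetical, and the Owl implementation may use different constants.</p>

```ocaml
(* Table dimensions for a given epsilon and delta, assuming
   w = ceil(e / epsilon) and l = ceil(ln(1 / delta)). *)
let dimensions ~epsilon ~delta =
  let w = int_of_float (ceil (exp 1.0 /. epsilon)) in
  let l = int_of_float (ceil (log (1.0 /. delta))) in
  (w, l)

let () =
  let w, l = dimensions ~epsilon:0.001 ~delta:0.01 in
  (* The number of counters depends only on epsilon and delta,
     not on how much data is later inserted. *)
  Printf.printf "w = %d, l = %d, counters = %d\n" w l (w * l)
```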
<p>Note that these resource usages are independent of both <span><script type="math/tex">||S||_1</script></span> and the number of distinct elements of <code class="language-plaintext highlighter-rouge">S</code>. Contrast this with the most common implementation of a frequency table: a hash table storing <code class="language-plaintext highlighter-rouge">(value, count)</code> pairs. The size of the hash table scales with the number of distinct values encountered in the stream, which is <span><script type="math/tex">O(||S||_1)</script></span>.</p>
<p>The cost of this very low resource usage is some inaccuracy in the estimated frequencies. Note that the error bound is proportional to <span><script type="math/tex">||S||_1</script></span>, the size of the whole stream, rather than to the frequency of the element being queried. This means that the sketch can give very inaccurate results in relative terms for highly infrequent elements, but is guaranteed (with high probability) to give good estimates for the most frequent elements. Thus, the sketch is most useful for problems in which we are concerned with the most common elements in the stream.</p>
<h2 id="the-heavy-hitters-sketch">The heavy-hitters sketch</h2>
<p>The simplest problem for which a count-min sketch is used is finding the frequent elements in a stream: the <em><script type="math/tex">k</script>-heavy hitters</em> problem. The <script type="math/tex">k</script>-heavy hitters in a data set are those elements whose relative frequency is greater than <script type="math/tex">1/k</script>. We can use the count-min sketch to build a <a href="http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf#page=12">heavy-hitters sketch</a>, which supports the following operations:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">add(v)</code>: increment the count of value <code class="language-plaintext highlighter-rouge">v</code></li>
<li><code class="language-plaintext highlighter-rouge">get()</code>: return the <code class="language-plaintext highlighter-rouge">k</code>-heavy hitters and their frequencies</li>
</ul>
<p>The heavy hitters sketch is parameterized in terms of <script type="math/tex">k</script>, <script type="math/tex">\epsilon</script>, and <script type="math/tex">\delta</script>. It provides the following guarantee:<br />
With probability at least <script type="math/tex">1 - \delta</script>, the data structure</p>
<ul>
<li>outputs every element that appears at least <script type="math/tex">\frac n k</script> times, and</li>
<li>outputs only elements which appear at least <script type="math/tex">\frac{n}{k} - \epsilon n</script> times.</li>
</ul>
<p>Here <script type="math/tex">n</script> is the total number of elements added to the sketch. Note that for the sketch to give useful results, we should set <script type="math/tex">% <![CDATA[
\epsilon < 1/k %]]></script>.</p>
<p>Note that the count-min sketch does not actually store the values whose frequencies it counts, instead relying on hashing to make the counts retrievable. Indeed, simply storing each value would get us to the same asymptotic memory usage as a hash table, thus defeating the point of the sketch. But in the heavy hitters case, we need to maintain the actual values of the heavy hitters as well as their frequencies. This is done using a min-heap. In addition to a count-min sketch for frequencies, the heavy hitters sketch maintains a min-heap of the elements currently considered to be heavy hitters. Whenever <code class="language-plaintext highlighter-rouge">add(v)</code> is called, first <code class="language-plaintext highlighter-rouge">incr(v)</code> is called on the underlying count-min sketch. Then, <code class="language-plaintext highlighter-rouge">count(v)</code> is called to get the current count of <code class="language-plaintext highlighter-rouge">v</code>. If this count is greater than <script type="math/tex">n/k</script> (where <script type="math/tex">n</script> is the total number of elements added so far), then the pair <code class="language-plaintext highlighter-rouge">(v, count(v))</code> is added to the min-heap. If a pair containing <code class="language-plaintext highlighter-rouge">v</code> is already present in the heap, then its position is adjusted to reflect the new count. Finally, we remove elements whose frequency is too low to be a heavy hitter: while the minimum element in the heap has frequency less than <script type="math/tex">n/k</script>, we remove the minimum element. To implement <code class="language-plaintext highlighter-rouge">get()</code>, we simply transform the heap into a list or array and return it.</p>
<p>The memory usage is that used by the count-min sketch and the min-heap. Note that the maximum size of the heap is the number of heavy-hitters, which is <script type="math/tex">O(k)</script>. Thus, the total memory usage is <script type="math/tex">\Theta(\frac{1}{\epsilon} \log(\frac{1}{\delta}) + k)</script>. The update time is <script type="math/tex">\Theta(\log(\frac{1}{\delta}) + k)</script>.</p>
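<p>The <code class="language-plaintext highlighter-rouge">add</code> and <code class="language-plaintext highlighter-rouge">get</code> operations can be sketched as follows. This is an illustration under simplifying assumptions, not the Owl code: candidates are kept in a <code class="language-plaintext highlighter-rouge">Hashtbl</code> rather than a min-heap (so pruning is linear in the number of candidates), hashing <code class="language-plaintext highlighter-rouge">(j, v)</code> stands in for the per-row hash functions, and all names are hypothetical.</p>

```ocaml
(* Illustrative heavy-hitters sketch; names and layout are hypothetical. *)
type 'a hh = {
  k : int;
  l : int;
  w : int;
  table : int array array;           (* underlying count-min table *)
  mutable n : int;                   (* total number of elements added *)
  candidates : ('a, int) Hashtbl.t;  (* current heavy hitters and estimates *)
}

let create ~k ~l ~w =
  { k; l; w; table = Array.make_matrix l w 0; n = 0;
    candidates = Hashtbl.create 16 }

(* Stand-in for the row-j hash function. *)
let slot h j v = Hashtbl.hash (j, v) mod h.w

let add h v =
  h.n <- h.n + 1;
  (* incr(v) followed by count(v) on the count-min table. *)
  let est = ref max_int in
  for j = 0 to h.l - 1 do
    let i = slot h j v in
    h.table.(j).(i) <- h.table.(j).(i) + 1;
    est := min !est h.table.(j).(i)
  done;
  (* est > n/k, written with integers as est * k > n. *)
  if !est * h.k > h.n then Hashtbl.replace h.candidates v !est;
  (* Remove candidates whose stored estimate has fallen below n/k. *)
  Hashtbl.filter_map_inplace
    (fun _ c -> if c * h.k < h.n then None else Some c)
    h.candidates

(* get(): the current candidates with their estimated counts. *)
let get h = Hashtbl.fold (fun v c acc -> (v, c) :: acc) h.candidates []
```

<p>A min-heap ordered by count would let the pruning step stop at the first element that is still above the threshold, instead of scanning every candidate as this version does.</p>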
<h2 id="implementation">Implementation</h2>
<p>My implementation was designed so as to be as minimal as possible, allowing the user to choose how data is input. Additionally, I wanted to make it easy to use the sketches in an “online” situation, where data arrive continuously and frequency information can be requested at any time. There were a few issues that I came across when designing and implementing the sketch.</p>
<h3 id="different-types-of-table">Different types of table</h3>
<p>The count-min sketch requires an underlying table to store the counters, and there were a couple of choices for this. OCaml provides a native <code class="language-plaintext highlighter-rouge">array</code> data structure, while Owl provides various types of <code class="language-plaintext highlighter-rouge">ndarray</code> built on top of the OCaml <code class="language-plaintext highlighter-rouge">Bigarray</code>. The Owl <code class="language-plaintext highlighter-rouge">ndarray</code> is designed to support complex operations like slicing very efficiently, but the sketch only requires getting and setting a handful of individual values in each operation. Since the sketch is independent of the underlying implementation of the table, I decided to functorize my implementation to abstract this out. I first wrote a signature <code class="language-plaintext highlighter-rouge">Owl_countmin_table.Sig</code> which defines a type <code class="language-plaintext highlighter-rouge">t</code> for tables and initialization, increment, and get functions over it. I then wrote two implementations, <code class="language-plaintext highlighter-rouge">Owl_countmin_table.Native</code> and <code class="language-plaintext highlighter-rouge">Owl_countmin_table.Owl</code>, using the OCaml <code class="language-plaintext highlighter-rouge">array</code> and the Owl <code class="language-plaintext highlighter-rouge">ndarray</code> respectively. Owl <code class="language-plaintext highlighter-rouge">ndarray</code>s are available only with floating-point or complex values, so I chose a single-precision floating point value since we never use the decimal part. Finally, the actual count-min sketch was implemented as the functor <code class="language-plaintext highlighter-rouge">Owl.Countmin_sketch.Make</code>, which takes a module with signature <code class="language-plaintext highlighter-rouge">Owl_countmin_table.Sig</code>.</p>
<p>As explained later, this proved unnecessary since the native <code class="language-plaintext highlighter-rouge">array</code> outperformed the Owl array in all respects. However, I still feel it was a useful exercise since it ultimately made the code for the sketch more readable.</p>
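<p>The functorized design can be illustrated with a small self-contained sketch. The signature below is a hypothetical stand-in for <code class="language-plaintext highlighter-rouge">Owl_countmin_table.Sig</code>: the function names and argument order are assumptions, and only an <code class="language-plaintext highlighter-rouge">array</code>-backed table is shown.</p>

```ocaml
(* Hypothetical table signature: a type plus init, increment and get. *)
module type TABLE = sig
  type t
  val init : int -> int -> t          (* init l w: l rows of w zeroed counters *)
  val incr : int -> int -> t -> unit  (* incr j i t: add one to t[j][i] *)
  val get : int -> int -> t -> int    (* get j i t: read t[j][i] *)
end

(* A table backed by OCaml's built-in arrays. *)
module Native : TABLE = struct
  type t = int array array
  let init l w = Array.make_matrix l w 0
  let incr j i t = t.(j).(i) <- t.(j).(i) + 1
  let get j i t = t.(j).(i)
end

(* The sketch touches the table only through the TABLE interface. *)
module Make (T : TABLE) = struct
  type sketch = { l : int; w : int; table : T.t }

  let create ~l ~w = { l; w; table = T.init l w }

  (* Stand-in for the row-j hash function. *)
  let slot s j v = Hashtbl.hash (j, v) mod s.w

  let incr s v =
    for j = 0 to s.l - 1 do
      T.incr j (slot s j v) s.table
    done

  let count s v =
    let m = ref max_int in
    for j = 0 to s.l - 1 do
      m := min !m (T.get j (slot s j v) s.table)
    done;
    !m
end

module Sketch = Make (Native)
```

<p>Swapping in a different table, such as one backed by an <code class="language-plaintext highlighter-rouge">ndarray</code>, would then only require another module matching <code class="language-plaintext highlighter-rouge">TABLE</code>.</p>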
<h3 id="hashing">Hashing</h3>
<p>The count-min sketch requires a pairwise independent hash function family, from which the <script type="math/tex">h_j</script> are drawn. The simplest family of such hash functions is just <script type="math/tex">h(x) = (ax + b) \mod p</script>, where <script type="math/tex">p</script> is a prime and <script type="math/tex">a, b</script> are drawn uniformly at random from <script type="math/tex">\{0,1,\dots,p-1\}</script>. A good choice for <script type="math/tex">p</script> is the Mersenne prime <script type="math/tex">2147483647 = 2^{31} - 1</script>. In binary, this <script type="math/tex">p</script> is 31 1s, so taking <script type="math/tex">x \mod p</script> can be accomplished with shifts and bitwise ANDs instead of a division: since <script type="math/tex">2^{31} \equiv 1 \pmod{p}</script>, we have <script type="math/tex">x \equiv (x \bmod 2^{31}) + \lfloor x / 2^{31} \rfloor \pmod{p}</script>, where the first term is a bitwise AND of <script type="math/tex">x</script> and <script type="math/tex">p</script> and the second is a right shift by 31 bits. Finally, we need to take <script type="math/tex">\mod w</script> to get an index into the length-<script type="math/tex">w</script> row of the count-min table.</p>
<p>To make the sketches polymorphic, we need a hash function that can take any value and return an integer. This is not trivial in OCaml, but fortunately the <code class="language-plaintext highlighter-rouge">Hashtbl</code> module provides <code class="language-plaintext highlighter-rouge">Hashtbl.hash</code> which does exactly this. So <code class="language-plaintext highlighter-rouge">incr s x</code> first takes <code class="language-plaintext highlighter-rouge">Hashtbl.hash x</code>, then applies the above <script type="math/tex">(ax + b) \mod p</script> function.</p>
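<p>A hash function of this form might look like the following. This is a sketch assuming 63-bit native integers (a 64-bit platform), so that <script type="math/tex">ax + b</script> does not overflow; <code class="language-plaintext highlighter-rouge">mod_p</code> and <code class="language-plaintext highlighter-rouge">make_hash</code> are hypothetical names, and the reduction modulo <script type="math/tex">2^{31} - 1</script> is done by folding the high bits onto the low 31 bits rather than by a single bitwise AND.</p>

```ocaml
let p = 2147483647  (* 2^31 - 1 *)

(* x mod p for 0 <= x < 2^62, using 2^31 = 1 (mod p): fold the bits above
   position 31 onto the low 31 bits twice, then map p itself to 0. *)
let mod_p x =
  let fold y = (y land p) + (y lsr 31) in
  let y = fold (fold x) in
  if y = p then 0 else y

(* Uniform draw from {0, ..., p-1}. Random.int cannot be used directly
   because its bound must be below 2^30. *)
let random_coefficient () = Int64.to_int (Random.int64 (Int64.of_int p))

(* Draw a and b once, then hash any OCaml value into {0, ..., w-1}.
   Hashtbl.hash returns a value below 2^30, so a * x + b stays below 2^62. *)
let make_hash ~w =
  let a = random_coefficient () and b = random_coefficient () in
  fun v -> mod_p ((a * Hashtbl.hash v) + b) mod w
```

<p>Each row of the table would get its own function from <code class="language-plaintext highlighter-rouge">make_hash</code>, so that the rows use independently drawn coefficients.</p>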
<h3 id="heavy-hitters-heap-implementation">Heavy-hitters heap implementation</h3>
<p>One issue I ran into during development was how to implement a min-heap priority queue for the heavy-hitters sketch. The OCaml standard library doesn’t have one built in, and it’s not natural to write an array-based heap in a functional language like OCaml. Fortunately, Owl did have a heap implementation already built. However, this implementation (along with most existing implementations) doesn’t support the <code class="language-plaintext highlighter-rouge">decreaseKey</code> operation, which adjusts the priority of an element already in the heap and is commonly assumed in theoretical work. I spent some time attempting to modify the Owl implementation, but this turned out not to be effective. <code class="language-plaintext highlighter-rouge">decreaseKey</code> requires the data structure to remember the location of each element in the heap, and building this additional structure added a lot of bloat and complexity to the code. Instead, I ultimately decided to use OCaml’s native <code class="language-plaintext highlighter-rouge">Set</code> data structure, which is implemented using balanced binary trees and thus has a notion of ordering. While this made the code a lot more readable, it does mean that the runtime of <code class="language-plaintext highlighter-rouge">add</code> is linear in the number of heavy hitters. I decided this was an acceptable tradeoff since the number of heavy hitters is <code class="language-plaintext highlighter-rouge">O(k)</code>, and the heavy-hitters sketch already has a time dependency on <code class="language-plaintext highlighter-rouge">k</code>.</p>
<p>Using the OCaml native set presented its own problems, though. The native set is provided as a functor, <code class="language-plaintext highlighter-rouge">Set.Make</code>, which requires a module containing the type to be stored and a comparison function over that type. The elements to be stored here have the polymorphic type <code class="language-plaintext highlighter-rouge">'a * int</code>, meaning that we cannot directly use the <code class="language-plaintext highlighter-rouge">Set.Make</code> functor with a fixed type in our implementation. Instead, I used first-class modules to package up the output of the <code class="language-plaintext highlighter-rouge">Set.Make</code> functor within the heavy-hitters data structure. I’d never used this technique before, but it ended up being quite elegant.</p>
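<p>The general technique can be illustrated as follows. This is a minimal sketch of applying <code class="language-plaintext highlighter-rouge">Set.Make</code> at a type chosen by the caller and returning the result as a first-class module; the function names are hypothetical, and the way the module is stored inside the actual heavy-hitters structure may differ.</p>

```ocaml
(* Build a Set module over (value, count) pairs for any value type,
   ordered by count first and then by value, and return it as a
   first-class module. *)
let pair_set (type a) () : (module Set.S with type elt = a * int) =
  let module S = Set.Make (struct
    type t = a * int
    let compare (v1, c1) (v2, c2) =
      match Stdlib.compare (c1 : int) c2 with
      | 0 -> Stdlib.compare v1 v2
      | n -> n
  end) in
  (module S : Set.S with type elt = a * int)

(* Unpack the module where it is needed: here, to find the pair with the
   smallest count, which is the element a min-heap would expose. *)
let min_pair (type a) (pairs : (a * int) list) : (a * int) option =
  let module S = (val pair_set () : Set.S with type elt = a * int) in
  S.min_elt_opt (S.of_list pairs)
```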
<h2 id="benchmarking-and-testing">Benchmarking and testing</h2>
<p>Once the implementation was done, I had to test it. At first I just used simple tests based on generating random integers from various statistical distributions, but something more robust than this was required. The major difficulty with testing an approximate, probabilistic data structure like this sketch is that the expected output is only approximately known: the sketch can return frequencies that differ from the true frequencies in the underlying data without there necessarily being an error in the implementation. For this reason, I had to use average measures of error when testing.</p>
<p>I also benchmarked my implementation to understand the performance boost it gives over the naive hash-table solution, and to determine which of the two tables was more efficient.</p>
<h3 id="comparing-the-two-table-implementations">Comparing the two table implementations</h3>
<p>To test the two tables, I wrote a simple script that inserted <code class="language-plaintext highlighter-rouge">n</code> elements drawn from a provided distribution into a sketch with parameters <code class="language-plaintext highlighter-rouge">epsilon</code> and <code class="language-plaintext highlighter-rouge">delta</code>, then queried the sketch for the counts of <code class="language-plaintext highlighter-rouge">n</code> more elements drawn from the same distribution. I recorded the time for each <code class="language-plaintext highlighter-rouge">incr</code> and <code class="language-plaintext highlighter-rouge">count</code> operation using <code class="language-plaintext highlighter-rouge">Unix.gettimeofday</code>, being sure to call <code class="language-plaintext highlighter-rouge">Gc.compact</code> before calling the operation so that time spent in the garbage collector did not affect my measurements. I found that the OCaml native <code class="language-plaintext highlighter-rouge">array</code> implementation was about 40% faster (0.23s vs 0.33s for <script type="math/tex">10^5</script> operations) on <code class="language-plaintext highlighter-rouge">incr</code> and 10% faster (0.30s vs 0.33s for <script type="math/tex">10^5</script> operations) on <code class="language-plaintext highlighter-rouge">count</code> compared to the Owl <code class="language-plaintext highlighter-rouge">ndarray</code> implementation.</p>
<p>For memory, I used the <code class="language-plaintext highlighter-rouge">live_words</code> statistic provided by the <code class="language-plaintext highlighter-rouge">Gc</code> module, which measures the total number of words maintained in the OCaml heap. I measured the heap size before the structure was allocated and throughout the sequence of <code class="language-plaintext highlighter-rouge">incr</code> and <code class="language-plaintext highlighter-rouge">count</code> operations. However, I found that using this metric, the Owl <code class="language-plaintext highlighter-rouge">ndarray</code> implementation appeared in some cases to use zero memory! This is because the Owl <code class="language-plaintext highlighter-rouge">ndarray</code> is built on top of the OCaml <code class="language-plaintext highlighter-rouge">Bigarray</code>, which is designed to allow interoperation with C and Fortran arrays. The <code class="language-plaintext highlighter-rouge">Bigarray</code> array is actually allocated via an FFI to C code, and thus is not included in the OCaml GC heap. So just using <code class="language-plaintext highlighter-rouge">live_words</code> from the OCaml <code class="language-plaintext highlighter-rouge">Gc</code> was not actually measuring the size of these arrays.</p>
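<p>A measurement of this kind might look like the following sketch. The helper <code class="language-plaintext highlighter-rouge">words_used_by</code> is a hypothetical name, not part of the benchmark script; it reports the change in <code class="language-plaintext highlighter-rouge">live_words</code> across an allocation and, as described above, only sees memory on the OCaml heap.</p>

```ocaml
(* Number of live words on the OCaml major heap, headers included. *)
let live_words () = (Gc.stat ()).Gc.live_words

(* Run an allocation and report roughly how many heap words it keeps live.
   Memory allocated outside the OCaml heap (such as Bigarray data) is not
   counted by this measure. *)
let words_used_by (alloc : unit -> 'a) : 'a * int =
  Gc.compact ();
  let before = live_words () in
  let x = alloc () in
  Gc.compact ();
  let after = live_words () in
  (x, after - before)

let () =
  let _table, words = words_used_by (fun () -> Array.make_matrix 5 2719 0) in
  Printf.printf "table keeps about %d words live\n" words
```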
<p>To rectify this, I wrote a small utility called <code class="language-plaintext highlighter-rouge">ocaml-getrusage</code>, a thin OCaml wrapper around the <code class="language-plaintext highlighter-rouge">getrusage</code> Unix system call written using the OCaml C FFI. <code class="language-plaintext highlighter-rouge">getrusage</code> returns a struct containing various statistics pertaining to the resource usage of a process. With this tool, I found that the Owl <code class="language-plaintext highlighter-rouge">ndarray</code> generally used about 10-20% more memory than the native array for the same parameters <code class="language-plaintext highlighter-rouge">epsilon</code> and <code class="language-plaintext highlighter-rouge">delta</code>.</p>
<p>Since the native <code class="language-plaintext highlighter-rouge">array</code> outperforms the Owl <code class="language-plaintext highlighter-rouge">ndarray</code> in both speed and memory usage, the native implementation (<code class="language-plaintext highlighter-rouge">Owl_base.Countmin_sketch.Native</code>) should be preferred.</p>
<h3 id="comparing-to-the-naive-implementation">Comparing to the naive implementation</h3>
<p>I also wanted to measure the performance of the sketch against a naive hash-table-based frequency table, which I wrote using the OCaml <code class="language-plaintext highlighter-rouge">Hashtbl</code> module. I used the <code class="language-plaintext highlighter-rouge">news.txt</code> corpus available <a href="https://github.com/ryanrhymes/owl_dataset">here</a>, comprising online news articles containing a total of about 61 million words. I wrote a script that ran both a count-min sketch and a hash-table over this dataset, then extracted the frequencies of each unique word in both data structures and compared them. I also measured the total memory usage for both structures in the same way as described above. (Since I only measured using the native implementation, using <code class="language-plaintext highlighter-rouge">Gc</code> heap words was valid.) The first problem I ran into with this approach was that the average relative error rate was enormous, often on the order of about 100,000%. This was due to large errors in the estimated count-min frequency of very uncommon words. Since most words appear very infrequently, this leads to a large average error rate when the average is taken over all unique words. Instead, bearing in mind that the typical use case of the sketch is to get information about frequent elements in the data, I decided to take the average over only elements with a relative frequency greater than <code class="language-plaintext highlighter-rouge">phi</code>. Using this approach, I generated the following plot showing the tradeoff between memory usage and accuracy:</p>
<p><img src="/assets/images/tradeoff_by_phis.png" alt="Tradeoff between memory usage and accuracy" /></p>
<p>The three measures of error are the mean relative difference in frequency, median relative difference in frequency, and mean difference in frequency relative to the L1 norm of the whole dataset. The ranges used were: <code class="language-plaintext highlighter-rouge">epsilon</code> in <code class="language-plaintext highlighter-rouge">[0.00001, 0.01]</code>, <code class="language-plaintext highlighter-rouge">delta</code> in <code class="language-plaintext highlighter-rouge">[0.0001, 0.01]</code>, <code class="language-plaintext highlighter-rouge">phi</code> in <code class="language-plaintext highlighter-rouge">[0.001, 0.05]</code>. On my 64-bit machine, one heap word is 8 bytes, meaning that the sketches ranged in size from 5.6 KB to 11.2 MB. Compare this to the hash-table implementation, which used about 25 MB. As expected, larger tables resulted in a lower error rate. There appeared to be an approximate power law relationship, meaning that small changes to <code class="language-plaintext highlighter-rouge">epsilon</code> and <code class="language-plaintext highlighter-rouge">delta</code> could lead to very large changes in the accuracy of the sketch; my own experience testing and debugging my implementation matches this. Combined with the difficulty of defining and measuring accuracy, this means that users of the sketch must experiment with these parameters to determine an appropriate balance for their particular situation. However, there are clear benefits to using the sketch versus a simple hash table in cases where absolute precision is not required and only the most frequent elements are of interest.</p>
<h2 id="ongoing-and-future-work">Ongoing and future work</h2>
<p>This implementation already provides the basic functionality of the sketch, which is sufficient for a number of the previously mentioned applications. However, some avenues for further work exist. For me, one of the most interesting of these is to allow for distribution of the sketch over several separate processors or devices. Note that if two sketches are initialized with the exact same hash functions and table size, they can be combined at any time by simply adding the two tables element-by-element. We could imagine a situation where many small, resource-constrained devices are running in a distributed network, but we still want to be able to query the whole network for overall frequency information. In this case, initializing a sketch with the same hash functions on each device would allow us to maintain a network-wide frequency table in this way. I have begun to implement and test this functionality on top of my implementation.</p>
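<p>The element-by-element combination can be sketched on plain counter tables. This is an illustration only, not the in-progress Owl code: the tables are bare <code class="language-plaintext highlighter-rouge">int array array</code> values, both sides share the fixed dimensions and the stand-in hash <code class="language-plaintext highlighter-rouge">slot</code>, and all names are hypothetical.</p>

```ocaml
(* Shared configuration: both devices must use the same dimensions and
   the same hash functions for the merge to be meaningful. *)
let l = 3
let w = 64
let slot j v = Hashtbl.hash (j, v) mod w

let create () = Array.make_matrix l w 0

let incr t v =
  for j = 0 to l - 1 do
    let i = slot j v in
    t.(j).(i) <- t.(j).(i) + 1
  done

let count t v =
  let m = ref max_int in
  for j = 0 to l - 1 do
    m := min !m t.(j).(slot j v)
  done;
  !m

(* Merge two tables by adding corresponding counters.
   Array.map2 raises Invalid_argument if the dimensions differ. *)
let merge t1 t2 = Array.map2 (fun r1 r2 -> Array.map2 ( + ) r1 r2) t1 t2
```

<p>The merged table is identical to the one a single sketch would have built had it seen both streams, so its estimates carry the same guarantee with respect to the combined stream.</p>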
<h2 id="acknowledgements">Acknowledgements</h2>
<p>This work was funded by a Harvard OCS undergraduate travel and research grant. I am very grateful to Liang Wang, Richard Mortier, and others at OCaml Labs for their guidance on aspects of this project.</p>