#include <iostream>
#include <string>
#include <madoka.h>
int main(int argc, char *argv[]) {
madoka::Sketch sketch;
sketch.create();
std::string key;
while (std::getline(std::cin, key)) {
sketch.inc(key.c_str(), key.length());
}
sketch.save(argv[1]);
return 0;
}
$ g++ draw-a-sketch.cc -lmadoka
$ ./a.out SKETCH < KEYSET
Let’s try to draw a sketch. This example reads a keyset from standard input (std::cin) and draws a sketch (madoka::Sketch
). Then, this example saves the sketch to a file specified by the 1st command line argument (argv[1]). The following are the points of this example.
#include </code> is needed to use Madoka. Data types, constants and classes are defined in madoka.h.
madoka::Sketch
madoka::Sketch
represents a sketch and provides functions for sketching.madoka::Sketch::create()
create()
is a function to make a sketch. You must not start sketching before making a sketch.madoka::Sketch::inc()
inc()
is a function to increment a value associated with a key. The 1st argument specifies the starting address and the 2nd argument specifies the length in bytes of the key. Yeah, this is sketching.inc()
does nothing if the current value is saturated.madoka::Sketch::save()
save()
is a function to save a sketch. The 1st argument specifies the path of the file.Note that an option -lmadoka is needed to build this example. If you have installed Madoka, pkg-config madoka --libs is available to get the required options.
#include <iostream>
#include <string>
#include <madoka.h>
int main(int argc, char *argv[]) {
madoka::Sketch sketch;
sketch.load(argv[1]);
// sketch.open(argv[1]);
std::string key;
while (std::getline(std::cin, key)) {
std::cout << key << ": "
<< sketch.get(key.c_str(), key.length()) << std::endl;
}
return 0;
}
$ g++ look-at-a-sketch.cc -lmadoka
$ ./a.out SKETCH < KEYSET
Next, let’s try to look at a sketch. This example loads a sketch from a file specified by the 1st command line argument (argv[1]). Then, this example looks up keys read from standard input (std::cin). The following are the points of this example.
madoka::Sketch::load()
load()
is a function to load a sketch from a file. The 1st argument specifies the path of the file.open()
is also available to access a sketch file when you don’t want to load the whole file. open()
uses memory mapped I/O instead of reading the whole sketch into memory.madoka::Sketch::get()
get()
is a function to get the value associated with a key. The 1st argument specifies the starting address and the 2nd argument specifies the length in bytes of the key.h3. Use other brushes
#include <iostream>
#include <madoka.h>
int main(int argc, char *argv[]) {
madoka::Sketch sketch;
sketch.create();
sketch.set("QB", 2, 10);
std::cout << "QB: "
<< sketch.add("QB", 2, 5) << std::endl;
sketch.set("QB", 2, 7);
std::cout << "QB: "
<< sketch.get("QB", 2) << std::endl;
return 0;
}
$ g++ use-other-brushes.cc -lmadoka
$ ./a.out
QB: 15
QB: 15
madoka::Sketch
provides other drawing functions named set()
and add()
. This example shows how these functions work.
madoka::Sketch::set()
set()
is a function to update a value associated with a key. The 1st argument specifies the starting address and the 2nd argument specifies the length in bytes of the key. The 3rd argument specifies the value.set()
does nothing when the specified value is not greater than the current associated value.madoka::Sketch::add()
add()
is a function to perform an addition. The 1st argument specifies the starting address and the 2nd argument specifies the length in bytes of the key. The 3rd argument specifies the value to be added. The return value of add()
is the result of the addition.In this example, the 1st set()
changes the value associated with "QB" from 0 to 10. Then, add()
adds 5 to that value (from 10 to 15). After that, the 2nd set()
does nothing because the specified value (7) is less than the current value (15).
#include <iostream>
#include <madoka.h>
int main(int argc, char *argv[]) {
const madoka::UInt64 MY_WIDTH = 1ULL << 24;
const madoka::UInt64 MY_MAX_VALUE = 10;
madoka::Sketch sketch;
sketch.create(MY_WIDTH, MY_MAX_VALUE);
std::cout << "width: " << sketch.width() << std::endl;
std::cout << "max_value: " << sketch.max_value() << std::endl;
std::cout << "size: " << sketch.file_size() << std::endl;
return 0;
}
$ g++ customize-a-sketch.cc -lmadoka
$ ./a.out
width: 16777216
max_value: 15
size: 25165904
Let’s customize a sketch for your application. A Count-Min sketch is a probabilistic data structure and the accuracy depends on its parameters, width and depth, and the target data stream.
Basically, accurate sketching requires a large width and a longer stream requires a larger width, but a larger width requires a larger memory space and increases cache misses. The other parameter, depth, also has an effect on accuracy but Madoka uses a fixed depth, actually madoka::SKETCH_DEPTH (3), based on benchmarks. Instead, Madoka has another parameter, max_value, that specifies the upper limit of the values. By using a small max_value, you can save memory.
To customize a sketch, specify width and max_value when creating a sketch. See the following for more details.
create()
rounds out a given max_value to 1, 3, 15, 255, 65535 or 245 - 1.This example creates a customized sketch and prints the size. Note that the size in bytes of a sketch is approximately width x 8 if max_value == 245 - 1, or otherwise width x depth x log2(max_value + 1) / 8.
#include <iostream>
#include <madoka.h>
int main(int argc, char *argv[]) {
madoka::Sketch sketch;
sketch.create();
sketch.set("Sayaka", 6, 100);
sketch.clear();
std::cout << "Sayaka: "
<< sketch.get("Sayaka", 6) << std::endl;
return 0;
}
$ g++ clear-a-sketch.cc -lmadoka
$ ./a.out
Sayaka: 0
madoka::Sketch
provides an interface to clear a sketch. This example creates a sketch and updates the value associated with "Sayaka" from 0 to 100, but clear()
fills the sketch with 0s. As a result, get()
returns 0 for "Sayaka".
madoka::Sketch::clear()
clear()
is a function to clear a sketch, or more precisely, clear()
fills a sketch with 0s.#include <iostream>
#include <madoka.h>
int main(int argc, char *argv[]) {
madoka::Sketch sketch;
sketch.create();
sketch.set("Kyoko", 5, 100);
madoka::Sketch snapshot;
snapshot.copy(sketch);
sketch.add("Kyoko", 5, 50);
std::cout << "Kyoko (original): "
<< sketch.get("Kyoko", 5) << std::endl;
std::cout << "Kyoko (snapshot): "
<< snapshot.get("Kyoko", 5) << std::endl;
return 0;
}
$ g++ copy-a-sketch.cc -lmadoka
$ ./a.out
Kyoko (original): 150
Kyoko (snapshot): 100
madoka::Sketch
provides an interface to copy a sketch. This example draws a sketch and creates its copy as a snapshot. Then, this example updates the original sketch.
madoka::Sketch::copy()
copy()
is a function to create a copy of a sketch. The 1st argument specifies the source sketch.As shown in this example, copy()
is useful to create a snapshot in memory. If you want to save a snapshot as a file, save()
is an easier choice.
#include <iostream>
#include <madoka.h>
madoka::UInt64 divide_by_2(madoka::UInt64 value) {
return value / 2;
}
int main(int argc, char *argv[]) {
madoka::Sketch sketch;
sketch.create();
sketch.set("Kaname", 6, 8);
sketch.set("Madoka", 6, 15);
sketch.filter(divide_by_2);
// sketch.filter([](madoka::UInt64 value) { return value / 2; });
std::cout << "Kaname: "
<< sketch.get("Kaname", 6) << std::endl;
std::cout << "Madoka: "
<< sketch.get("Madoka", 6) << std::endl;
return 0;
}
$ g++ apply-a-filter.cc -lmadoka
$ ./a.out
Kaname: 4
Madoka: 7
madoka::Sketch
provides a filter feature which can be used to reduce errors and to simulate decays. This example uses a filter for dividing all the values in a sketch by 2.
madoka::Sketch::filter()
filter()
is a function to apply a filter to a sketch, or more precisely, filter()
applies a filter to all the values in a sketch. The 1st argument specifies the filter.filter()
accepts a pointer to a function that takes the current value (madoka::UInt64
) and returns the filtered value (madoka::UInt64
).madoka::UInt64
.filter()
does nothing if the argument is NULL.For example, by using this feature, you can implement a variety of lossy conservative updates. Also, you can divide all the values by 2 when one of the values reaches max_value. In addition, you can replace all the values with their binary logarithms for compression, etc.
#include <iostream>
#include <madoka.h>
madoka::UInt64 logarithmize(madoka::UInt64 value) {
return 63 - ::__builtin_clzll(value | 1);
}
int main(int argc, char *argv[]) {
madoka::Sketch sketch;
sketch.create(100);
sketch.set("Akemi", 5, 256);
sketch.set("Homura", 6, 16777216);
madoka::Sketch new_sketch;
new_sketch.shrink(sketch, 10, 15, logarithmize);
std::cout << "width: "
<< new_sketch.width() << std::endl;
std::cout << "max_value: "
<< new_sketch.max_value() << std::endl;
std::cout << "Akemi: "
<< new_sketch.get("Akemi", 5) << std::endl;
std::cout << "Homura: "
<< new_sketch.get("Homura", 6) << std::endl;
return 0;
}
$ g++ shrink-a-sketch.cc -lmadoka
$ ./a.out
width: 10
max_value: 15
Akemi: 8
Homura: 15
Let’s shrink a sketch for saving memory. This is a reasonable answer to the question “How can I determine the best values for width and max_value?”.
During or after sketching, you may find that width and max_value are too large. In such a case, you can shrink the sketch, or more precisely, you can create a smaller sketch and copies the contents of the source sketch to the smaller sketch.
madoka::Sketch::shrink()
shrink()
is a function to shrink a sketch. The 1st argument specifies the source sketch. The 2nd argument specifies width and the 3rd argument specifies max_value of the new sketch. The 4th argument specifies a filter to be applied in shrinking.This example creates a sketch and shrinks the sketch to a smaller sketch. The new width and the new max_value are 10 and 15 respectively. Note that the values are replaced with their logarithms by logarimize()
. For example, "Akemi: 256" is replaced with "Akemi: 8". Also note that "Homura: 16777216" is replaced with "Homura: 24" but the output is "Homura: 15" because of saturation.
Due to this feature, you can use large width and max_value during sketching, and after that, you can adjust width and max_value to your application. Also, you can compress a sketch as shown in this example, binarize a sketch based on a threshold, etc.
#include <iostream>
#include <madoka.h>
int main(int argc, char *argv[]) {
madoka::Sketch sketch_1, sketch_2;
sketch_1.create();
sketch_2.create();
sketch_1.inc("Tomoe", 5);
sketch_1.inc("Mami", 4);
sketch_2.inc("Mami", 4);
madoka::Sketch sketch;
sketch.copy(sketch_1);
sketch.merge(sketch_2);
std::cout << "Tomoe: "
<< sketch.get("Tomoe", 5) << std::endl;
std::cout << "Mami: "
<< sketch.get("Mami", 4) << std::endl;
return 0;
}
$ g++ merge-sketches.cc -lmadoka
$ ./a.out
Tomoe: 1
Mami: 2
madoka::Sketch
provides an interface to merge sketches. A sketch merging works like vector addition. It simply adds the values of the right-hand side sketch (rhs-sketch) to the values of the left-hand side sketch (lhs-sketch).
madoka::Sketch::merge()
merge()
is a function to merge sketches. The 1st argument specifies the rhs-sketch. The 2nd argument specifies a filter that is applied to the lhs-sketch values. The 3rd argument specifies a filter that is applied to the rhs-sketch values.This example creates two sketches with "Mami: 1" and merges the sketches. So, the resultant sketch has "Mami: 2". Note that this example uses copy()
to keep the lhs-sketch. If you don’t mind overwriting the lhs-sketch, you can call merge()
directly.
The merging feature allows you to draw a sketch in a distributed manner. For example, you can divide an input data stream into several streams and draw sketches for each stream in parallel. Then, you can merge the temporary sketches to get the final sketch.
#include <cmath>
#include <iostream>
#include <madoka.h>
int main(int argc, char *argv[]) {
madoka::Sketch sketch_1, sketch_2;
sketch_1.create();
sketch_2.create();
sketch_1.add("Charlotte", 9, 3);
sketch_1.add("Oktavia", 7, 2);
sketch_2.add("Gretchen", 8, 5);
sketch_2.add("Charlotte", 9, 4);
double length_1, length_2;
double inner_product =
sketch_1.inner_product(sketch_2, &length_1, &length_2);
length_1 = std::sqrt(length_1);
length_2 = std::sqrt(length_2);
std::cout << "inner_product: " << inner_product << std::endl;
std::cout << "length_1: " << length_1 << std::endl;
std::cout << "length_2: " << length_2 << std::endl;
std::cout << "cosine: "
<< (inner_product / length_1 / length_2) << std::endl;
return 0;
}
$ g++ estimate-the-inner-product.cc -lmadoka
$ ./a.out
inner_product: 12
length_1: 3.60555
length_2: 6.40312
cosine: 0.519778
A Count-Min sketch supports inner product estimation. madoka::Sketch
provides an interface to estimate the inner product with the length of sketches, actually length is not defined for sketches. This interface is useful to estimate the cosine similarity.
madoka::Sketch::inner_product()
inner_product()
is a function to estimate inner product. The 1st argument specifies the rhs-sketch. The 2nd argument can be used to get the estimated squared length of the lhs-sketch. The 3rd argument can be used to get the estimated squared length of the rhs-sketch. The return value of inner_product()
is the estimated inner product.This example creates two sketches and estimates the inner product between the sketches. This example also computes the cosine similarity from the obtained values.
namespace madoka {
enum FileFlag {
FILE_CREATE = 1 << 0,
FILE_TRUNCATE = 1 << 1,
FILE_READONLY = 1 << 2,
FILE_WRITABLE = 1 << 3,
FILE_SHARED = 1 << 4,
FILE_PRIVATE = 1 << 5,
FILE_ANONYMOUS = 1 << 6,
FILE_HUGETLB = 1 << 7,
FILE_PRELOAD = 1 << 8
};
} // namespace madoka
create()
, open()
, load()
, save()
, copy()
and shrink()
have arguments named path and flags. The path argument specifies the target file. Note that NULL specifies to create an anonymous memory mapping. The flags argument specifies the behavior as follows.
create()
, copy()
and shrink()
accept FILE_TRUNCATE unless path == NULL.open()
accepts FILE_READONLY.open()
accepts FILE_PRIVATE.create()
, open()
, load()
, save()
, copy()
and shrink()
accept FILE_HUGETLB.open()
accepts FILE_PRELOAD.#include <iostream>
#include <madoka.h>
int main(int argc, char *argv[]) {
madoka::Sketch sketch;
sketch.create(0, 0, NULL, madoka::FILE_HUGETLB);
if (sketch.flags() & madoka::FILE_HUGETLB) {
std::cout << "HugeTLB: on" << std::endl;
} else {
std::cout << "HugeTLB: off" << std::endl;
}
return 0;
}
$ g++ use-huge-pages.cc -lmadoka
$ grep HugePages_Free /proc/meminfo
HugePages_Free: 0
$ ./a.out
HugeTLB: off
$ sudo sysctl -w vm.nr_hugepages=512
vm.nr_hugepages = 512
$ grep HugePages_Free /proc/meminfo
HugePages_Free: 512
$ ./a.out
HugeTLB: on
Roughly speaking, a Count-Min sketch is composed of hash tables and the number of hash tables is equal to its depth. The depth of madoka::Sketch
is fixed to 3 and thus basic operations, set()
, inc()
and add()
, except get()
, perform at least 3 random accesses.
Because of the random access nature, if a sketch is larger than the CPU cache, cache misses occur in sketching and the memory access latency becomes the bottleneck. In addition, if a sketch is much, much, much larger than the CPU cache, TLB misses reduce the throughput of sketching.
madoka::FILE_HUGETLB is an optional flag to enable the use of huge pages. The use of huge pages reduces TLB misses in sketching. If a sketch is created with madoka::FILE_HUGETLB, the sketch tries to use huge pages. If huge pages are not available, the sketch uses regular pages.
This example shows how to use huge pages. You can check whether huge pages are available or not by reading /proc/meminfo. If disabled, you can enable huge pages by editing /proc/sys/vm/nr_hugepages. Note that only a user with root authority can enable huge pages. After that, madoka::Sketch
can use huge pages. Remember that a sketch backed by a file does not support huge pages unless the file system supports huge pages. For more details, see information about huge page support.
madoka::Sketch
provides an interface to specify a seed, which is used to calculate hash values and to initialize a random number generator. The 5th argument of create()
specifies the seed. However, basically there is no need to specify a user-defined seed. Note that you cannot merge sketches having different seeds.
madoka::Sketch::width()
width()
returns width of the sketch.madoka::Sketch::width_mask()
width_mask()
returns width - 1 if width is a power of 2, otherwise returns 0.madoka::Sketch::depth()
depth()
returns madoka::SKETCH_DEPTH.madoka::Sketch::max_value()
max_value()
returns max_value of the sketch.madoka::Sketch::value_mask()
value_mask()
returns max_value.madoka::Sketch::value_size()
value_size()
returns log2(max_value + 1).madoka::Sketch::seed()
seed()
returns seed of the sketch.madoka::Sketch::table_size()
table_size()
returns the size in bytes of the sketch.madoka::Sketch::file_size()
file_size()
returns the file size in bytes of the sketch. The file size is equal to the sum of the header size and the sketch size.madoka::Sketch::flags()
flags()
returns memory mapping related flags.madoka::Sketch::mode()
mode()
returns madoka::SKETCH_APPROX_MODE if value_size == madoka::SKETCH_APPROX_VALUE_SIZE, otherwise returns madoka::SKETCH_EXACT_MODE.#include <iostream>
#include <madoka.h>
int main(int argc, char *argv[]) {
madoka::Sketch sketch;
try {
sketch.create(madoka::SKETCH_MAX_WIDTH + 1);
} catch (const madoka::Exception &ex) {
std::cerr << "error: "
<< ex.what() << std::endl;
return -1;
}
return 0;
}
$ g++ catch-and-exception.cc -lmadoka
$ ./a.out
error: madoka/sketch.cc:453: width > SKETCH_MAX_WIDTH
Madoka throws an exception when an error occurs. The exception class is madoka::Exception
. This example shows how to catch an exception.
madoka::Exception::what()
what()
is a function to get an error message. The format is __FILE__
« ":" « __LINE__
« ": " « the-reason-of-the-error.In this example, create()
fails to create a sketch because the specifed width is too large. Remember that madoka::Sketch
is guaranteed to be unchanged when an error has occurred.
#include <iostream>
#include <madoka.h>
int main(int argc, char *argv[]) {
madoka::Croquis<float> croquis;
croquis.create();
croquis.set("Madoka", 6, 1.25);
croquis.set("Hiroshi", 7, 2.5);
croquis.add("Madoka", 6, 0.5);
std::cout << "Madoka: "
<< croquis.get("Madoka", 6) << std::endl;
std::cout << "Hiroshi: "
<< croquis.get("Hiroshi", 7) << std::endl;
return 0;
}
$ g++ draw-a-croquis.cc -lmadoka
$ ./a.out
Madoka: 1.75
Hiroshi: 2.5
madoka::Croquis
is a simplified version of madoka::Sketch
. It does not provide inc()
, copy()
, filter()
, shrink()
, merge()
and inner_product()
. Instead, madoka::Croquis
has a template parameter that specifies the type of cells.
This example uses floating point numbers with madoka::Croquis
madoka::Sketch
, but remember that a tiny value may be truncated in add()
because of the rounding.