C++ API Documentation

Draw a sketch

#include <iostream>
#include <string>

#include <madoka.h>

int main(int argc, char *argv[]) {
  madoka::Sketch sketch;
  sketch.create();

  std::string key;
  while (std::getline(std::cin, key)) {
    sketch.inc(key.c_str(), key.length());
  }
  sketch.save(argv[1]);
  return 0;
}

$ g++ draw-a-sketch.cc -lmadoka
$ ./a.out SKETCH < KEYSET

Let’s try to draw a sketch. This example reads a keyset from standard input (std::cin) and draws a sketch (madoka::Sketch). Then, this example saves the sketch to a file specified by the 1st command line argument (argv[1]). The following are the points of this example.

madoka.h
- #include <madoka.h> is needed to use Madoka. Data types, constants and classes are defined in madoka.h.
madoka::Sketch
- madoka::Sketch represents a sketch and provides functions for sketching.
madoka::Sketch::create()
- create() is a function to make a sketch. You must not start sketching before making a sketch.
madoka::Sketch::inc()
- inc() is a function to increment a value associated with a key. The 1st argument specifies the starting address and the 2nd argument specifies the length in bytes of the key. Yeah, this is sketching.
- Note that inc() does nothing if the current value is saturated.
madoka::Sketch::save()
- save() is a function to save a sketch. The 1st argument specifies the path of the file.

Note that an option -lmadoka is needed to build this example. If you have installed Madoka, pkg-config madoka --libs is available to get the required options.

Look at a sketch

#include <iostream>
#include <string>

#include <madoka.h>

int main(int argc, char *argv[]) {
  madoka::Sketch sketch;
  sketch.load(argv[1]);
  // sketch.open(argv[1]);

  std::string key;
  while (std::getline(std::cin, key)) {
    std::cout << key << ": "
              << sketch.get(key.c_str(), key.length()) << std::endl;
  }
  return 0;
}

$ g++ look-at-a-sketch.cc -lmadoka
$ ./a.out SKETCH < KEYSET

Next, let’s try to look at a sketch. This example loads a sketch from a file specified by the 1st command line argument (argv[1]). Then, this example looks up keys read from standard input (std::cin). The following are the points of this example.

madoka::Sketch::load()
- load() is a function to load a sketch from a file. The 1st argument specifies the path of the file.
- Note that open() is also available to access a sketch file when you don’t want to load the whole file. open() uses memory mapped I/O instead of reading the whole sketch into memory.
madoka::Sketch::get()
- get() is a function to get the value associated with a key. The 1st argument specifies the starting address and the 2nd argument specifies the length in bytes of the key.

Use other brushes

#include <iostream>

#include <madoka.h>

int main(int argc, char *argv[]) {
  madoka::Sketch sketch;
  sketch.create();

  sketch.set("QB", 2, 10);
  std::cout << "QB: "
            << sketch.add("QB", 2, 5) << std::endl;
  sketch.set("QB", 2, 7);
  std::cout << "QB: "
            << sketch.get("QB", 2) << std::endl;
  return 0;
}

$ g++ use-other-brushes.cc -lmadoka
$ ./a.out
QB: 15
QB: 15

madoka::Sketch provides other drawing functions named set() and add(). This example shows how these functions work.

madoka::Sketch::set()
- set() is a function to update a value associated with a key. The 1st argument specifies the starting address and the 2nd argument specifies the length in bytes of the key. The 3rd argument specifies the value.
- Note that set() does nothing when the specified value is not greater than the current associated value.
- Also note that the new value is saturated when the specified value is greater than the upper limit.
madoka::Sketch::add()
- add() is a function to perform an addition. The 1st argument specifies the starting address and the 2nd argument specifies the length in bytes of the key. The 3rd argument specifies the value to be added. The return value of add() is the result of the addition.
- Note that the result is saturated when the sum is greater than the upper limit.

In this example, the 1st set() changes the value associated with "QB" from 0 to 10. Then, add() adds 5 to that value (from 10 to 15). After that, the 2nd set() does nothing because the specified value (7) is less than the current value (15).

Customize a sketch

#include <iostream>

#include <madoka.h>

int main(int argc, char *argv[]) {
  const madoka::UInt64 MY_WIDTH = 1ULL << 24;
  const madoka::UInt64 MY_MAX_VALUE = 10;

  madoka::Sketch sketch;
  sketch.create(MY_WIDTH, MY_MAX_VALUE);

  std::cout << "width: " << sketch.width() << std::endl;
  std::cout << "max_value: " << sketch.max_value() << std::endl;
  std::cout << "size: " << sketch.file_size() << std::endl;
  return 0;
}

$ g++ customize-a-sketch.cc -lmadoka
$ ./a.out
width: 16777216
max_value: 15
size: 25165904

Let’s customize a sketch for your application. A Count-Min sketch is a probabilistic data structure and the accuracy depends on its parameters, width and depth, and the target data stream.

Basically, accurate sketching requires a large width and a longer stream requires a larger width, but a larger width requires a larger memory space and increases cache misses. The other parameter, depth, also has an effect on accuracy but Madoka uses a fixed depth, actually madoka::SKETCH_DEPTH (3), based on benchmarks. Instead, Madoka has another parameter, max_value, that specifies the upper limit of the values. By using a small max_value, you can save memory.

To customize a sketch, specify width and max_value when creating a sketch. See the following for more details.

width
- 0 is replaced with the default value madoka::SKETCH_DEFAULT_WIDTH (2²⁰).
- width must not be greater than madoka::SKETCH_MAX_WIDTH (2⁴²).
- Note that you can quickly draw and look at a sketch if the width of the sketch is a power of 2.
max_value
- 0 is replaced with the default value madoka::SKETCH_DEFAULT_MAX_VALUE (2⁴⁵ - 1).
- max_value must not be greater than madoka::SKETCH_MAX_MAX_VALUE (2⁴⁵ - 1).
- create() rounds out a given max_value to 1, 3, 15, 255, 65535 or 2⁴⁵ - 1.

This example creates a customized sketch and prints the size. Note that the size in bytes of a sketch is approximately width x 8 if max_value == 2⁴⁵ - 1, or otherwise width x depth x log₂(max_value + 1) / 8.

Clear a sketch

#include <iostream>

#include <madoka.h>

int main(int argc, char *argv[]) {
  madoka::Sketch sketch;
  sketch.create();

  sketch.set("Sayaka", 6, 100);
  sketch.clear();
  std::cout << "Sayaka: "
            << sketch.get("Sayaka", 6) << std::endl;
  return 0;
}

$ g++ clear-a-sketch.cc -lmadoka
$ ./a.out
Sayaka: 0

madoka::Sketch provides an interface to clear a sketch. This example creates a sketch and updates the value associated with "Sayaka" from 0 to 100, but clear() fills the sketch with 0s. As a result, get() returns 0 for "Sayaka".

madoka::Sketch::clear()
- clear() is a function to clear a sketch, or more precisely, clear() fills a sketch with 0s.
- Note that width and max_value of the sketch remain unchanged.

Copy a sketch

#include <iostream>

#include <madoka.h>

int main(int argc, char *argv[]) {
  madoka::Sketch sketch;
  sketch.create();

  sketch.set("Kyoko", 5, 100);

  madoka::Sketch snapshot;
  snapshot.copy(sketch);

  sketch.add("Kyoko", 5, 50);

  std::cout << "Kyoko (original): "
            << sketch.get("Kyoko", 5) << std::endl;
  std::cout << "Kyoko (snapshot): "
            << snapshot.get("Kyoko", 5) << std::endl;
  return 0;
}

$ g++ copy-a-sketch.cc -lmadoka
$ ./a.out
Kyoko (original): 150
Kyoko (snapshot): 100

madoka::Sketch provides an interface to copy a sketch. This example draws a sketch and creates its copy as a snapshot. Then, this example updates the original sketch.

madoka::Sketch::copy()
- copy() is a function to create a copy of a sketch. The 1st argument specifies the source sketch.

As shown in this example, copy() is useful to create a snapshot in memory. If you want to save a snapshot as a file, save() is an easier choice.

Apply a filter

#include <iostream>

#include <madoka.h>

madoka::UInt64 divide_by_2(madoka::UInt64 value) {
  return value / 2;
}

int main(int argc, char *argv[]) {
  madoka::Sketch sketch;
  sketch.create();

  sketch.set("Kaname", 6, 8);
  sketch.set("Madoka", 6, 15);

  sketch.filter(divide_by_2);
  // sketch.filter([](madoka::UInt64 value) { return value / 2; });

  std::cout << "Kaname: "
            << sketch.get("Kaname", 6) << std::endl;
  std::cout << "Madoka: "
            << sketch.get("Madoka", 6) << std::endl;
  return 0;
}

$ g++ apply-a-filter.cc -lmadoka
$ ./a.out
Kaname: 4
Madoka: 7

madoka::Sketch provides a filter feature which can be used to reduce errors and to simulate decays. This example uses a filter for dividing all the values in a sketch by 2.

madoka::Sketch::filter()
- filter() is a function to apply a filter to a sketch, or more precisely, filter() applies a filter to all the values in a sketch. The 1st argument specifies the filter.
- filter() accepts a pointer to a function that takes the current value (madoka::UInt64) and returns the filtered value (madoka::UInt64).
- A function object is also acceptable if the type of its argument and return value is madoka::UInt64.
- Note that if the filter returns a value greater than max_value, that value is replaced with max_value.
- filter() does nothing if the argument is NULL.

For example, by using this feature, you can implement a variety of lossy conservative updates. Also, you can divide all the values by 2 when one of the values reaches max_value. In addition, you can replace all the values with their binary logarithms for compression, etc.

Shrink a sketch

#include <iostream>

#include <madoka.h>

madoka::UInt64 logarithmize(madoka::UInt64 value) {
  return 63 - ::__builtin_clzll(value | 1);
}

int main(int argc, char *argv[]) {
  madoka::Sketch sketch;
  sketch.create(100);

  sketch.set("Akemi", 5, 256);
  sketch.set("Homura", 6, 16777216);

  madoka::Sketch new_sketch;
  new_sketch.shrink(sketch, 10, 15, logarithmize);

  std::cout << "width: "
            << new_sketch.width() << std::endl;
  std::cout << "max_value: "
            << new_sketch.max_value() << std::endl;

  std::cout << "Akemi: "
            << new_sketch.get("Akemi", 5) << std::endl;
  std::cout << "Homura: "
            << new_sketch.get("Homura", 6) << std::endl;
  return 0;
}

$ g++ shrink-a-sketch.cc -lmadoka
$ ./a.out
width: 10
max_value: 15
Akemi: 8
Homura: 15

Let’s shrink a sketch for saving memory. This is a reasonable answer to the question “How can I determine the best values for width and max_value?”.

During or after sketching, you may find that width and max_value are too large. In such a case, you can shrink the sketch, or more precisely, you can create a smaller sketch and copies the contents of the source sketch to the smaller sketch.

madoka::Sketch::shrink()
- shrink() is a function to shrink a sketch. The 1st argument specifies the source sketch. The 2nd argument specifies width and the 3rd argument specifies max_value of the new sketch. The 4th argument specifies a filter to be applied in shrinking.
- Note that the new width must be a factor of the source width. On the other hand, a max_value greater than the source max_value is acceptable.

This example creates a sketch and shrinks the sketch to a smaller sketch. The new width and the new max_value are 10 and 15 respectively. Note that the values are replaced with their logarithms by logarimize(). For example, "Akemi: 256" is replaced with "Akemi: 8". Also note that "Homura: 16777216" is replaced with "Homura: 24" but the output is "Homura: 15" because of saturation.

Due to this feature, you can use large width and max_value during sketching, and after that, you can adjust width and max_value to your application. Also, you can compress a sketch as shown in this example, binarize a sketch based on a threshold, etc.

Merge sketches

#include <iostream>

#include <madoka.h>

int main(int argc, char *argv[]) {
  madoka::Sketch sketch_1, sketch_2;
  sketch_1.create();
  sketch_2.create();

  sketch_1.inc("Tomoe", 5);
  sketch_1.inc("Mami", 4);
  sketch_2.inc("Mami", 4);

  madoka::Sketch sketch;
  sketch.copy(sketch_1);
  sketch.merge(sketch_2);

  std::cout << "Tomoe: "
            << sketch.get("Tomoe", 5) << std::endl;
  std::cout << "Mami: "
            << sketch.get("Mami", 4) << std::endl;
  return 0;
}

$ g++ merge-sketches.cc -lmadoka
$ ./a.out
Tomoe: 1
Mami: 2

madoka::Sketch provides an interface to merge sketches. A sketch merging works like vector addition. It simply adds the values of the right-hand side sketch (rhs-sketch) to the values of the left-hand side sketch (lhs-sketch).

madoka::Sketch::merge()
- merge() is a function to merge sketches. The 1st argument specifies the rhs-sketch. The 2nd argument specifies a filter that is applied to the lhs-sketch values. The 3rd argument specifies a filter that is applied to the rhs-sketch values.
- Note that the sketches must have the same width.

This example creates two sketches with "Mami: 1" and merges the sketches. So, the resultant sketch has "Mami: 2". Note that this example uses copy() to keep the lhs-sketch. If you don’t mind overwriting the lhs-sketch, you can call merge() directly.

The merging feature allows you to draw a sketch in a distributed manner. For example, you can divide an input data stream into several streams and draw sketches for each stream in parallel. Then, you can merge the temporary sketches to get the final sketch.

Estimate the inner product

#include <cmath>
#include <iostream>

#include <madoka.h>

int main(int argc, char *argv[]) {
  madoka::Sketch sketch_1, sketch_2;
  sketch_1.create();
  sketch_2.create();

  sketch_1.add("Charlotte", 9, 3);
  sketch_1.add("Oktavia", 7, 2);
  sketch_2.add("Gretchen", 8, 5);
  sketch_2.add("Charlotte", 9, 4);

  double length_1, length_2;
  double inner_product =
      sketch_1.inner_product(sketch_2, &length_1, &length_2);
  length_1 = std::sqrt(length_1);
  length_2 = std::sqrt(length_2);

  std::cout << "inner_product: " << inner_product << std::endl;
  std::cout << "length_1: " << length_1 << std::endl;
  std::cout << "length_2: " << length_2 << std::endl;
  std::cout << "cosine: "
            << (inner_product / length_1 / length_2) << std::endl;
  return 0;
}

$ g++ estimate-the-inner-product.cc -lmadoka
$ ./a.out
inner_product: 12
length_1: 3.60555
length_2: 6.40312
cosine: 0.519778

A Count-Min sketch supports inner product estimation. madoka::Sketch provides an interface to estimate the inner product with the length of sketches, actually length is not defined for sketches. This interface is useful to estimate the cosine similarity.

madoka::Sketch::inner_product()
- inner_product() is a function to estimate inner product. The 1st argument specifies the rhs-sketch. The 2nd argument can be used to get the estimated squared length of the lhs-sketch. The 3rd argument can be used to get the estimated squared length of the rhs-sketch. The return value of inner_product() is the estimated inner product.
- Note that the sketches must have the same width.

This example creates two sketches and estimates the inner product between the sketches. This example also computes the cosine similarity from the obtained values.

Configure memory mapping

namespace madoka {

enum FileFlag {
  FILE_CREATE    = 1 << 0,
  FILE_TRUNCATE  = 1 << 1,
  FILE_READONLY  = 1 << 2,
  FILE_WRITABLE  = 1 << 3,
  FILE_SHARED    = 1 << 4,
  FILE_PRIVATE   = 1 << 5,
  FILE_ANONYMOUS = 1 << 6,
  FILE_HUGETLB   = 1 << 7,
  FILE_PRELOAD   = 1 << 8
};

}  // namespace madoka

create(), open(), load(), save(), copy() and shrink() have arguments named path and flags. The path argument specifies the target file. Note that NULL specifies to create an anonymous memory mapping. The flags argument specifies the behavior as follows.

madoka::FILE_CREATE
- FILE_CREATE is used internally to create a file.
madoka::FILE_TRUNCATE
- create(), copy() and shrink() accept FILE_TRUNCATE unless path == NULL.
- FILE_TRUNCATE is used to overwrite an existing file. Note that a creation without FILE_TRUNCATE to an existing file throws an exception.
madoka::FILE_READONLY
- open() accepts FILE_READONLY.
- FILE_READONLY is used to disable modifications. Note that a modification to a read-only sketch will cause a segmentation fault.
madoka::FILE_WRITABLE
- FILE_WRITABLE is used internally to enable modifications.
madoka::FILE_SHARED
- FILE_SHARED is used internally to make modifications visible to other processes.
madoka::FILE_PRIVATE
- open() accepts FILE_PRIVATE.
- FILE_PRIVATE is used to make modifications invisible to other processes.
madoka::FILE_ANONYMOUS
- FILE_ANONYMOUS is used internally to map memory not backed by a file.
madoka::FILE_HUGETLB
- create(), open(), load(), save(), copy() and shrink() accept FILE_HUGETLB.
- FILE_HUGETLB is used to enable huge pages. See the next section for details.
madoka::FILE_PRELOAD
- open() accepts FILE_PRELOAD.
- FILE_PRELOAD is used to preload the entire file after mapping a file. Preloading is useful to avoid random disk access.

Use huge pages

#include <iostream>

#include <madoka.h>

int main(int argc, char *argv[]) {
  madoka::Sketch sketch;
  sketch.create(0, 0, NULL, madoka::FILE_HUGETLB);

  if (sketch.flags() & madoka::FILE_HUGETLB) {
    std::cout << "HugeTLB: on" << std::endl;
  } else {
    std::cout << "HugeTLB: off" << std::endl;
  }
  return 0;
}

$ g++ use-huge-pages.cc -lmadoka
$ grep HugePages_Free /proc/meminfo
HugePages_Free:        0
$ ./a.out
HugeTLB: off
$ sudo sysctl -w vm.nr_hugepages=512
vm.nr_hugepages = 512
$ grep HugePages_Free /proc/meminfo
HugePages_Free:      512
$ ./a.out
HugeTLB: on

Roughly speaking, a Count-Min sketch is composed of hash tables and the number of hash tables is equal to its depth. The depth of madoka::Sketch is fixed to 3 and thus basic operations, set(), inc() and add(), except get(), perform at least 3 random accesses.

Because of the random access nature, if a sketch is larger than the CPU cache, cache misses occur in sketching and the memory access latency becomes the bottleneck. In addition, if a sketch is much, much, much larger than the CPU cache, TLB misses reduce the throughput of sketching.

madoka::FILE_HUGETLB is an optional flag to enable the use of huge pages. The use of huge pages reduces TLB misses in sketching. If a sketch is created with madoka::FILE_HUGETLB, the sketch tries to use huge pages. If huge pages are not available, the sketch uses regular pages.

This example shows how to use huge pages. You can check whether huge pages are available or not by reading /proc/meminfo. If disabled, you can enable huge pages by editing /proc/sys/vm/nr_hugepages. Note that only a user with root authority can enable huge pages. After that, madoka::Sketch can use huge pages. Remember that a sketch backed by a file does not support huge pages unless the file system supports huge pages. For more details, see information about huge page support.

Specify a seed

madoka::Sketch provides an interface to specify a seed, which is used to calculate hash values and to initialize a random number generator. The 5th argument of create() specifies the seed. However, basically there is no need to specify a user-defined seed. Note that you cannot merge sketches having different seeds.

Get information of a sketch

madoka::Sketch::width()
- width() returns width of the sketch.
madoka::Sketch::width_mask()
- width_mask() returns width - 1 if width is a power of 2, otherwise returns 0.
madoka::Sketch::depth()
- depth() returns madoka::SKETCH_DEPTH.
madoka::Sketch::max_value()
- max_value() returns max_value of the sketch.
madoka::Sketch::value_mask()
- value_mask() returns max_value.
madoka::Sketch::value_size()
- value_size() returns log₂(max_value + 1).
madoka::Sketch::seed()
- seed() returns seed of the sketch.
madoka::Sketch::table_size()
- table_size() returns the size in bytes of the sketch.
madoka::Sketch::file_size()
- file_size() returns the file size in bytes of the sketch. The file size is equal to the sum of the header size and the sketch size.
madoka::Sketch::flags()
- flags() returns memory mapping related flags.
madoka::Sketch::mode()
- mode() returns madoka::SKETCH_APPROX_MODE if value_size == madoka::SKETCH_APPROX_VALUE_SIZE, otherwise returns madoka::SKETCH_EXACT_MODE.

Catch an exception

#include <iostream>

#include <madoka.h>

int main(int argc, char *argv[]) {
  madoka::Sketch sketch;
  try {
    sketch.create(madoka::SKETCH_MAX_WIDTH + 1);
  } catch (const madoka::Exception &ex) {
    std::cerr << "error: "
              << ex.what() << std::endl;
    return -1;
  }
  return 0;
}

$ g++ catch-and-exception.cc -lmadoka
$ ./a.out
error: madoka/sketch.cc:453: width > SKETCH_MAX_WIDTH

Madoka throws an exception when an error occurs. The exception class is madoka::Exception. This example shows how to catch an exception.

madoka::Exception::what()
- what() is a function to get an error message. The format is __FILE__ « ":" « __LINE__ « ": " « the-reason-of-the-error.
- Note that exception specifiers in madoka/sketch.h tell you which function is possible to throw an exception.

In this example, create() fails to create a sketch because the specifed width is too large. Remember that madoka::Sketch is guaranteed to be unchanged when an error has occurred.

Draw a croquis

#include <iostream>

#include <madoka.h>

int main(int argc, char *argv[]) {
  madoka::Croquis<float> croquis;
  croquis.create();

  croquis.set("Madoka", 6, 1.25);
  croquis.set("Hiroshi", 7, 2.5);
  croquis.add("Madoka", 6, 0.5);

  std::cout << "Madoka: "
            << croquis.get("Madoka", 6) << std::endl;
  std::cout << "Hiroshi: "
            << croquis.get("Hiroshi", 7) << std::endl;
  return 0;
}

$ g++ draw-a-croquis.cc -lmadoka
$ ./a.out
Madoka: 1.75
Hiroshi: 2.5

madoka::Croquis is a simplified version of madoka::Sketch. It does not provide inc(), copy(), filter(), shrink(), merge() and inner_product(). Instead, madoka::Croquis has a template parameter that specifies the type of cells.

This example uses floating point numbers with madoka::Croquis<float>. The usage is same as that of madoka::Sketch, but remember that a tiny value may be truncated in add() because of the rounding.