C++ API Documentation

Draw a sketch

#include <iostream>
#include <string>

#include <madoka.h>

int main(int argc, char *argv[]) {
  madoka::Sketch sketch;
  sketch.create();

  std::string key;
  while (std::getline(std::cin, key)) {
    sketch.inc(key.c_str(), key.length());
  }
  sketch.save(argv[1]);
  return 0;
}
$ g++ draw-a-sketch.cc -lmadoka
$ ./a.out SKETCH < KEYSET

Let’s try to draw a sketch. This example reads a keyset from standard input (std::cin) and draws a sketch (madoka::Sketch). Then, this example saves the sketch to a file specified by the 1st command line argument (argv[1]). The following are the points of this example.

Note that an option -lmadoka is needed to build this example. If you have installed Madoka, pkg-config madoka --libs is available to get the required options.

Look at a sketch

#include <iostream>
#include <string>

#include <madoka.h>

int main(int argc, char *argv[]) {
  madoka::Sketch sketch;
  sketch.load(argv[1]);
  // sketch.open(argv[1]);

  std::string key;
  while (std::getline(std::cin, key)) {
    std::cout << key << ": "
              << sketch.get(key.c_str(), key.length()) << std::endl;
  }
  return 0;
}
$ g++ look-at-a-sketch.cc -lmadoka
$ ./a.out SKETCH < KEYSET

Next, let’s try to look at a sketch. This example loads a sketch from a file specified by the 1st command line argument (argv[1]). Then, this example looks up keys read from standard input (std::cin). The following are the points of this example.

h3. Use other brushes

#include <iostream>

#include <madoka.h>

int main(int argc, char *argv[]) {
  madoka::Sketch sketch;
  sketch.create();

  sketch.set("QB", 2, 10);
  std::cout << "QB: "
            << sketch.add("QB", 2, 5) << std::endl;
  sketch.set("QB", 2, 7);
  std::cout << "QB: "
            << sketch.get("QB", 2) << std::endl;
  return 0;
}
$ g++ use-other-brushes.cc -lmadoka
$ ./a.out
QB: 15
QB: 15

madoka::Sketch provides other drawing functions named set() and add(). This example shows how these functions work.

In this example, the 1st set() changes the value associated with "QB" from 0 to 10. Then, add() adds 5 to that value (from 10 to 15). After that, the 2nd set() does nothing because the specified value (7) is less than the current value (15).

Customize a sketch

#include <iostream>

#include <madoka.h>

int main(int argc, char *argv[]) {
  const madoka::UInt64 MY_WIDTH = 1ULL << 24;
  const madoka::UInt64 MY_MAX_VALUE = 10;

  madoka::Sketch sketch;
  sketch.create(MY_WIDTH, MY_MAX_VALUE);

  std::cout << "width: " << sketch.width() << std::endl;
  std::cout << "max_value: " << sketch.max_value() << std::endl;
  std::cout << "size: " << sketch.file_size() << std::endl;
  return 0;
}
$ g++ customize-a-sketch.cc -lmadoka
$ ./a.out
width: 16777216
max_value: 15
size: 25165904

Let’s customize a sketch for your application. A Count-Min sketch is a probabilistic data structure and the accuracy depends on its parameters, width and depth, and the target data stream.

Basically, accurate sketching requires a large width and a longer stream requires a larger width, but a larger width requires a larger memory space and increases cache misses. The other parameter, depth, also has an effect on accuracy but Madoka uses a fixed depth, actually madoka::SKETCH_DEPTH (3), based on benchmarks. Instead, Madoka has another parameter, max_value, that specifies the upper limit of the values. By using a small max_value, you can save memory.

To customize a sketch, specify width and max_value when creating a sketch. See the following for more details.

This example creates a customized sketch and prints the size. Note that the size in bytes of a sketch is approximately width x 8 if max_value == 245 - 1, or otherwise width x depth x log2(max_value + 1) / 8.

Clear a sketch

#include <iostream>

#include <madoka.h>

int main(int argc, char *argv[]) {
  madoka::Sketch sketch;
  sketch.create();

  sketch.set("Sayaka", 6, 100);
  sketch.clear();
  std::cout << "Sayaka: "
            << sketch.get("Sayaka", 6) << std::endl;
  return 0;
}
$ g++ clear-a-sketch.cc -lmadoka
$ ./a.out
Sayaka: 0

madoka::Sketch provides an interface to clear a sketch. This example creates a sketch and updates the value associated with "Sayaka" from 0 to 100, but clear() fills the sketch with 0s. As a result, get() returns 0 for "Sayaka".

Copy a sketch

#include <iostream>

#include <madoka.h>

int main(int argc, char *argv[]) {
  madoka::Sketch sketch;
  sketch.create();

  sketch.set("Kyoko", 5, 100);

  madoka::Sketch snapshot;
  snapshot.copy(sketch);

  sketch.add("Kyoko", 5, 50);

  std::cout << "Kyoko (original): "
            << sketch.get("Kyoko", 5) << std::endl;
  std::cout << "Kyoko (snapshot): "
            << snapshot.get("Kyoko", 5) << std::endl;
  return 0;
}
$ g++ copy-a-sketch.cc -lmadoka
$ ./a.out
Kyoko (original): 150
Kyoko (snapshot): 100

madoka::Sketch provides an interface to copy a sketch. This example draws a sketch and creates its copy as a snapshot. Then, this example updates the original sketch.

As shown in this example, copy() is useful to create a snapshot in memory. If you want to save a snapshot as a file, save() is an easier choice.

Apply a filter

#include <iostream>

#include <madoka.h>

madoka::UInt64 divide_by_2(madoka::UInt64 value) {
  return value / 2;
}

int main(int argc, char *argv[]) {
  madoka::Sketch sketch;
  sketch.create();

  sketch.set("Kaname", 6, 8);
  sketch.set("Madoka", 6, 15);

  sketch.filter(divide_by_2);
  // sketch.filter([](madoka::UInt64 value) { return value / 2; });

  std::cout << "Kaname: "
            << sketch.get("Kaname", 6) << std::endl;
  std::cout << "Madoka: "
            << sketch.get("Madoka", 6) << std::endl;
  return 0;
}
$ g++ apply-a-filter.cc -lmadoka
$ ./a.out
Kaname: 4
Madoka: 7

madoka::Sketch provides a filter feature which can be used to reduce errors and to simulate decays. This example uses a filter for dividing all the values in a sketch by 2.

For example, by using this feature, you can implement a variety of lossy conservative updates. Also, you can divide all the values by 2 when one of the values reaches max_value. In addition, you can replace all the values with their binary logarithms for compression, etc.

Shrink a sketch

#include <iostream>

#include <madoka.h>

madoka::UInt64 logarithmize(madoka::UInt64 value) {
  return 63 - ::__builtin_clzll(value | 1);
}

int main(int argc, char *argv[]) {
  madoka::Sketch sketch;
  sketch.create(100);

  sketch.set("Akemi", 5, 256);
  sketch.set("Homura", 6, 16777216);

  madoka::Sketch new_sketch;
  new_sketch.shrink(sketch, 10, 15, logarithmize);

  std::cout << "width: "
            << new_sketch.width() << std::endl;
  std::cout << "max_value: "
            << new_sketch.max_value() << std::endl;

  std::cout << "Akemi: "
            << new_sketch.get("Akemi", 5) << std::endl;
  std::cout << "Homura: "
            << new_sketch.get("Homura", 6) << std::endl;
  return 0;
}
$ g++ shrink-a-sketch.cc -lmadoka
$ ./a.out
width: 10
max_value: 15
Akemi: 8
Homura: 15

Let’s shrink a sketch for saving memory. This is a reasonable answer to the question “How can I determine the best values for width and max_value?”.

During or after sketching, you may find that width and max_value are too large. In such a case, you can shrink the sketch, or more precisely, you can create a smaller sketch and copies the contents of the source sketch to the smaller sketch.

This example creates a sketch and shrinks the sketch to a smaller sketch. The new width and the new max_value are 10 and 15 respectively. Note that the values are replaced with their logarithms by logarimize(). For example, "Akemi: 256" is replaced with "Akemi: 8". Also note that "Homura: 16777216" is replaced with "Homura: 24" but the output is "Homura: 15" because of saturation.

Due to this feature, you can use large width and max_value during sketching, and after that, you can adjust width and max_value to your application. Also, you can compress a sketch as shown in this example, binarize a sketch based on a threshold, etc.

Merge sketches

#include <iostream>

#include <madoka.h>

int main(int argc, char *argv[]) {
  madoka::Sketch sketch_1, sketch_2;
  sketch_1.create();
  sketch_2.create();

  sketch_1.inc("Tomoe", 5);
  sketch_1.inc("Mami", 4);
  sketch_2.inc("Mami", 4);

  madoka::Sketch sketch;
  sketch.copy(sketch_1);
  sketch.merge(sketch_2);

  std::cout << "Tomoe: "
            << sketch.get("Tomoe", 5) << std::endl;
  std::cout << "Mami: "
            << sketch.get("Mami", 4) << std::endl;
  return 0;
}
$ g++ merge-sketches.cc -lmadoka
$ ./a.out
Tomoe: 1
Mami: 2

madoka::Sketch provides an interface to merge sketches. A sketch merging works like vector addition. It simply adds the values of the right-hand side sketch (rhs-sketch) to the values of the left-hand side sketch (lhs-sketch).

This example creates two sketches with "Mami: 1" and merges the sketches. So, the resultant sketch has "Mami: 2". Note that this example uses copy() to keep the lhs-sketch. If you don’t mind overwriting the lhs-sketch, you can call merge() directly.

The merging feature allows you to draw a sketch in a distributed manner. For example, you can divide an input data stream into several streams and draw sketches for each stream in parallel. Then, you can merge the temporary sketches to get the final sketch.

Estimate the inner product

#include <cmath>
#include <iostream>

#include <madoka.h>

int main(int argc, char *argv[]) {
  madoka::Sketch sketch_1, sketch_2;
  sketch_1.create();
  sketch_2.create();

  sketch_1.add("Charlotte", 9, 3);
  sketch_1.add("Oktavia", 7, 2);
  sketch_2.add("Gretchen", 8, 5);
  sketch_2.add("Charlotte", 9, 4);

  double length_1, length_2;
  double inner_product =
      sketch_1.inner_product(sketch_2, &length_1, &length_2);
  length_1 = std::sqrt(length_1);
  length_2 = std::sqrt(length_2);

  std::cout << "inner_product: " << inner_product << std::endl;
  std::cout << "length_1: " << length_1 << std::endl;
  std::cout << "length_2: " << length_2 << std::endl;
  std::cout << "cosine: "
            << (inner_product / length_1 / length_2) << std::endl;
  return 0;
}
$ g++ estimate-the-inner-product.cc -lmadoka
$ ./a.out
inner_product: 12
length_1: 3.60555
length_2: 6.40312
cosine: 0.519778

A Count-Min sketch supports inner product estimation. madoka::Sketch provides an interface to estimate the inner product with the length of sketches, actually length is not defined for sketches. This interface is useful to estimate the cosine similarity.

This example creates two sketches and estimates the inner product between the sketches. This example also computes the cosine similarity from the obtained values.

Configure memory mapping

namespace madoka {

enum FileFlag {
  FILE_CREATE    = 1 << 0,
  FILE_TRUNCATE  = 1 << 1,
  FILE_READONLY  = 1 << 2,
  FILE_WRITABLE  = 1 << 3,
  FILE_SHARED    = 1 << 4,
  FILE_PRIVATE   = 1 << 5,
  FILE_ANONYMOUS = 1 << 6,
  FILE_HUGETLB   = 1 << 7,
  FILE_PRELOAD   = 1 << 8
};

}  // namespace madoka

create(), open(), load(), save(), copy() and shrink() have arguments named path and flags. The path argument specifies the target file. Note that NULL specifies to create an anonymous memory mapping. The flags argument specifies the behavior as follows.

Use huge pages

#include <iostream>

#include <madoka.h>

int main(int argc, char *argv[]) {
  madoka::Sketch sketch;
  sketch.create(0, 0, NULL, madoka::FILE_HUGETLB);

  if (sketch.flags() & madoka::FILE_HUGETLB) {
    std::cout << "HugeTLB: on" << std::endl;
  } else {
    std::cout << "HugeTLB: off" << std::endl;
  }
  return 0;
}
$ g++ use-huge-pages.cc -lmadoka
$ grep HugePages_Free /proc/meminfo
HugePages_Free:        0
$ ./a.out
HugeTLB: off
$ sudo sysctl -w vm.nr_hugepages=512
vm.nr_hugepages = 512
$ grep HugePages_Free /proc/meminfo
HugePages_Free:      512
$ ./a.out
HugeTLB: on

Roughly speaking, a Count-Min sketch is composed of hash tables and the number of hash tables is equal to its depth. The depth of madoka::Sketch is fixed to 3 and thus basic operations, set(), inc() and add(), except get(), perform at least 3 random accesses.

Because of the random access nature, if a sketch is larger than the CPU cache, cache misses occur in sketching and the memory access latency becomes the bottleneck. In addition, if a sketch is much, much, much larger than the CPU cache, TLB misses reduce the throughput of sketching.

madoka::FILE_HUGETLB is an optional flag to enable the use of huge pages. The use of huge pages reduces TLB misses in sketching. If a sketch is created with madoka::FILE_HUGETLB, the sketch tries to use huge pages. If huge pages are not available, the sketch uses regular pages.

This example shows how to use huge pages. You can check whether huge pages are available or not by reading /proc/meminfo. If disabled, you can enable huge pages by editing /proc/sys/vm/nr_hugepages. Note that only a user with root authority can enable huge pages. After that, madoka::Sketch can use huge pages. Remember that a sketch backed by a file does not support huge pages unless the file system supports huge pages. For more details, see information about huge page support.

Specify a seed

madoka::Sketch provides an interface to specify a seed, which is used to calculate hash values and to initialize a random number generator. The 5th argument of create() specifies the seed. However, basically there is no need to specify a user-defined seed. Note that you cannot merge sketches having different seeds.

Get information of a sketch

Catch an exception

#include <iostream>

#include <madoka.h>

int main(int argc, char *argv[]) {
  madoka::Sketch sketch;
  try {
    sketch.create(madoka::SKETCH_MAX_WIDTH + 1);
  } catch (const madoka::Exception &ex) {
    std::cerr << "error: "
              << ex.what() << std::endl;
    return -1;
  }
  return 0;
}
$ g++ catch-and-exception.cc -lmadoka
$ ./a.out
error: madoka/sketch.cc:453: width > SKETCH_MAX_WIDTH

Madoka throws an exception when an error occurs. The exception class is madoka::Exception. This example shows how to catch an exception.

In this example, create() fails to create a sketch because the specifed width is too large. Remember that madoka::Sketch is guaranteed to be unchanged when an error has occurred.

Draw a croquis

#include <iostream>

#include <madoka.h>

int main(int argc, char *argv[]) {
  madoka::Croquis<float> croquis;
  croquis.create();

  croquis.set("Madoka", 6, 1.25);
  croquis.set("Hiroshi", 7, 2.5);
  croquis.add("Madoka", 6, 0.5);

  std::cout << "Madoka: "
            << croquis.get("Madoka", 6) << std::endl;
  std::cout << "Hiroshi: "
            << croquis.get("Hiroshi", 7) << std::endl;
  return 0;
}
$ g++ draw-a-croquis.cc -lmadoka
$ ./a.out
Madoka: 1.75
Hiroshi: 2.5

madoka::Croquis is a simplified version of madoka::Sketch. It does not provide inc(), copy(), filter(), shrink(), merge() and inner_product(). Instead, madoka::Croquis has a template parameter that specifies the type of cells.

This example uses floating point numbers with madoka::Croquis</code>. The usage is same as that of madoka::Sketch, but remember that a tiny value may be truncated in add() because of the rounding.