Overview

sist2 (simple incremental search tool) is a more powerful and more lightweight version of its Python predecessor. It currently provides full-text search over terabytes of online documents, such as scientific papers and comic books, at the-eye.eu.

It can parse many common file types (see README.md for the up-to-date list) and extracts text from their metadata and contents.

A typical workflow has three steps: scan, index, then web. The scan step comes first:

sist2 scan ./my_documents/ -o idx/

After this step, the raw index (./idx/) has been created and direct access to the files is no longer necessary. This means that you can pass around the raw index folder or use sist2 to index files stored on cold storage.

The index step converts the raw index into JSON documents and pushes them to Elasticsearch. sist2 is compatible with Elasticsearch 6.x and 7.x.

# Start a debug elasticsearch instance
docker run -d -p 9201:9200 \
	-e "discovery.type=single-node" \
	docker.elastic.co/elasticsearch/elasticsearch:7.4.2

# The --force-reset flag tells sist2 to (re)initialize
#  the Elasticsearch mappings & settings
sist2 index idx/ --force-reset --es-url http://localhost:9201
sist2 web idx/ --port 8080
# Starting web server @ http://localhost:8080
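
To make the index step a little more concrete, here is a minimal sketch of how a batch of documents could be serialized for Elasticsearch's `_bulk` endpoint, which expects newline-delimited JSON: one action line followed by one source line per document. The function name `build_bulk_body`, the default index name and the sample document are assumptions made for illustration; this is not sist2's actual code.

```python
import json


def build_bulk_body(docs, index_name="sist2"):
    """Serialize (doc_id, source) pairs as an Elasticsearch _bulk request body."""
    lines = []
    for doc_id, source in docs:
        # Action line: tells Elasticsearch where to index the line that follows.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    body = build_bulk_body([
        ("e594641d-8291-4f25-8031-2b69db231479",
         {"mime": "video/webm", "size": 2619920}),
    ])
    print(body, end="")
```

The resulting string would be sent with a `Content-Type: application/x-ndjson` header. Elasticsearch 6.x may additionally expect a document type, either in the action line or in the request URL.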

Web interface

The web module can serve the search interface on its own without additional configuration. The files themselves can either be served by a remote HTTP server that acts as an external CDN, or by sist2 directly from disk. In the latter case, Partial Content is supported, meaning that Range requests are accepted and media files can be ‘seeked’ from the browser.
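
As a rough illustration of what supporting Range requests involves, the sketch below turns a single-range `Range: bytes=...` header into the inclusive byte offsets a server would send back with a 206 Partial Content response. The helper name `parse_range` is hypothetical and the logic is a simplification (multi-range requests are rejected); it is not taken from sist2.

```python
def parse_range(header, file_size):
    """Return inclusive (start, end) offsets for a single 'bytes=...' range,
    or None if the range is malformed or cannot be satisfied."""
    unit, _, spec = header.partition("=")
    if unit.strip() != "bytes" or "," in spec:
        return None  # only single byte ranges are handled in this sketch
    first, sep, last = spec.strip().partition("-")
    if not sep:
        return None
    try:
        if first == "":
            # Suffix form "bytes=-N": the last N bytes of the file.
            length = int(last)
            if length <= 0:
                return None
            start, end = max(file_size - length, 0), file_size - 1
        else:
            start = int(first)
            # Open-ended form "bytes=N-": from N to the end of the file.
            end = int(last) if last else file_size - 1
    except ValueError:
        return None
    end = min(end, file_size - 1)  # clamp to the last valid byte
    if start < 0 or start > end:
        return None
    return start, end


if __name__ == "__main__":
    size = 1000
    start, end = parse_range("bytes=500-", size)
    print(f"206 Partial Content, Content-Range: bytes {start}-{end}/{size}")
```

When a browser seeks in a video, it typically sends the open-ended form, and the server replies with only the requested slice plus a `Content-Range` header.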

The UI itself is not much different from the original Python/Flask version. However, the JavaScript client is a bit thicker: most operations that were originally handled by the Flask server, such as auto-complete and the retrieval of the mime type list, are now done client-side.

This is possible because Elasticsearch queries are proxied as-is through sist2. For example, the mime type selection widget is populated with a function similar to this:

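$.ajax({
    url: "es",
    method: "POST",
    contentType: "application/json",
    // Elasticsearch query body, serialized as JSON
    data: JSON.stringify({
        aggs: {
            mimeTypes: {
                terms: {
                    field: "mime",
                    size: 10000
                }
            }
        },
        size: 0
    })
}).then(resp => {
    // One bucket per distinct mime type
    resp["aggregations"]["mimeTypes"]["buckets"].forEach(bucket => {
        console.log(bucket);
        //...
    });
});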

Another improvement was to re-skin the whole page so that users can choose a dark, OLED-friendly theme. Pressing the theme toggle button sets the cookie sist=dark, and sist2 serves different content depending on the value of that cookie.
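
The cookie check itself can be sketched in a few lines. The function name `pick_theme` and the fallback value "light" are assumptions for the example; only the `sist=dark` cookie comes from the description above.

```python
from http.cookies import SimpleCookie


def pick_theme(cookie_header):
    """Return 'dark' when the request carries sist=dark, otherwise 'light'."""
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    morsel = cookie.get("sist")
    # Any missing or unexpected value falls back to the default theme.
    return "dark" if morsel is not None and morsel.value == "dark" else "light"


if __name__ == "__main__":
    print(pick_theme("sist=dark"))
```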

Web interface (Dark theme) displaying Occult Library books

Thumbnail storage

An LMDB (Lightning Memory-Mapped Database) key-value store is used to asynchronously save the thumbnails as they are generated by the indexer. Once the scan step is done, the database file is used by the web module to serve the thumbnails with very little latency.

Since the database is mapped in memory (see mmap(2)), the web process may appear to have a high memory usage under load, but almost all of it belongs to the mapping of the data.mdb file. In fact, if we take a look with pmap, we can see that virtually all of the resident memory is used for LMDB and that none of it is dirty. This means that the operating system can reclaim those pages whenever it needs to, and that, over time, the memory usage will return to ~20M.

$ pmap -x <PID>

Address           Kbytes     RSS   Dirty Mode  Mapping
00005641d5689000   21300     536       0 r-x-- sist2
00005641d5689000       0       0       0 r-x-- sist2
00005641d6d56000     432       8       0 r-x-- sist2
00005641d6d56000       0       0       0 r-x-- sist2
00005641d6dc2000   32696     768       8 rwx-- sist2
00005641d6dc2000       0       0       0 rwx-- sist2
00005641d8db0000    8452     100       4 rwx--   [ anon ]
...
00007fd1d7419000 3180068  160000       0 rwxs- data.mdb
00007fd2998a4000 2290452     240       0 rwxs- data.mdb
00007fd32586b000 10721328   64868       0 rwxs- data.mdb
00007fd5b4179000 3535892   51616       0 rwxs- data.mdb
00007fd68c180000 4446024  118668       0 rwxs- data.mdb
00007fd79ba54000 1411416   47992       0 rwxs- data.mdb
00007fd7f1fac000  560000    6044       0 rwxs- data.mdb
00007fd81458e000 9069792  217464       0 rwxs- data.mdb
...
00007fda42736000    2048       0       0 ----- libc-2.24.so
---------------- ------- ------- ------- 
total kB         36085472  683468   10472
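
To check the "virtually all resident memory belongs to LMDB" claim without reading the listing by eye, the RSS and Dirty columns can be summed per mapping. This is a small sketch that assumes the column layout shown above; the function name `rss_by_mapping` is made up for the example.

```python
from collections import defaultdict


def rss_by_mapping(pmap_output):
    """Sum the RSS and Dirty columns (in kB) of `pmap -x` output per mapping."""
    totals = defaultdict(lambda: [0, 0])  # mapping name -> [rss_kb, dirty_kb]
    for line in pmap_output.splitlines():
        parts = line.split(None, 5)
        if len(parts) < 6:
            continue  # blank lines, "...", separator and "total" rows
        address, _kbytes, rss, dirty, _mode, mapping = parts
        try:
            int(address, 16)  # data rows start with a hexadecimal address
            rss_kb, dirty_kb = int(rss), int(dirty)
        except ValueError:
            continue  # header row or anything else that is not a data row
        name = mapping.strip()
        totals[name][0] += rss_kb
        totals[name][1] += dirty_kb
    return {name: tuple(values) for name, values in totals.items()}


if __name__ == "__main__":
    sample = """\
Address           Kbytes     RSS   Dirty Mode  Mapping
00005641d6dc2000   32696     768       8 rwx-- sist2
00007fd1d7419000 3180068  160000       0 rwxs- data.mdb
00007fd2998a4000 2290452     240       0 rwxs- data.mdb
"""
    for name, (rss_kb, dirty_kb) in rss_by_mapping(sample).items():
        print(f"{name}: rss={rss_kb} kB dirty={dirty_kb} kB")
```

In practice the input would be the full output of `pmap -x <PID>` rather than a hand-copied excerpt.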

Media Files

All audio and video files are handled by ffmpeg’s libav* libraries. This is extremely helpful, since all audio/*, video/* and image/* file types can be handled the same way (an image is simply a video with a single frame). For instance, there is no difference in the code between thumbnails that are generated from the embedded cover art of a .mp3 file and thumbnails generated from a video stream of a .mkv container. We also don’t have to worry about odd encodings because ffmpeg is bundled with hundreds of decoders.

Font Files

Font files were especially painful to work with, since I had to implement the code to generate the thumbnails mostly from scratch. Each letter is individually drawn into a bitmap, which is then converted to uncompressed BMP format and saved directly to disk. Thankfully, most font faces are relatively standard, in that they are meant to be displayed from left to right, and glyphs for the basic Latin alphabet are available.

For the rest, I would mostly have to handle each corner case one by one. At the time of writing, I have given up on trying to render atypical font faces.
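
For readers curious about the "convert to uncompressed BMP" part, the sketch below writes a grayscale bitmap (rows of 0-255 values, top row first) as a 24-bit uncompressed BMP. The function name `encode_bmp`, the 24-bit depth and the 2835 pixels-per-metre resolution are my own choices for the example; sist2's actual code may use a different layout.

```python
import struct


def encode_bmp(pixels):
    """Encode rows of grayscale values (0-255) as a 24-bit uncompressed BMP."""
    if not pixels or not pixels[0]:
        raise ValueError("bitmap must have at least one pixel")
    height, width = len(pixels), len(pixels[0])
    row_size = (width * 3 + 3) & ~3  # each row is padded to a multiple of 4 bytes
    image_size = row_size * height
    # 14-byte file header: magic, file size, two reserved fields, pixel data offset.
    file_header = struct.pack("<2sIHHI", b"BM", 54 + image_size, 0, 0, 54)
    # 40-byte BITMAPINFOHEADER: size, width, height, planes, bits per pixel,
    # compression (0 = none), image size, x/y resolution, palette fields.
    info_header = struct.pack(
        "<IiiHHIIiiII", 40, width, height, 1, 24, 0, image_size, 2835, 2835, 0, 0
    )
    body = bytearray()
    for row in reversed(pixels):  # BMP stores rows bottom-up
        line = bytearray()
        for value in row:
            line += bytes((value, value, value))  # B, G, R are equal for gray
        line += b"\x00" * (row_size - width * 3)
        body += line
    return file_header + info_header + bytes(body)


if __name__ == "__main__":
    glyph = [[0, 255, 0],
             [255, 255, 255],
             [0, 255, 0]]
    with open("glyph.bmp", "wb") as f:
        f.write(encode_bmp(glyph))
```

The appeal of the format is that there is no compression step at all: once the glyphs are rasterized, the file is just two fixed-size headers followed by the padded pixel rows.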

Raw Index Binary Format

For simplicity’s sake, the document metadata structure is dumped directly from memory to file without much additional processing. While it’s not as space-efficient as it could be, it is still much smaller than the equivalent JSON, which takes roughly 3.5 times as much space.

idx/_index_<pid>

000  e5 94 64 1d 82 91 4f 25  80 31 2b 69 db 23 14 79  ..d...O%.1+i.#.y
010  dd 00 84 00 31 08 00 00  10 fa 27 00 00 00 00 00  ....1.....'.....
020  8a 01 06 00 cc ea a7 5c  00 00 08 00 00 00 00 00  .......\........
030  62 6f 62 72 6f 73 73 2e  77 65 62 6d 00 f6 8b 00  bobross.webm....
040  00 00 f2 00 05 00 00 f3  d0 02 00 00 0a         

This, of course, makes little difference, since neither format is needed once the documents have been indexed into Elasticsearch.

(Elasticsearch JSON document)

{
  "_id": "e594641d-8291-4f25-8031-2b69db231479",
  "_index": "sist2",
  "_type": "_doc",
  "_source": {
    "index": "bb3d8cc5-2e5c-4f1c-ac04-1b1f6d9b070a",
    "mime": "video/webm",
    "size": 2619920,
    "mtime": 1554508492,
    "extension": "webm",
    "name": "bobross",
    "path": "",
    "videoc": "vp8",
    "width": 1280,
    "height": 720
  }
}
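
Comparing the two listings above, several JSON fields can be located by hand in the hex dump. The sketch below decodes them; every offset and the tag/value layout of the trailing bytes are inferred from this single example record, so treat them as assumptions and not as a specification of the raw index format. The function name `decode_record` is likewise made up.

```python
import struct
import uuid

# The 77 bytes of the hex dump shown above.
DUMP = bytes.fromhex(
    "e5 94 64 1d 82 91 4f 25 80 31 2b 69 db 23 14 79"
    "dd 00 84 00 31 08 00 00 10 fa 27 00 00 00 00 00"
    "8a 01 06 00 cc ea a7 5c 00 00 08 00 00 00 00 00"
    "62 6f 62 72 6f 73 73 2e 77 65 62 6d 00 f6 8b 00"
    "00 00 f2 00 05 00 00 f3 d0 02 00 00 0a"
)


def decode_record(buf):
    """Decode the fields that can be matched against the JSON document."""
    doc_id = uuid.UUID(bytes=buf[0:16])             # first 16 bytes: the _id
    (size,) = struct.unpack_from("<Q", buf, 0x18)   # little-endian file size
    (mtime,) = struct.unpack_from("<I", buf, 0x24)  # little-endian Unix time
    name_end = buf.index(b"\x00", 0x30)             # NUL-terminated file name
    name = buf[0x30:name_end].decode("utf-8")
    # Assumed layout of the tail: 1-byte tag + 4-byte value, ended by 0x0a.
    tags = {}
    pos = name_end + 1
    while pos + 5 <= len(buf) and buf[pos] != 0x0A:
        tag, value = struct.unpack_from("<BI", buf, pos)
        tags[tag] = value
        pos += 5
    return {"id": str(doc_id), "size": size, "mtime": mtime,
            "name": name, "tags": tags}


if __name__ == "__main__":
    record = decode_record(DUMP)
    print(record["id"])     # e594641d-8291-4f25-8031-2b69db231479
    print(record["size"])   # 2619920
    print(record["mtime"])  # 1554508492
    print(record["name"])   # bobross.webm
    print(record["tags"])   # {246: 139, 242: 1280, 243: 720}
```

The values tagged 0xf2 and 0xf3 match the width and height in the JSON document, which suggests that media metadata is stored as a short list of tagged values after the file name.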