Database schema

Standalone conda-index uses a per-subdir sqlite database to track package metadata, unlike the older version which used millions of tiny .json files. The new strategy is much faster because we don’t have to pay for many individual stat() or open() calls.

The whole schema looks like this:

<subdir>/.cache % sqlite3 cache.db
SQLite version 3.41.2 2023-03-22 11:56:21
Enter ".help" for usage hints.
sqlite> .schema
CREATE TABLE about (path TEXT PRIMARY KEY, about BLOB);
CREATE TABLE index_json (path TEXT PRIMARY KEY, index_json BLOB);
CREATE TABLE recipe (path TEXT PRIMARY KEY, recipe BLOB);
CREATE TABLE recipe_log (path TEXT PRIMARY KEY, recipe_log BLOB);
CREATE TABLE run_exports (path TEXT PRIMARY KEY, run_exports BLOB);
CREATE TABLE post_install (path TEXT PRIMARY KEY, post_install BLOB);
CREATE TABLE icon (path TEXT PRIMARY KEY, icon_png BLOB);
CREATE TABLE stat (
                stage TEXT NOT NULL DEFAULT 'indexed',
                path TEXT NOT NULL,
                mtime NUMBER,
                size INTEGER,
                sha256 TEXT,
                md5 TEXT,
                last_modified TEXT,
                etag TEXT
            );
CREATE UNIQUE INDEX idx_stat ON stat (path, stage);
CREATE INDEX idx_stat_stage ON stat (stage, path);
sqlite> select stage, path from stat where path like 'libcurl%';
fs|libcurl-7.84.0-hc6d1d07_0.conda
fs|libcurl-7.86.0-h0f1d93c_0.conda
fs|libcurl-7.87.0-h0f1d93c_0.conda
fs|libcurl-7.88.1-h0f1d93c_0.conda
fs|libcurl-7.88.1-h9049daf_0.conda
indexed|libcurl-7.84.0-hc6d1d07_0.conda
indexed|libcurl-7.86.0-h0f1d93c_0.conda
indexed|libcurl-7.87.0-h0f1d93c_0.conda
indexed|libcurl-7.88.1-h0f1d93c_0.conda
indexed|libcurl-7.88.1-h9049daf_0.conda

Most of these tables store json-format metadata extracted from each package.

select * from index_json where path = 'libcurl-7.88.1-h9049daf_0.conda';
'libcurl-7.88.1-h9049daf_0.conda'
'{"build":"h9049daf_0",...,"sha256":"37b8d58c05386ac55d1d8e196c90b92b0a63f3f1fe2fa916bf5ed3e1656d8e14","size":321706}'

Sample queries

Megabytes added per day:

select
  date(mtime, 'unixepoch') as d,
  printf('%0.2f', sum(size) / 1e6) as MB
from
  stat
group by
  date(mtime, 'unixepoch')
order by
  mtime desc