# Database schema

Standalone conda-index uses a per-subdir sqlite database to track package
metadata, unlike the older version which used millions of tiny `.json` files.
The new strategy is much faster because we don't have to pay for many individual
`stat()` or `open()` calls.

The whole schema looks like this:

```sql
<subdir>/.cache % sqlite3 cache.db
SQLite version 3.41.2 2023-03-22 11:56:21
Enter ".help" for usage hints.
sqlite> .schema
CREATE TABLE about (path TEXT PRIMARY KEY, about BLOB);
CREATE TABLE index_json (path TEXT PRIMARY KEY, index_json BLOB);
CREATE TABLE recipe (path TEXT PRIMARY KEY, recipe BLOB);
CREATE TABLE recipe_log (path TEXT PRIMARY KEY, recipe_log BLOB);
CREATE TABLE run_exports (path TEXT PRIMARY KEY, run_exports BLOB);
CREATE TABLE post_install (path TEXT PRIMARY KEY, post_install BLOB);
CREATE TABLE icon (path TEXT PRIMARY KEY, icon_png BLOB);
CREATE TABLE stat (
                stage TEXT NOT NULL DEFAULT 'indexed',
                path TEXT NOT NULL,
                mtime NUMBER,
                size INTEGER,
                sha256 TEXT,
                md5 TEXT,
                last_modified TEXT,
                etag TEXT
            );
CREATE UNIQUE INDEX idx_stat ON stat (path, stage);
CREATE INDEX idx_stat_stage ON stat (stage, path);
```

```sql
sqlite> select stage, path from stat where path like 'libcurl%';
fs|libcurl-7.84.0-hc6d1d07_0.conda
fs|libcurl-7.86.0-h0f1d93c_0.conda
fs|libcurl-7.87.0-h0f1d93c_0.conda
fs|libcurl-7.88.1-h0f1d93c_0.conda
fs|libcurl-7.88.1-h9049daf_0.conda
indexed|libcurl-7.84.0-hc6d1d07_0.conda
indexed|libcurl-7.86.0-h0f1d93c_0.conda
indexed|libcurl-7.87.0-h0f1d93c_0.conda
indexed|libcurl-7.88.1-h0f1d93c_0.conda
indexed|libcurl-7.88.1-h9049daf_0.conda
```

Most of these tables store json-format metadata extracted from each package.

```sql
select * from index_json where path = 'libcurl-7.88.1-h9049daf_0.conda';
'libcurl-7.88.1-h9049daf_0.conda'
'{"build":"h9049daf_0",...,"sha256":"37b8d58c05386ac55d1d8e196c90b92b0a63f3f1fe2fa916bf5ed3e1656d8e14","size":321706}'
```

To track whether a package is indexed in the cache or not, conda-index uses a
table named `stat`. The main point of this table is to assign a stage value to
each artifact filename; usually `'fs'` which is called the `upstream` stage, and
`'indexed'`. `'fs'` means that the artifact is now available in the set of
packages (assumed by default to be the local filesystem). `'indexed'` means that
the entry already exists in the database (same filename, same timestamp, same
hash), and its package metadata has been extracted to the `index_json` etc.
tables. Paths in `'fs'` but not in `'indexed'` need to be unpacked to have their
metadata added to the database. Paths in `'indexed'` but not in `'fs'` will be
ignored and left out of `repodata.json`.

First, conda-index adds all files in a subdir to the `upstream` stage. This
involves a `listdir()` and `stat()` for each file in the index. The default
`upstream` stage is named `fs`, but this step is designed to be overridden by
subclassing `CondaIndexCache()` and replacing the `save_fs_state()` and
`changed_packages()` methods. By overriding `CondexIndexCache()` it is possible
to index without calling `stat()` on each package, or without even having all
packages stored on the indexing machine.

Next, conda-index looks for all `changed_packages()`: paths in the `upstream`
(`fs`) stage that don't exist in or have a different  modification time than
those in thie `indexed` stage.

Finally, a join between the `upstream` stage, usually `'fs'`, and the
`index_json` table yields a basic `repodata_from_packages.json` without any
repodata patches.

```sql
SELECT path, index_json FROM stat JOIN index_json USING (path) WHERE stat.stage = :upstream_stage
```

The steps to create `repodata.json`, including any repodata patches, and to
create `current_repodata.json` with only the latest versions of each package,
are similar to pre-sqlite3 conda-index.

The other cached metadata tables are used to create `channeldata.json`.


## Sample queries

Megabytes added per day:

```sql
select
  date(mtime, 'unixepoch') as d,
  printf('%0.2f', sum(size) / 1e6) as MB
from
  stat
group by
  date(mtime, 'unixepoch')
order by
  mtime desc
```