Bird’s eye view#

Background#

Data lineage tracks data’s journey, detailing its origins, transformations, and interactions to trace biological insights, verify experimental outcomes, meet regulatory standards, and increase the robustness of research. While tracking data lineage is easier when it is governed by deterministic pipelines, it becomes hard when its governed by interactive human-driven analyses.

Here, we’ll backtrace file transformations through notebooks, pipelines & app uploads in a research project based on Schmidt22 which conducted genome-wide CRISPR activation and interference screens in primary human T cells to identify gene networks controlling IL-2 and IFN-γ production.

Setup#

We need an instance:

!lamin init --storage ./mydata

Import lamindb:

import lamindb as ln

✅ loaded instance: testuser1/mydata (lamindb 0.51.0)

We simulate the raw data processing of Schmidt22 with toy data in a real world setting with multiple collaborators (here testuser1 and testuser2):

assert ln.setup.settings.user.handle == "testuser1"

bfx_run_output = ln.dev.datasets.dir_scrnaseq_cellranger(
    "perturbseq", basedir=ln.settings.storage, output_only=False
)
ln.track(ln.Transform(name="Chromium 10x upload", type="pipeline"))
ln.File(bfx_run_output.parent / "fastq/perturbseq_R1_001.fastq.gz").save()
ln.File(bfx_run_output.parent / "fastq/perturbseq_R2_001.fastq.gz").save()

Track a bioinformatics pipeline#

When working with a pipeline, we’ll register it before running it.

This only happens once and could be done by anyone on your team.

ln.setup.login("testuser2")

✅ logged in with email testuser2@lamin.ai and id bKeW4T6E

❗ record with similar name exist! did you mean to load it?

	id	__ratio__
name
Test User1	DzTjkKse	90.0

✅ saved: User(id='bKeW4T6E', handle='testuser2', email='testuser2@lamin.ai', name='Test User2', updated_at=2023-08-28 14:21:05)

transform = ln.Transform(name="Cell Ranger", version="7.2.0", type="pipeline")

ln.User.filter().df()

	handle	email	name	updated_at
id
DzTjkKse	testuser1	testuser1@lamin.ai	Test User1	2023-08-28 14:21:03
bKeW4T6E	testuser2	testuser2@lamin.ai	Test User2	2023-08-28 14:21:05

transform

Transform(id='XKMOyYO3PQgrD3', name='Cell Ranger', version='7.2.0', type='pipeline', created_by_id='bKeW4T6E')

ln.track(transform)

✅ saved: Transform(id='XKMOyYO3PQgrD3', name='Cell Ranger', version='7.2.0', type='pipeline', updated_at=2023-08-28 14:21:05, created_by_id='bKeW4T6E')

✅ saved: Run(id='V2XJCq8Eh95X8GmvaU2A', run_at=2023-08-28 14:21:05, transform_id='XKMOyYO3PQgrD3', created_by_id='bKeW4T6E')

Now, let’s stage a few files from an instrument upload:

files = ln.File.filter(key__startswith="fastq/perturbseq").all()
filepaths = [file.stage() for file in files]

💡 adding file CBlOB54RhdJy9AW7uj0q as input for run V2XJCq8Eh95X8GmvaU2A, adding parent transform nnvoBKqmX1t2MY

💡 adding file 1PrHf91xBNVIdmEmnESf as input for run V2XJCq8Eh95X8GmvaU2A, adding parent transform nnvoBKqmX1t2MY

Assume we processed them and obtained 3 output files in a folder 'filtered_feature_bc_matrix':

output_files = ln.File.from_dir("./mydata/perturbseq/filtered_feature_bc_matrix/")
ln.save(output_files)

Let’s look at the data lineage at this stage:

output_files[0].view_lineage()

https://d33wubrfki0l68.cloudfront.net/3a83efe8d0f15a4b2643e2b4f77d17a303f7e5e2/9bf47/_images/f2cd4c16a44ee7ca44723a77675d47cccdc174d2fc32e79534ad3d4928f5a839.svg

And let’s keep running the Cell Ranger pipeline in the background.

Track app upload & analytics#

The hidden cell below simulates additional analytic steps including:

uploading phenotypic screen data
scRNA-seq analysis
analyses of the integrated datasets

Let’s see what the data lineage of this looks:

file = ln.File.filter(description="hits from schmidt22 crispra GWS").one()
file.view_lineage()

https://d33wubrfki0l68.cloudfront.net/8eb77933d097ac4de53241dc96907119f0ccd21b/854a6/_images/8149039419aca80ebbd5b97fb5d2127db8d792dc4fe17f0cd4672b09693a10fe.svg

In the backgound, somebody integrated and analyzed the outputs of the app upload and the Cell Ranger pipeline:

Show code cell content Hide code cell content

# Let us add analytics on top of the cell ranger pipeline and the phenotypic screening
transform = ln.Transform(
    name="Perform single cell analysis, integrating with CRISPRa screen",
    type="notebook",
)
ln.track(transform)

file_ps = ln.File.filter(description__icontains="perturbseq").one()
adata = file_ps.load()
screen_hits = file_hits.load()

import scanpy as sc

sc.tl.score_genes(adata, adata.var_names.intersection(screen_hits.index).tolist())
filesuffix = "_fig1_score-wgs-hits.png"
sc.pl.umap(adata, color="score", show=False, save=filesuffix)
filepath = f"figures/umap{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()
filesuffix = "fig2_score-wgs-hits-per-cluster.png"
sc.pl.matrixplot(
    adata, groupby="cluster_name", var_names=["score"], show=False, save=filesuffix
)
filepath = f"figures/matrixplot_{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()

✅ saved: Transform(id='77xjp9OwkGGwth', name='Perform single cell analysis, integrating with CRISPRa screen', type='notebook', updated_at=2023-08-28 14:21:10, created_by_id='bKeW4T6E')

✅ saved: Run(id='slCK9KQI5DpR08oIcYvo', run_at=2023-08-28 14:21:10, transform_id='77xjp9OwkGGwth', created_by_id='bKeW4T6E')

💡 adding file ZCcX5DBKHm9DU2Ybemf2 as input for run slCK9KQI5DpR08oIcYvo, adding parent transform caxQSPA6QV3C6e

💡 adding file Cdu0q9sp5SdM4Q2t4bUB as input for run slCK9KQI5DpR08oIcYvo, adding parent transform nv8koAu7vWQ7yJ

WARNING: saving figure to file figures/umap_fig1_score-wgs-hits.png

💡 file will be copied to default storage upon `save()` with key 'figures/umap_fig1_score-wgs-hits.png'

✅ storing file 'waseTZpuKcgAyK5dqsiI' at 'figures/umap_fig1_score-wgs-hits.png'

WARNING: saving figure to file figures/matrixplot_fig2_score-wgs-hits-per-cluster.png

💡 file will be copied to default storage upon `save()` with key 'figures/matrixplot_fig2_score-wgs-hits-per-cluster.png'

✅ storing file '0g0vwXFUQi4tXsnDZBgG' at 'figures/matrixplot_fig2_score-wgs-hits-per-cluster.png'

The outcome of it are a few figures stored as image files. Let’s query one of them and look at the data lineage:

Track notebooks#

We’d now like to track the current Jupyter notebook to continue the work:

ln.track()

💡 notebook imports: ipython==8.14.0 lamindb==0.51.0 scanpy==1.9.4

✅ saved: Transform(id='1LCd8kco9lZUz8', name='Bird's eye view', short_name='birds-eye', version='0', type=notebook, updated_at=2023-08-28 14:21:12, created_by_id='bKeW4T6E')

✅ saved: Run(id='VwarMAJPHZtHdu1Ay7D9', run_at=2023-08-28 14:21:12, transform_id='1LCd8kco9lZUz8', created_by_id='bKeW4T6E')

Visualize data lineage#

Let’s load one of the plots:

file = ln.File.filter(key__contains="figures/matrixplot").one()

from IPython.display import Image, display

file.stage()
display(Image(filename=file.path))

💡 adding file 0g0vwXFUQi4tXsnDZBgG as input for run VwarMAJPHZtHdu1Ay7D9, adding parent transform 77xjp9OwkGGwth

https://d33wubrfki0l68.cloudfront.net/dcbd1e67232f2ede82171ba02237575cc586c2b7/1ceff/_images/45891ad4693b5bfeb52a48b2ab2e5d0a82220b9482360ee1a8757fad581fffdc.png

We see that the image file is tracked as an input of the current notebook. The input is highlighted, the notebook follows at the bottom:

file.view_lineage()

https://d33wubrfki0l68.cloudfront.net/545590abd347246e5be604a8a81414577a4bc9ad/1b421/_images/4a703d8635924180bd705bd857b8c767e2148a3712690c16c821416764854ff4.svg

Alternatively, we can also purely look at the sequence of transforms and ignore the files:

transform = ln.Transform.search("Bird's eye view", return_queryset=True).first()

transform.parents.df()

	name	short_name	version	initial_version_id	type	reference	updated_at	created_by_id
id
77xjp9OwkGGwth	Perform single cell analysis, integrating with...	None	None	None	notebook	None	2023-08-28 14:21:11	bKeW4T6E

transform.view_parents()

https://d33wubrfki0l68.cloudfront.net/8b2c3a1f762fc38ceaa6461d419483a05cbb5426/1eb4a/_images/e5fff154f065dc8c5e4aafbf78a174ba829031c21473c688b16e12b247cde619.svg

Understand runs#

We tracked pipeline and notebook runs through run_context, which stores a Transform and a Run record as a global context.

File objects are the inputs and outputs of runs.

Query by provenance#

We can query or search for the notebook that created the file:

transform = ln.Transform.search("GWS CRIPSRa analysis", return_queryset=True).first()

And then find all the files created by that notebook:

ln.File.filter(transform=transform).df()

	storage_id	key	suffix	accessor	description	version	initial_version_id	size	hash	hash_type	transform_id	run_id	updated_at	created_by_id
id
Cdu0q9sp5SdM4Q2t4bUB	14knrNE6	None	.parquet	DataFrame	hits from schmidt22 crispra GWS	None	None	18368	O2Owo0_QlM9JBS2zAZD4Lw	md5	nv8koAu7vWQ7yJ	kdj1Slelm88rpqWQkS5G	2023-08-28 14:21:10	bKeW4T6E

Which transform ingested a given file?

file = ln.File.filter().first()
file.transform

Transform(id='nnvoBKqmX1t2MY', name='Chromium 10x upload', type='pipeline', updated_at=2023-08-28 14:21:04, created_by_id='DzTjkKse')

And which user?

file.created_by

User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 14:21:07)

Which transforms were created by a given user?

users = ln.User.lookup()

ln.Transform.filter(created_by=users.testuser2).df()

	name	short_name	version	initial_version_id	type	reference	updated_at	created_by_id
id
XKMOyYO3PQgrD3	Cell Ranger	None	7.2.0	None	pipeline	None	2023-08-28 14:21:05	bKeW4T6E
caxQSPA6QV3C6e	Preprocess Cell Ranger outputs	None	2.0	None	pipeline	None	2023-08-28 14:21:06	bKeW4T6E
nv8koAu7vWQ7yJ	GWS CRIPSRa analysis	None	None	None	notebook	None	2023-08-28 14:21:10	bKeW4T6E
77xjp9OwkGGwth	Perform single cell analysis, integrating with...	None	None	None	notebook	None	2023-08-28 14:21:11	bKeW4T6E
1LCd8kco9lZUz8	Bird's eye view	birds-eye	0	None	notebook	None	2023-08-28 14:21:12	bKeW4T6E

Which notebooks were created by a given user?

ln.Transform.filter(created_by=users.testuser2, type="notebook").df()

	name	short_name	version	initial_version_id	type	reference	updated_at	created_by_id
id
nv8koAu7vWQ7yJ	GWS CRIPSRa analysis	None	None	None	notebook	None	2023-08-28 14:21:10	bKeW4T6E
77xjp9OwkGGwth	Perform single cell analysis, integrating with...	None	None	None	notebook	None	2023-08-28 14:21:11	bKeW4T6E
1LCd8kco9lZUz8	Bird's eye view	birds-eye	0	None	notebook	None	2023-08-28 14:21:12	bKeW4T6E

We can also view all recent additions to the entire database:

ln.view()

Show code cell output Hide code cell output

File

	storage_id	key	suffix	accessor	description	version	initial_version_id	size	hash	hash_type	transform_id	run_id	updated_at	created_by_id
id
0g0vwXFUQi4tXsnDZBgG	14knrNE6	figures/matrixplot_fig2_score-wgs-hits-per-clu...	.png	None	None	None	None	28814	JYIPcat0YWYVCX3RVd3mww	md5	77xjp9OwkGGwth	slCK9KQI5DpR08oIcYvo	2023-08-28 14:21:12	bKeW4T6E
waseTZpuKcgAyK5dqsiI	14knrNE6	figures/umap_fig1_score-wgs-hits.png	.png	None	None	None	None	118999	laQjVk4gh70YFzaUyzbUNg	md5	77xjp9OwkGGwth	slCK9KQI5DpR08oIcYvo	2023-08-28 14:21:11	bKeW4T6E
Cdu0q9sp5SdM4Q2t4bUB	14knrNE6	None	.parquet	DataFrame	hits from schmidt22 crispra GWS	None	None	18368	O2Owo0_QlM9JBS2zAZD4Lw	md5	nv8koAu7vWQ7yJ	kdj1Slelm88rpqWQkS5G	2023-08-28 14:21:10	bKeW4T6E
9R6ZwEcOz75YI3cubpzC	14knrNE6	schmidt22-crispra-gws-IFNG.csv	.csv	None	Raw data of schmidt22 crispra GWS	None	None	1729685	cUSH0oQ2w-WccO8_ViKRAQ	md5	bo9nh9PJkfNZ3E	QDDycyB2eCEcaW67vJ3y	2023-08-28 14:21:09	DzTjkKse
ZCcX5DBKHm9DU2Ybemf2	14knrNE6	schmidt22_perturbseq.h5ad	.h5ad	AnnData	perturbseq counts	None	None	20659936	la7EvqEUMDlug9-rpw-udA	md5	caxQSPA6QV3C6e	m4LHPEv874Tr5TQpSjXZ	2023-08-28 14:21:06	bKeW4T6E
Yl7zC34R1z8WfIfTFIKb	14knrNE6	perturbseq/filtered_feature_bc_matrix/barcodes...	.tsv.gz	None	None	None	None	6	G4l1tW72nOSg-q858lYARg	md5	XKMOyYO3PQgrD3	V2XJCq8Eh95X8GmvaU2A	2023-08-28 14:21:05	bKeW4T6E
GIZ6lStUW8WPdbpxLlz0	14knrNE6	perturbseq/filtered_feature_bc_matrix/matrix.m...	.mtx.gz	None	None	None	None	6	wn8u1-AVw1I-KHwzJiZ7-g	md5	XKMOyYO3PQgrD3	V2XJCq8Eh95X8GmvaU2A	2023-08-28 14:21:05	bKeW4T6E

Run

	transform_id	run_at	created_by_id	reference	reference_type
id
tTdod4HEQTSmbJrr9Bz6	nnvoBKqmX1t2MY	2023-08-28 14:21:04	DzTjkKse	None	None
V2XJCq8Eh95X8GmvaU2A	XKMOyYO3PQgrD3	2023-08-28 14:21:05	bKeW4T6E	None	None
m4LHPEv874Tr5TQpSjXZ	caxQSPA6QV3C6e	2023-08-28 14:21:05	bKeW4T6E	None	None
QDDycyB2eCEcaW67vJ3y	bo9nh9PJkfNZ3E	2023-08-28 14:21:07	DzTjkKse	None	None
kdj1Slelm88rpqWQkS5G	nv8koAu7vWQ7yJ	2023-08-28 14:21:10	bKeW4T6E	None	None
slCK9KQI5DpR08oIcYvo	77xjp9OwkGGwth	2023-08-28 14:21:10	bKeW4T6E	None	None
VwarMAJPHZtHdu1Ay7D9	1LCd8kco9lZUz8	2023-08-28 14:21:12	bKeW4T6E	None	None

Storage

	root	type	region	updated_at	created_by_id
id
14knrNE6	/home/runner/work/lamin-usecases/lamin-usecase...	local	None	2023-08-28 14:21:03	DzTjkKse

Transform

	name	short_name	version	initial_version_id	type	reference	updated_at	created_by_id
id
1LCd8kco9lZUz8	Bird's eye view	birds-eye	0	None	notebook	None	2023-08-28 14:21:12	bKeW4T6E
77xjp9OwkGGwth	Perform single cell analysis, integrating with...	None	None	None	notebook	None	2023-08-28 14:21:11	bKeW4T6E
nv8koAu7vWQ7yJ	GWS CRIPSRa analysis	None	None	None	notebook	None	2023-08-28 14:21:10	bKeW4T6E
bo9nh9PJkfNZ3E	Upload GWS CRISPRa result	None	None	None	app	None	2023-08-28 14:21:09	DzTjkKse
caxQSPA6QV3C6e	Preprocess Cell Ranger outputs	None	2.0	None	pipeline	None	2023-08-28 14:21:06	bKeW4T6E
XKMOyYO3PQgrD3	Cell Ranger	None	7.2.0	None	pipeline	None	2023-08-28 14:21:05	bKeW4T6E
nnvoBKqmX1t2MY	Chromium 10x upload	None	None	None	pipeline	None	2023-08-28 14:21:04	DzTjkKse

User

	handle	email	name	updated_at
id
bKeW4T6E	testuser2	testuser2@lamin.ai	Test User2	2023-08-28 14:21:10
DzTjkKse	testuser1	testuser1@lamin.ai	Test User1	2023-08-28 14:21:07