Validate & register bulk RNA-seq datasets#

!lamin init --storage test-bulkrna --schema bionty

import lamindb as ln
from pathlib import Path
import lnschema_bionty as lb
import pandas as pd
import anndata as ad

✅ loaded instance: testuser1/test-bulkrna (lamindb 0.51.0)

Let’s start simulating the output of a bulk RNA-seq pipeline:

Curate the count matrix #

ln.track()

💡 notebook imports: anndata==0.9.2 lamindb==0.51.0 lnschema_bionty==0.30.0 pandas==1.5.3

✅ saved: Transform(id='s5V0dNMVwL9iz8', name='Validate & register bulk RNA-seq datasets', short_name='bulkrna', version='0', type=notebook, updated_at=2023-08-28 14:19:05, created_by_id='DzTjkKse')

✅ saved: Run(id='TnGfR4Zt3bidzLPLH27h', run_at=2023-08-28 14:19:05, transform_id='s5V0dNMVwL9iz8', created_by_id='DzTjkKse')

Let’s query the file:

file = ln.File.filter(description="Merged Bulk RNA counts").one()

df = file.load()

💡 adding file Hx0Ge2qqFuzFi78VrgTN as input for run TnGfR4Zt3bidzLPLH27h, adding parent transform vFJyCpNqUbP3hW

If we look at it, we realize it deviates far from the tidy data standard [Wickham14], conventions of statistics & machine learning [Hastie09, Murphy12] and the major Python & R data packages.

Variables aren’t in columns and observations aren’t in rows:

df

	gene_id	gene_name	RAP1_IAA_30M_REP1	RAP1_UNINDUCED_REP1	RAP1_UNINDUCED_REP2	WT_REP1	WT_REP2
0	Gfp_transgene_gene	Gfp_transgene_gene	0.0	0.000	0.0	0.0	0.0
1	HRA1	HRA1	0.0	8.572	0.0	0.0	0.0
2	snR18	snR18	3.0	8.000	4.0	8.0	8.0
3	tA(UGC)A	TGA1	0.0	0.000	0.0	0.0	0.0
4	tL(CAA)A	SUP56	0.0	0.000	0.0	0.0	0.0
...	...	...	...	...	...	...	...
120	YAR064W	YAR064W	0.0	2.000	0.0	0.0	0.0
121	YAR066W	YAR066W	3.0	13.000	8.0	5.0	11.0
122	YAR068W	YAR068W	9.0	28.000	24.0	5.0	7.0
123	YAR069C	YAR069C	0.0	0.000	0.0	0.0	1.0
124	YAR070C	YAR070C	0.0	0.000	0.0	0.0	0.0

125 rows × 7 columns

Let’s change that and move observations into rows:

df = df.T

df

	0	1	2	3	4	5	6	7	8	9	...	115	116	117	118	119	120	121	122	123	124
gene_id	Gfp_transgene_gene	HRA1	snR18	tA(UGC)A	tL(CAA)A	tP(UGG)A	tS(AGA)A	YAL001C	YAL002W	YAL003W	...	YAR050W	YAR053W	YAR060C	YAR061W	YAR062W	YAR064W	YAR066W	YAR068W	YAR069C	YAR070C
gene_name	Gfp_transgene_gene	HRA1	snR18	TGA1	SUP56	TRN1	tS(AGA)A	TFC3	VPS8	EFB1	...	FLO1	YAR053W	YAR060C	YAR061W	YAR062W	YAR064W	YAR066W	YAR068W	YAR069C	YAR070C
RAP1_IAA_30M_REP1	0.0	0.0	3.0	0.0	0.0	0.0	1.0	55.0	36.0	632.0	...	4.357	0.0	1.0	0.0	1.0	0.0	3.0	9.0	0.0	0.0
RAP1_UNINDUCED_REP1	0.0	8.572	8.0	0.0	0.0	0.0	0.0	72.0	33.0	810.0	...	15.72	0.0	0.0	0.0	3.0	2.0	13.0	28.0	0.0	0.0
RAP1_UNINDUCED_REP2	0.0	0.0	4.0	0.0	0.0	0.0	0.0	115.0	82.0	1693.0	...	13.772	0.0	4.0	0.0	2.0	0.0	8.0	24.0	0.0	0.0
WT_REP1	0.0	0.0	8.0	0.0	0.0	1.0	0.0	60.0	63.0	1115.0	...	13.465	0.0	0.0	0.0	1.0	0.0	5.0	5.0	0.0	0.0
WT_REP2	0.0	0.0	8.0	0.0	0.0	0.0	0.0	30.0	25.0	704.0	...	6.891	0.0	1.0	0.0	0.0	0.0	11.0	7.0	1.0	0.0

7 rows × 125 columns

Now, it’s clear that the first two rows are in fact no observations, but descriptions of the variables (or features) themselves.

Let’s create an AnnData object to model this. First, create a dataframe for the variables:

var = pd.DataFrame({"gene_name": df.loc["gene_name"].values}, index=df.loc["gene_id"])

var.head()

	gene_name
gene_id
Gfp_transgene_gene	Gfp_transgene_gene
HRA1	HRA1
snR18	snR18
tA(UGC)A	TGA1
tL(CAA)A	SUP56

Now, let’s create an AnnData:

# we're also fixing the datatype here, which was string in the tsv
adata = ad.AnnData(df.iloc[2:].astype("float32"), var=var)

adata

AnnData object with n_obs × n_vars = 5 × 125
    var: 'gene_name'

The AnnData object is in tidy form and complies with conventions of statistics and machine learning:

adata.to_df()

gene_id	HRA1	snR18	tP(UGG)A	tS(AGA)A	YAL001C	YAL002W	YAL003W	...	YAR050W	YAR060C	YAR062W	YAR064W	YAR066W	YAR068W	YAR069C
RAP1_IAA_30M_REP1	0.000	3.0	0.0	1.0	55.0	36.0	632.0	...	4.357	1.0	1.0	0.0	3.0	9.0	0.0
RAP1_UNINDUCED_REP1	8.572	8.0	0.0	0.0	72.0	33.0	810.0	...	15.720	0.0	3.0	2.0	13.0	28.0	0.0
RAP1_UNINDUCED_REP2	0.000	4.0	0.0	0.0	115.0	82.0	1693.0	...	13.772	4.0	2.0	0.0	8.0	24.0	0.0
WT_REP1	0.000	8.0	1.0	0.0	60.0	63.0	1115.0	...	13.465	0.0	1.0	0.0	5.0	5.0	0.0
WT_REP2	0.000	8.0	0.0	0.0	30.0	25.0	704.0	...	6.891	1.0	0.0	0.0	11.0	7.0	1.0

5 rows × 125 columns

Validate #

Let’s create a File object from this AnnData. Because this will validate the gene IDs and these are only defined given a species, we have to set a species context:

lb.settings.species = "saccharomyces cerevisiae"

✅ set species: Species(id='nn8c', name='saccharomyces cerevisiae', taxon_id=559292, scientific_name='saccharomyces_cerevisiae', updated_at=2023-08-28 14:19:05, bionty_source_id='3yh0', created_by_id='DzTjkKse')

Almost all gene IDs are validated:

genes = lb.Gene.from_values(adata.var.index, lb.Gene.stable_id)

💡 using global setting species = saccharomyces cerevisiae

✅ created 123 Gene records from Bionty matching stable_id: HRA1, snR18, tA(UGC)A, tL(CAA)A, tP(UGG)A, tS(AGA)A, YAL001C, YAL002W, YAL003W, YAL004W, YAL005C, YAL007C, YAL008W, YAL009W, YAL010C, YAL011W, YAL012W, YAL013W, YAL014C, YAL015C, ...

❗ did not create Gene records for 2 non-validated stable_ids: Gfp_transgene_gene, YAR062W

# also register the 2 non-validated genes
ln.save(genes)

assays = lb.ExperimentalFactor.lookup()
assays.rna_seq

ExperimentalFactor(id='J0KT2nxy', name='RNA-Seq', ontology_id='EFO:0008896', description='Rna-Seq Is A Method That Involves Purifying Rna And Making Cdna, Followed By High-Throughput Sequencing.', molecule='RNA assay', instrument='assay by high throughput sequencer', updated_at=2023-08-28 14:19:00, bionty_source_id='e9op', created_by_id='DzTjkKse')

modality = ln.Modality.filter(name="rna").one()

Register #

curated_file = ln.File.from_anndata(
    adata, description="Curated bulk RNA counts", var_ref=lb.Gene.stable_id
)

💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/ESL9RqgZckZVr66FtEy3.h5ad')

💡 parsing feature names of X stored in slot 'var'

💡    using global setting species = saccharomyces cerevisiae

✅    123 terms (98.40%) are validated for stable_id

❗    2 terms (1.60%) are not validated for stable_id: Gfp_transgene_gene, YAR062W

💡    using global setting species = saccharomyces cerevisiae

✅    linked: FeatureSet(id='tZU9vVoiDZH3Kek2jx84', n=123, type='float', registry='bionty.Gene', hash='8j8y_AHnWb5huZ2hXCDj', created_by_id='DzTjkKse')

Hence, let’s save this file:

curated_file.save()

✅ saved 1 feature set for slot: 'var'

✅ storing file 'ESL9RqgZckZVr66FtEy3' at '.lamindb/ESL9RqgZckZVr66FtEy3.h5ad'

curated_file.add_labels(assays.rna_seq, "assay")
curated_file.add_labels(lb.settings.species, "species")

✅ linked new feature 'assay' together with new feature set FeatureSet(id='ZbEXXZxaUeQlqt2Ltbzr', n=1, registry='core.Feature', hash='hM0UARNMAFiQ0sE_c5Ao', updated_at=2023-08-28 14:19:07, modality_id='sFgWiclW', created_by_id='DzTjkKse')

💡 no file links to it anymore, deleting feature set FeatureSet(id='ZbEXXZxaUeQlqt2Ltbzr', n=1, registry='core.Feature', hash='hM0UARNMAFiQ0sE_c5Ao', updated_at=2023-08-28 14:19:07, modality_id='sFgWiclW', created_by_id='DzTjkKse')

✅ linked new feature 'species' together with new feature set FeatureSet(id='9yiZiPqwE0hXJdOc4vp3', n=2, registry='core.Feature', hash='CddsvmYChNI_YZdvJrqm', updated_at=2023-08-28 14:19:07, modality_id='sFgWiclW', created_by_id='DzTjkKse')

var_feature_set = curated_file.features.get_feature_set("var")
var_feature_set.modality = modality
var_feature_set.save()

curated_file.describe()

💡 File(id='ESL9RqgZckZVr66FtEy3', key=None, suffix='.h5ad', accessor='AnnData', description='Curated bulk RNA counts', version=None, size=28180, hash='6bieh8XjOCCz6bJToN4u1g', hash_type='md5', created_at=2023-08-28 14:19:07, updated_at=2023-08-28 14:19:07)

Provenance:
    🗃️ storage: Storage(id='Uh08gONd', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-bulkrna', type='local', updated_at=2023-08-28 14:18:57, created_by_id='DzTjkKse')
    💫 transform: Transform(id='s5V0dNMVwL9iz8', name='Validate & register bulk RNA-seq datasets', short_name='bulkrna', version='0', type=notebook, updated_at=2023-08-28 14:19:07, created_by_id='DzTjkKse')
    👣 run: Run(id='TnGfR4Zt3bidzLPLH27h', run_at=2023-08-28 14:19:05, transform_id='s5V0dNMVwL9iz8', created_by_id='DzTjkKse')
    👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 14:18:57)
Features:
  var (X):
    🔗 index (123, bionty.Gene.id): ['S88EOcpTXVHb', 'NWVWpw07qq1T', '9y7rNv1fwwFQ', 'v0vWU0Rtl9j4', '3cxj2SWsUJeB'...]
  external:
    🔗 assay (1, bionty.ExperimentalFactor): ['RNA-Seq']
    🔗 species (1, bionty.Species): ['saccharomyces cerevisiae']

Example queries#

We have two files in the file registry:

ln.File.filter().df()

	storage_id	key	suffix	accessor	description	version	initial_version_id	size	hash	hash_type	transform_id	run_id	updated_at	created_by_id
id
Hx0Ge2qqFuzFi78VrgTN	Uh08gONd	output_dir/salmon.merged.gene_counts.tsv	.tsv	None	Merged Bulk RNA counts	None	None	3787	xxw0k3au3KtxFcgtbEr4eQ	md5	vFJyCpNqUbP3hW	LOSlzGEX8U0ZMjDWt5nb	2023-08-28 14:19:04	DzTjkKse
ESL9RqgZckZVr66FtEy3	Uh08gONd	None	.h5ad	AnnData	Curated bulk RNA counts	None	None	28180	6bieh8XjOCCz6bJToN4u1g	md5	s5V0dNMVwL9iz8	TnGfR4Zt3bidzLPLH27h	2023-08-28 14:19:07	DzTjkKse

curated_file.view_lineage()

https://d33wubrfki0l68.cloudfront.net/66fbac1c01d99368db55169dc0ea7c9aafcbd023/5e531/_images/f874ed32dfce008bacb9532b6260de269027e5c4b1ad9d0f20568b713c1e6206.svg

Let’s by query by gene:

# lookup object for auto-complete
genes = lb.Gene.lookup()

genes.spo7

Gene(id='tpzwPcGZFK1y', symbol='SPO7', stable_id='YAL009W', ncbi_gene_ids='851224', biotype='protein_coding', description='Putative regulatory subunit of Nem1p-Spo7p phosphatase holoenzyme; regulates nuclear growth by controlling phospholipid biosynthesis, required for normal nuclear envelope morphology, premeiotic replication, and sporulation [Source:SGD;Acc:S000000007]', synonyms='', updated_at=2023-08-28 14:19:07, species_id='nn8c', bionty_source_id='JO2q', created_by_id='DzTjkKse')

# a feature set containing SPO7 gene and contain RNA measurements
feature_set = ln.FeatureSet.filter(genes=genes.spo7, modality__name="rna").first()

# files that link to this feature set
ln.File.filter(feature_sets=feature_set).df()

	storage_id	key	suffix	accessor	description	version	initial_version_id	size	hash	hash_type	transform_id	run_id	updated_at	created_by_id
id
ESL9RqgZckZVr66FtEy3	Uh08gONd	None	.h5ad	AnnData	Curated bulk RNA counts	None	None	28180	6bieh8XjOCCz6bJToN4u1g	md5	s5V0dNMVwL9iz8	TnGfR4Zt3bidzLPLH27h	2023-08-28 14:19:07	DzTjkKse