Jupyter Notebook

Validate & register bulk RNA-seq datasets#

!lamin init --storage test-bulkrna --schema bionty
Hide code cell output
💡 creating schemas: core==0.46.1 bionty==0.30.0 
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 14:18:57)
✅ saved: Storage(id='Uh08gONd', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-bulkrna', type='local', updated_at=2023-08-28 14:18:57, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/test-bulkrna
💡 did not register local instance on hub (if you want, call `lamin register`)

import lamindb as ln
from pathlib import Path
import lnschema_bionty as lb
import pandas as pd
import anndata as ad
✅ loaded instance: testuser1/test-bulkrna (lamindb 0.51.0)

Let’s start simulating the output of a bulk RNA-seq pipeline:

Hide code cell content
# pretned we're running a bulk RNA-seq pipeline
ln.track(ln.Transform(name="nf-core RNA-seq", reference="https://nf-co.re/rnaseq"))
# create a directory for its output
Path("./test-bulkrna/output_dir").mkdir(exist_ok=True)
# get the count matrix
path = ln.dev.datasets.file_tsv_rnaseq_nfcore_salmon_merged_gene_counts(
    populate_registries=True
)
# move it into the output directory
path = path.rename(f"./test-bulkrna/output_dir/{path.name}")
# register it
ln.File(path, description="Merged Bulk RNA counts").save()
✅ saved: Transform(id='vFJyCpNqUbP3hW', name='nf-core RNA-seq', type=notebook, reference='https://nf-co.re/rnaseq', updated_at=2023-08-28 14:18:59, created_by_id='DzTjkKse')
✅ saved: Run(id='LOSlzGEX8U0ZMjDWt5nb', run_at=2023-08-28 14:18:59, transform_id='vFJyCpNqUbP3hW', created_by_id='DzTjkKse')
❗ file has more than two suffixes (path.suffixes), using only last suffix: '.tsv'
💡 file in storage 'test-bulkrna' with key 'output_dir/salmon.merged.gene_counts.tsv'

Curate the count matrix #

ln.track()
💡 notebook imports: anndata==0.9.2 lamindb==0.51.0 lnschema_bionty==0.30.0 pandas==1.5.3
✅ saved: Transform(id='s5V0dNMVwL9iz8', name='Validate & register bulk RNA-seq datasets', short_name='bulkrna', version='0', type=notebook, updated_at=2023-08-28 14:19:05, created_by_id='DzTjkKse')
✅ saved: Run(id='TnGfR4Zt3bidzLPLH27h', run_at=2023-08-28 14:19:05, transform_id='s5V0dNMVwL9iz8', created_by_id='DzTjkKse')

Let’s query the file:

file = ln.File.filter(description="Merged Bulk RNA counts").one()
df = file.load()
💡 adding file Hx0Ge2qqFuzFi78VrgTN as input for run TnGfR4Zt3bidzLPLH27h, adding parent transform vFJyCpNqUbP3hW

If we look at it, we realize it deviates far from the tidy data standard [Wickham14], conventions of statistics & machine learning [Hastie09, Murphy12] and the major Python & R data packages.

Variables aren’t in columns and observations aren’t in rows:

df
gene_id gene_name RAP1_IAA_30M_REP1 RAP1_UNINDUCED_REP1 RAP1_UNINDUCED_REP2 WT_REP1 WT_REP2
0 Gfp_transgene_gene Gfp_transgene_gene 0.0 0.000 0.0 0.0 0.0
1 HRA1 HRA1 0.0 8.572 0.0 0.0 0.0
2 snR18 snR18 3.0 8.000 4.0 8.0 8.0
3 tA(UGC)A TGA1 0.0 0.000 0.0 0.0 0.0
4 tL(CAA)A SUP56 0.0 0.000 0.0 0.0 0.0
... ... ... ... ... ... ... ...
120 YAR064W YAR064W 0.0 2.000 0.0 0.0 0.0
121 YAR066W YAR066W 3.0 13.000 8.0 5.0 11.0
122 YAR068W YAR068W 9.0 28.000 24.0 5.0 7.0
123 YAR069C YAR069C 0.0 0.000 0.0 0.0 1.0
124 YAR070C YAR070C 0.0 0.000 0.0 0.0 0.0

125 rows × 7 columns

Let’s change that and move observations into rows:

df = df.T

df
0 1 2 3 4 5 6 7 8 9 ... 115 116 117 118 119 120 121 122 123 124
gene_id Gfp_transgene_gene HRA1 snR18 tA(UGC)A tL(CAA)A tP(UGG)A tS(AGA)A YAL001C YAL002W YAL003W ... YAR050W YAR053W YAR060C YAR061W YAR062W YAR064W YAR066W YAR068W YAR069C YAR070C
gene_name Gfp_transgene_gene HRA1 snR18 TGA1 SUP56 TRN1 tS(AGA)A TFC3 VPS8 EFB1 ... FLO1 YAR053W YAR060C YAR061W YAR062W YAR064W YAR066W YAR068W YAR069C YAR070C
RAP1_IAA_30M_REP1 0.0 0.0 3.0 0.0 0.0 0.0 1.0 55.0 36.0 632.0 ... 4.357 0.0 1.0 0.0 1.0 0.0 3.0 9.0 0.0 0.0
RAP1_UNINDUCED_REP1 0.0 8.572 8.0 0.0 0.0 0.0 0.0 72.0 33.0 810.0 ... 15.72 0.0 0.0 0.0 3.0 2.0 13.0 28.0 0.0 0.0
RAP1_UNINDUCED_REP2 0.0 0.0 4.0 0.0 0.0 0.0 0.0 115.0 82.0 1693.0 ... 13.772 0.0 4.0 0.0 2.0 0.0 8.0 24.0 0.0 0.0
WT_REP1 0.0 0.0 8.0 0.0 0.0 1.0 0.0 60.0 63.0 1115.0 ... 13.465 0.0 0.0 0.0 1.0 0.0 5.0 5.0 0.0 0.0
WT_REP2 0.0 0.0 8.0 0.0 0.0 0.0 0.0 30.0 25.0 704.0 ... 6.891 0.0 1.0 0.0 0.0 0.0 11.0 7.0 1.0 0.0

7 rows × 125 columns

Now, it’s clear that the first two rows are in fact no observations, but descriptions of the variables (or features) themselves.

Let’s create an AnnData object to model this. First, create a dataframe for the variables:

var = pd.DataFrame({"gene_name": df.loc["gene_name"].values}, index=df.loc["gene_id"])
var.head()
gene_name
gene_id
Gfp_transgene_gene Gfp_transgene_gene
HRA1 HRA1
snR18 snR18
tA(UGC)A TGA1
tL(CAA)A SUP56

Now, let’s create an AnnData:

# we're also fixing the datatype here, which was string in the tsv
adata = ad.AnnData(df.iloc[2:].astype("float32"), var=var)

adata
AnnData object with n_obs × n_vars = 5 × 125
    var: 'gene_name'

The AnnData object is in tidy form and complies with conventions of statistics and machine learning:

adata.to_df()
gene_id Gfp_transgene_gene HRA1 snR18 tA(UGC)A tL(CAA)A tP(UGG)A tS(AGA)A YAL001C YAL002W YAL003W ... YAR050W YAR053W YAR060C YAR061W YAR062W YAR064W YAR066W YAR068W YAR069C YAR070C
RAP1_IAA_30M_REP1 0.0 0.000 3.0 0.0 0.0 0.0 1.0 55.0 36.0 632.0 ... 4.357 0.0 1.0 0.0 1.0 0.0 3.0 9.0 0.0 0.0
RAP1_UNINDUCED_REP1 0.0 8.572 8.0 0.0 0.0 0.0 0.0 72.0 33.0 810.0 ... 15.720 0.0 0.0 0.0 3.0 2.0 13.0 28.0 0.0 0.0
RAP1_UNINDUCED_REP2 0.0 0.000 4.0 0.0 0.0 0.0 0.0 115.0 82.0 1693.0 ... 13.772 0.0 4.0 0.0 2.0 0.0 8.0 24.0 0.0 0.0
WT_REP1 0.0 0.000 8.0 0.0 0.0 1.0 0.0 60.0 63.0 1115.0 ... 13.465 0.0 0.0 0.0 1.0 0.0 5.0 5.0 0.0 0.0
WT_REP2 0.0 0.000 8.0 0.0 0.0 0.0 0.0 30.0 25.0 704.0 ... 6.891 0.0 1.0 0.0 0.0 0.0 11.0 7.0 1.0 0.0

5 rows × 125 columns

Validate #

Let’s create a File object from this AnnData. Because this will validate the gene IDs and these are only defined given a species, we have to set a species context:

lb.settings.species = "saccharomyces cerevisiae"
✅ set species: Species(id='nn8c', name='saccharomyces cerevisiae', taxon_id=559292, scientific_name='saccharomyces_cerevisiae', updated_at=2023-08-28 14:19:05, bionty_source_id='3yh0', created_by_id='DzTjkKse')

Almost all gene IDs are validated:

genes = lb.Gene.from_values(adata.var.index, lb.Gene.stable_id)
💡 using global setting species = saccharomyces cerevisiae


✅ created 123 Gene records from Bionty matching stable_id: HRA1, snR18, tA(UGC)A, tL(CAA)A, tP(UGG)A, tS(AGA)A, YAL001C, YAL002W, YAL003W, YAL004W, YAL005C, YAL007C, YAL008W, YAL009W, YAL010C, YAL011W, YAL012W, YAL013W, YAL014C, YAL015C, ...
did not create Gene records for 2 non-validated stable_ids: Gfp_transgene_gene, YAR062W
# also register the 2 non-validated genes
ln.save(genes)
assays = lb.ExperimentalFactor.lookup()
assays.rna_seq
ExperimentalFactor(id='J0KT2nxy', name='RNA-Seq', ontology_id='EFO:0008896', description='Rna-Seq Is A Method That Involves Purifying Rna And Making Cdna, Followed By High-Throughput Sequencing.', molecule='RNA assay', instrument='assay by high throughput sequencer', updated_at=2023-08-28 14:19:00, bionty_source_id='e9op', created_by_id='DzTjkKse')
modality = ln.Modality.filter(name="rna").one()

Register #

curated_file = ln.File.from_anndata(
    adata, description="Curated bulk RNA counts", var_ref=lb.Gene.stable_id
)
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/ESL9RqgZckZVr66FtEy3.h5ad')
💡 parsing feature names of X stored in slot 'var'
💡    using global setting species = saccharomyces cerevisiae
123 terms (98.40%) are validated for stable_id
2 terms (1.60%) are not validated for stable_id: Gfp_transgene_gene, YAR062W
💡    using global setting species = saccharomyces cerevisiae
✅    linked: FeatureSet(id='tZU9vVoiDZH3Kek2jx84', n=123, type='float', registry='bionty.Gene', hash='8j8y_AHnWb5huZ2hXCDj', created_by_id='DzTjkKse')

Hence, let’s save this file:

curated_file.save()
✅ saved 1 feature set for slot: 'var'
✅ storing file 'ESL9RqgZckZVr66FtEy3' at '.lamindb/ESL9RqgZckZVr66FtEy3.h5ad'
curated_file.add_labels(assays.rna_seq, "assay")
curated_file.add_labels(lb.settings.species, "species")
✅ linked new feature 'assay' together with new feature set FeatureSet(id='ZbEXXZxaUeQlqt2Ltbzr', n=1, registry='core.Feature', hash='hM0UARNMAFiQ0sE_c5Ao', updated_at=2023-08-28 14:19:07, modality_id='sFgWiclW', created_by_id='DzTjkKse')
💡 no file links to it anymore, deleting feature set FeatureSet(id='ZbEXXZxaUeQlqt2Ltbzr', n=1, registry='core.Feature', hash='hM0UARNMAFiQ0sE_c5Ao', updated_at=2023-08-28 14:19:07, modality_id='sFgWiclW', created_by_id='DzTjkKse')
✅ linked new feature 'species' together with new feature set FeatureSet(id='9yiZiPqwE0hXJdOc4vp3', n=2, registry='core.Feature', hash='CddsvmYChNI_YZdvJrqm', updated_at=2023-08-28 14:19:07, modality_id='sFgWiclW', created_by_id='DzTjkKse')
var_feature_set = curated_file.features.get_feature_set("var")
var_feature_set.modality = modality
var_feature_set.save()
curated_file.describe()
💡 File(id='ESL9RqgZckZVr66FtEy3', key=None, suffix='.h5ad', accessor='AnnData', description='Curated bulk RNA counts', version=None, size=28180, hash='6bieh8XjOCCz6bJToN4u1g', hash_type='md5', created_at=2023-08-28 14:19:07, updated_at=2023-08-28 14:19:07)

Provenance:
    🗃️ storage: Storage(id='Uh08gONd', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-bulkrna', type='local', updated_at=2023-08-28 14:18:57, created_by_id='DzTjkKse')
    💫 transform: Transform(id='s5V0dNMVwL9iz8', name='Validate & register bulk RNA-seq datasets', short_name='bulkrna', version='0', type=notebook, updated_at=2023-08-28 14:19:07, created_by_id='DzTjkKse')
    👣 run: Run(id='TnGfR4Zt3bidzLPLH27h', run_at=2023-08-28 14:19:05, transform_id='s5V0dNMVwL9iz8', created_by_id='DzTjkKse')
    👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 14:18:57)
Features:
  var (X):
    🔗 index (123, bionty.Gene.id): ['S88EOcpTXVHb', 'NWVWpw07qq1T', '9y7rNv1fwwFQ', 'v0vWU0Rtl9j4', '3cxj2SWsUJeB'...]
  external:
    🔗 assay (1, bionty.ExperimentalFactor): ['RNA-Seq']
    🔗 species (1, bionty.Species): ['saccharomyces cerevisiae']

Example queries#

We have two files in the file registry:

ln.File.filter().df()
storage_id key suffix accessor description version initial_version_id size hash hash_type transform_id run_id updated_at created_by_id
id
Hx0Ge2qqFuzFi78VrgTN Uh08gONd output_dir/salmon.merged.gene_counts.tsv .tsv None Merged Bulk RNA counts None None 3787 xxw0k3au3KtxFcgtbEr4eQ md5 vFJyCpNqUbP3hW LOSlzGEX8U0ZMjDWt5nb 2023-08-28 14:19:04 DzTjkKse
ESL9RqgZckZVr66FtEy3 Uh08gONd None .h5ad AnnData Curated bulk RNA counts None None 28180 6bieh8XjOCCz6bJToN4u1g md5 s5V0dNMVwL9iz8 TnGfR4Zt3bidzLPLH27h 2023-08-28 14:19:07 DzTjkKse
curated_file.view_lineage()
https://d33wubrfki0l68.cloudfront.net/66fbac1c01d99368db55169dc0ea7c9aafcbd023/5e531/_images/f874ed32dfce008bacb9532b6260de269027e5c4b1ad9d0f20568b713c1e6206.svg

Let’s by query by gene:

# lookup object for auto-complete
genes = lb.Gene.lookup()
genes.spo7
Gene(id='tpzwPcGZFK1y', symbol='SPO7', stable_id='YAL009W', ncbi_gene_ids='851224', biotype='protein_coding', description='Putative regulatory subunit of Nem1p-Spo7p phosphatase holoenzyme; regulates nuclear growth by controlling phospholipid biosynthesis, required for normal nuclear envelope morphology, premeiotic replication, and sporulation [Source:SGD;Acc:S000000007]', synonyms='', updated_at=2023-08-28 14:19:07, species_id='nn8c', bionty_source_id='JO2q', created_by_id='DzTjkKse')
# a feature set containing SPO7 gene and contain RNA measurements
feature_set = ln.FeatureSet.filter(genes=genes.spo7, modality__name="rna").first()
# files that link to this feature set
ln.File.filter(feature_sets=feature_set).df()
storage_id key suffix accessor description version initial_version_id size hash hash_type transform_id run_id updated_at created_by_id
id
ESL9RqgZckZVr66FtEy3 Uh08gONd None .h5ad AnnData Curated bulk RNA counts None None 28180 6bieh8XjOCCz6bJToN4u1g md5 s5V0dNMVwL9iz8 TnGfR4Zt3bidzLPLH27h 2023-08-28 14:19:07 DzTjkKse
Hide code cell content
!lamin delete --force test-bulkrna
!rm -r test-bulkrna
💡 deleting instance testuser1/test-bulkrna
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-bulkrna.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-bulkrna