
Validate & register scRNA-seq datasets

Single-cell RNA-seq (scRNA-seq) measures gene expression in individual cells and generates datasets that are often used to define cell states associated with functional phenotypes. Data formats such as AnnData and SingleCellExperiment objects help store data and metadata as a single entity. However, the stored metadata are often not validated, which makes it hard to integrate them with other datasets.

In this notebook, we show how Lamin can help manage scRNA-seq data.

!lamin init --storage ./test-scrna --schema bionty
💡 creating schemas: core==0.46.1 bionty==0.30.0
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 14:17:22)
✅ saved: Storage(id='ljWPEsjj', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-08-28 14:17:23, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/test-scrna
💡 did not register local instance on hub (if you want, call `lamin register`)

import lamindb as ln
import lnschema_bionty as lb
✅ loaded instance: testuser1/test-scrna (lamindb 0.51.0)
ln.track()
💡 notebook imports: lamindb==0.51.0 lnschema_bionty==0.30.0
✅ saved: Transform(id='Nv48yAceNSh8z8', name='Validate & register scRNA-seq datasets', short_name='scrna', version='0', type=notebook, updated_at=2023-08-28 14:17:25, created_by_id='DzTjkKse')
✅ saved: Run(id='p3vdxLrlEIijgVQLdUd0', run_at=2023-08-28 14:17:25, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')

Human immune cells: Conde22

lb.settings.species = "human"
✅ set species: Species(id='uHJU', name='human', taxon_id=9606, scientific_name='homo_sapiens', updated_at=2023-08-28 14:17:27, bionty_source_id='yKRW', created_by_id='DzTjkKse')

Transform

(Here we skip the data transformation steps, which often include filtering, normalizing, or formatting data.)

Let’s look at an scRNA-seq count matrix in the form of an AnnData object:

adata = ln.dev.datasets.anndata_human_immune_cells(
    populate_registries=True  # pre-populate registries to simulate a used instance
)

adata
AnnData object with n_obs × n_vars = 1648 × 36503
    obs: 'donor', 'tissue', 'cell_type', 'assay'
    var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
    uns: 'cell_type_ontology_term_id_colors', 'default_embedding', 'schema_version', 'title'
    obsm: 'X_umap'

Validate

Validate genes in .var

lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id);
💡 using global setting species = human
✅ 36355 terms (99.60%) are validated for ensembl_gene_id
❗ 148 terms (0.40%) are not validated for ensembl_gene_id: ENSG00000269933, ENSG00000261737, ENSG00000259834, ENSG00000256374, ENSG00000263464, ENSG00000203812, ENSG00000272196, ENSG00000272880, ENSG00000270188, ENSG00000287116, ENSG00000237133, ENSG00000224739, ENSG00000227902, ENSG00000239467, ENSG00000272551, ENSG00000280374, ENSG00000236886, ENSG00000229352, ENSG00000286601, ENSG00000227021, ...
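
Conceptually, validation is a membership check of each identifier against a registry, followed by a summary. The following plain-Python sketch illustrates that idea only; it is not Lamin's implementation, and the function names and the toy registry are hypothetical.

```python
def validate(values, registry):
    """Return one boolean per value: True if the value is in the registry."""
    known = set(registry)
    return [value in known for value in values]


def summarize(mask):
    """Return (n_validated, n_not_validated, percent_validated)."""
    n_total = len(mask)
    n_valid = sum(mask)
    percent = 100 * n_valid / n_total if n_total else 0.0
    return n_valid, n_total - n_valid, percent


# Hypothetical toy registry and query, not the real Gene registry.
toy_registry = ["ENSG00000000003", "ENSG00000000005", "ENSG00000000419"]
toy_query = ["ENSG00000000003", "ENSG00000269933", "ENSG00000000419"]

mask = validate(toy_query, toy_registry)
n_valid, n_invalid, percent = summarize(mask)
print(mask, n_valid, n_invalid, f"{percent:.2f}%")
```

The boolean mask has one entry per input value, which is what makes it usable for indexing later in this notebook.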

We see that 148 gene identifiers can’t be validated because they are not currently in the Gene registry. We’d like to validate all features in this dataset, so let’s inspect them to see what to do:

inspect_result = lb.Gene.inspect(adata.var.index, lb.Gene.ensembl_gene_id)
💡 using global setting species = human
✅ 36355 terms (99.60%) are validated for ensembl_gene_id
❗ 148 terms (0.40%) are not validated for ensembl_gene_id: ENSG00000269933, ENSG00000261737, ENSG00000259834, ENSG00000256374, ENSG00000263464, ENSG00000203812, ENSG00000272196, ENSG00000272880, ENSG00000270188, ENSG00000287116, ENSG00000237133, ENSG00000224739, ENSG00000227902, ENSG00000239467, ENSG00000272551, ENSG00000280374, ENSG00000236886, ENSG00000229352, ENSG00000286601, ENSG00000227021, ...
💡 using global setting species = human
💡    detected 35 terms in Bionty for ensembl_gene_id: ENSG00000198804, ENSG00000274175, ENSG00000276760, ENSG00000198938, ENSG00000278817, ENSG00000271254, ENSG00000274847, ENSG00000277836, ENSG00000277630, ENSG00000275869, ENSG00000278704, ENSG00000276256, ENSG00000198786, ENSG00000228253, ENSG00000278633, ENSG00000278384, ENSG00000198899, ENSG00000277856, ENSG00000198886, ENSG00000268674, ...
💡 →  add records from Bionty to your registry via .from_values()
💡    couldn't validate 113 terms: ENSG00000278782, ENSG00000226380, ENSG00000256892, ENSG00000273576, ENSG00000237838, ENSG00000271409, ENSG00000273888, ENSG00000273496, ENSG00000272267, ENSG00000273301, ENSG00000261438, ENSG00000280095, ENSG00000233776, ENSG00000273370, ENSG00000276814, ENSG00000272880, ENSG00000236996, ENSG00000259834, ENSG00000249860, ENSG00000215271, ...
💡 →  if you are sure, add records to your registry via .from_values()
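
The inspect log splits the 148 non-validated identifiers into 35 that exist in the Bionty reference and 113 that cannot be found anywhere (35 + 113 = 148). A minimal sketch of such a three-way split, assuming a local registry and a larger public reference that are both simple collections of identifiers; the function name and the toy identifiers are hypothetical and this is not Lamin's code:

```python
def inspect_values(values, registry, reference):
    """Split values into validated, found-in-reference, and unknown."""
    in_registry = set(registry)
    in_reference = set(reference)
    result = {"validated": [], "in_reference": [], "unknown": []}
    for value in dict.fromkeys(values):  # drop duplicates, keep order
        if value in in_registry:
            result["validated"].append(value)
        elif value in in_reference:
            result["in_reference"].append(value)
        else:
            result["unknown"].append(value)
    return result


# Hypothetical identifiers: the registry is a subset of the reference.
toy_registry = ["GENE_A", "GENE_B"]
toy_reference = ["GENE_A", "GENE_B", "GENE_C"]
toy_query = ["GENE_A", "GENE_C", "GENE_X", "GENE_C"]

report = inspect_values(toy_query, toy_registry, toy_reference)
print(report)
```

Values in the middle bucket can be imported from the reference, while values in the last bucket need a deliberate decision before they are added.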

The inspect log says that 35 of the non-validated ensembl_gene_ids can be found in the Bionty reference. Let’s register them:

records_bionty = lb.Gene.from_values(
    inspect_result.non_validated, lb.Gene.ensembl_gene_id
)
ln.save(records_bionty)
💡 using global setting species = human
✅ created 35 Gene records from Bionty matching ensembl_gene_id: ENSG00000198804, ENSG00000198712, ENSG00000228253, ENSG00000198899, ENSG00000198938, ENSG00000198840, ENSG00000212907, ENSG00000198886, ENSG00000198786, ENSG00000198695, ENSG00000198727, ENSG00000278704, ENSG00000277400, ENSG00000274847, ENSG00000276256, ENSG00000277630, ENSG00000278384, ENSG00000273748, ENSG00000271254, ENSG00000277475, ...
❗ did not create Gene records for 113 non-validated ensembl_gene_ids: ENSG00000112096, ENSG00000182230, ENSG00000203812, ENSG00000204092, ENSG00000215271, ENSG00000221995, ENSG00000224739, ENSG00000224745, ENSG00000225932, ENSG00000226377, ENSG00000226380, ENSG00000226403, ENSG00000227021, ENSG00000227220, ENSG00000227902, ENSG00000228139, ENSG00000228906, ENSG00000229352, ENSG00000231575, ENSG00000232196, ...

The remaining 113 aren’t present in the current Ensembl assembly (e.g. ENSG00000112096).

We’d still like to register them, so let’s create Gene records with those ensembl_gene_ids:

validated = lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id, mute=True)
nonval_ensembl_ids = adata.var.index[~validated]
new_records = [
    lb.Gene(ensembl_gene_id=ens_id, species=lb.settings.species)
    for ens_id in nonval_ensembl_ids
]
ln.save(new_records)
💡 using global setting species = human

Now all genes pass validation:

lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id);
💡 using global setting species = human
✅ 36503 terms (100.00%) are validated for ensembl_gene_id

Validate metadata in .obs

adata.obs.columns
Index(['donor', 'tissue', 'cell_type', 'assay'], dtype='object')

1 feature is not validated: donor

validated = ln.Feature.validate(adata.obs.columns)
✅ 3 terms (75.00%) are validated for name
❗ 1 term (25.00%) is not validated for name: donor

Let’s register it:

features = ln.Feature.from_df(adata.obs)
ln.save(features)

All metadata columns are now validated as features:

ln.Feature.validate(adata.obs.columns);
✅ 4 terms (100.00%) are validated for name

Next, let’s validate the corresponding labels of each feature.

Some of the metadata labels can be typed using dedicated registries (e.g. bionty offers ontology-based registries for biological entities):

validated = lb.CellType.validate(adata.obs.cell_type)
❗ received 32 unique terms, 1616 empty/duplicated terms are ignored
✅ 30 terms (93.80%) are validated for name
❗ 2 terms (6.20%) are not validated for name: germinal center B cell, megakaryocyte
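
The counts in this log are consistent with deduplicating the column before validating it: the dataset has 1648 observations, 32 of which carry distinct cell type names, so 1648 - 32 = 1616 entries are reported as empty or duplicated. A small sketch of that bookkeeping on a toy column, assuming empty entries are `None` or the empty string (the function name is hypothetical):

```python
def count_terms(values):
    """Return (unique non-empty terms in first-seen order, number of ignored entries)."""
    non_empty = [v for v in values if v not in (None, "")]
    unique = list(dict.fromkeys(non_empty))
    ignored = len(values) - len(unique)
    return unique, ignored


# Toy per-cell annotation column with duplicates and two empty entries.
toy_column = ["T cell"] * 5 + ["B cell"] * 3 + ["", None] + ["megakaryocyte"]

unique_terms, n_ignored = count_terms(toy_column)
print(unique_terms, n_ignored)
```

Only the unique terms need to be checked against the registry, which keeps validation cheap even for large datasets.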

Register non-validated cell types from Bionty:

nonval_cell_type_records = lb.CellType.from_values(
    adata.obs.cell_type[~validated], "name"
)
ln.save(nonval_cell_type_records)
✅ created 2 CellType records from Bionty matching name: germinal center B cell, megakaryocyte
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='uMLhrmbZ', name='germinal center B cell', ontology_id='CL:0000844', synonyms='GC B-cell|GC B cell|GC B lymphocyte|germinal center B lymphocyte|GC B-lymphocyte|germinal center B-cell|germinal center B-lymphocyte', description='A Rapidly Cycling Mature B Cell That Has Distinct Phenotypic Characteristics And Is Involved In T-Dependent Immune Responses And Located Typically In The Germinal Centers Of Lymph Nodes. This Cell Type Expresses Ly77 After Activation.', updated_at=2023-08-28 14:17:55, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000785
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='0I51jgPp', name='mature B cell', ontology_id='CL:0000785', synonyms='mature B lymphocyte|mature B-cell|mature B-lymphocyte', description='A B Cell That Is Mature, Having Left The Bone Marrow. Initially, These Cells Are Igm-Positive And Igd-Positive, And They Can Be Activated By Antigen.', updated_at=2023-08-28 14:17:56, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0001201
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='CIS4VJI0', name='B cell, CD19-positive', ontology_id='CL:0001201', synonyms='CD19+ B cell|B lymphocyte, CD19-positive|B-lymphocyte, CD19-positive|CD19-positive B cell|B-cell, CD19-positive', description='A B Cell That Is Cd19-Positive.', updated_at=2023-08-28 14:17:57, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000236
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='cx8VcggA', name='B cell', ontology_id='CL:0000236', synonyms='B-cell|B lymphocyte|B-lymphocyte', description='A Lymphocyte Of B Lineage That Is Capable Of B Cell Mediated Immunity.', updated_at=2023-08-28 14:17:58, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000945
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='Z0yFV7vU', name='lymphocyte of B lineage', ontology_id='CL:0000945', description='A Lymphocyte Of B Lineage With The Commitment To Express An Immunoglobulin Complex.', updated_at=2023-08-28 14:17:59, bionty_source_id='YNLz', created_by_id='DzTjkKse')
💡 also saving parents of CellType(id='UrtDirMx', name='megakaryocyte', ontology_id='CL:0000556', synonyms='megalocaryocyte|megalokaryocyte|megacaryocyte', description='A Large Hematopoietic Cell (50 To 100 Micron) With A Lobated Nucleus. Once Mature, This Cell Undergoes Multiple Rounds Of Endomitosis And Cytoplasmic Restructuring To Allow Platelet Formation And Release.', updated_at=2023-08-28 14:17:55, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000763
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='g1zY6vUW', name='myeloid cell', ontology_id='CL:0000763', description='A Cell Of The Monocyte, Granulocyte, Mast Cell, Megakaryocyte, Or Erythroid Lineage.', updated_at=2023-08-28 14:18:00, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000988
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='Q0aQr5JB', name='hematopoietic cell', ontology_id='CL:0000988', synonyms='haematopoietic cell|hemopoietic cell|haemopoietic cell', description='A Cell Of A Hematopoietic Lineage.', updated_at=2023-08-28 14:18:01, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ loaded 1 CellType record matching ontology_id: CL:0000548
✅ created 1 CellType record from Bionty matching ontology_id: CL:0002371
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='QMAH6IlS', name='somatic cell', ontology_id='CL:0002371', description='A Cell Of An Organism That Does Not Pass On Its Genetic Material To The Organism'S Offspring (I.E. A Non-Germ Line Cell).', updated_at=2023-08-28 14:18:02, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ loaded 1 CellType record matching ontology_id: CL:0000548
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000003
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='VT73gpK2', name='native cell', ontology_id='CL:0000003', description='A Cell That Is Found In A Natural Setting, Which Includes Multicellular Organism Cells 'In Vivo' (I.E. Part Of An Organism), And Unicellular Organisms 'In Environment' (I.E. Part Of A Natural Environment).', updated_at=2023-08-28 14:18:03, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000000
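
The log above shows that saving one cell type also saves its ontology parents, then their parents, and so on. A minimal sketch of such a recursive walk, assuming the ontology is available as a plain child-to-parents mapping. The edges below are read off the order of the log for "germinal center B cell" and are an assumption about the real ontology; the function is hypothetical and not Lamin's implementation.

```python
def collect_ancestors(term, parents, seen=None):
    """Return all ancestors of `term`, depth-first, each listed once."""
    if seen is None:
        seen = []
    for parent in parents.get(term, ()):
        if parent not in seen:  # skip terms already collected via another path
            seen.append(parent)
            collect_ancestors(parent, parents, seen)
    return seen


# Toy excerpt of a child -> parents mapping (assumed from the log order).
toy_parents = {
    "germinal center B cell": ["mature B cell"],
    "mature B cell": ["B cell, CD19-positive"],
    "B cell, CD19-positive": ["B cell"],
    "B cell": ["lymphocyte of B lineage"],
}

ancestors = collect_ancestors("germinal center B cell", toy_parents)
print(ancestors)
```

Because each level triggers its own lookup and save, a walk like this is slower than one bulk save, which matches the warning in the log.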
lb.ExperimentalFactor.validate(adata.obs.assay)
lb.Tissue.validate(adata.obs.tissue);
✅ 3 terms (100.00%) are validated for name
✅ 17 terms (100.00%) are validated for name

For metadata that can’t be typed with dedicated registries, we can use the Label registry. In this example, we didn’t mount a custom schema that contains a Donor registry, so we use Label to track donor ids.

ln.Label.validate(adata.obs["donor"]);
❗ received 12 unique terms, 1636 empty/duplicated terms are ignored
❗ 12 terms (100.00%) are not validated for name: D496, 621B, A29, A36, A35, 637C, A52, A37, D503, 640C, A31, 582C

Donor labels are not validated, so let’s register them:

donors = [ln.Label(name=name) for name in adata.obs["donor"].unique()]
ln.save(donors)
ln.Label.validate(adata.obs["donor"]);
✅ 12 terms (100.00%) are validated for name

Validate external metadata

In addition to what’s already in the file, we’d like to link this file with external features including “species” and “assay”:

ln.Feature.validate("species")
ln.Feature.validate("assay");
✅ 1 term (100.00%) is validated for name
✅ 1 term (100.00%) is validated for name

Validate corresponding labels of these features:

Sometimes we don’t remember exactly what a term is called; search can help:

lb.ExperimentalFactor.search("scRNA-seq").head(2)
                            id        synonyms                                           __ratio__
name
single-cell RNA sequencing  068T1Df6  single-cell RNA-seq|scRNA-seq|single cell RNA ...  100.000000
10x 3' v3                   Vep0itYq  10X 3' v3                                          11.111111
scrna = lb.ExperimentalFactor.filter(id="068T1Df6").one()
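
The search result ranks records by a similarity score, and the query matches a synonym of the top hit exactly. The sketch below illustrates the general idea of scoring a query against both names and synonyms. It uses `difflib` from the standard library as a stand-in scorer, so its scores will generally differ from the `__ratio__` values shown above; the function and the toy records are hypothetical.

```python
from difflib import SequenceMatcher


def search(query, records, limit=2):
    """Rank records by the best similarity between the query and name or synonyms.

    `records` maps a record name to a list of synonyms.
    Returns up to `limit` (name, score) pairs, best first, scores in 0..100.
    """
    scored = []
    for name, synonyms in records.items():
        candidates = [name, *synonyms]
        best = max(
            SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
            for candidate in candidates
        )
        scored.append((name, round(100 * best, 1)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]


# Toy records modeled on the two rows of the search result.
toy_records = {
    "single-cell RNA sequencing": ["single-cell RNA-seq", "scRNA-seq"],
    "10x 3' v3": ["10X 3' v3"],
}

results = search("scRNA-seq", toy_records)
print(results)
```

Scoring against synonyms as well as names is what lets an abbreviation such as "scRNA-seq" find a record whose canonical name is spelled out.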

Register

Register data

When we create a File object from an AnnData, we’ll automatically link its feature sets and get information about unmapped categories:

file = ln.File.from_anndata(
    adata, description="Conde22", var_ref=lb.Gene.ensembl_gene_id
)
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/WtEvWQ5KVML36kWeCyJt.h5ad')
💡 parsing feature names of X stored in slot 'var'
💡    using global setting species = human
✅    36503 terms (100.00%) are validated for ensembl_gene_id
💡    using global setting species = human
✅    linked: FeatureSet(id='9aQEHOOwygiZbsB8YESw', n=36503, type='float', registry='bionty.Gene', hash='dnRexHCtxtmOU81_EpoJ', created_by_id='DzTjkKse')
💡 parsing feature names of slot 'obs'
✅    4 terms (100.00%) are validated for name
✅    linked: FeatureSet(id='uMBfdWzGMsFzFg0c1kH5', n=4, registry='core.Feature', hash='_1tVb4jVGQMeUa0HqRVx', modality_id='JZIEMJb5', created_by_id='DzTjkKse')
file.save()
✅ saved 2 feature sets for slots: 'var','obs'
✅ storing file 'WtEvWQ5KVML36kWeCyJt' at '.lamindb/WtEvWQ5KVML36kWeCyJt.h5ad'

The file has the following 2 linked feature sets:

file.features
'var': FeatureSet(id='9aQEHOOwygiZbsB8YESw', n=36503, type='float', registry='bionty.Gene', hash='dnRexHCtxtmOU81_EpoJ', updated_at=2023-08-28 14:18:07, created_by_id='DzTjkKse')
'obs': FeatureSet(id='uMBfdWzGMsFzFg0c1kH5', n=4, registry='core.Feature', hash='_1tVb4jVGQMeUa0HqRVx', updated_at=2023-08-28 14:18:13, modality_id='JZIEMJb5', created_by_id='DzTjkKse')

You can further annotate your feature set with a modality:

var_feature_set = file.features.get_feature_set("var")
modalities = ln.Modality.lookup()
var_feature_set.modality = modalities.rna
var_feature_set.save()

A less well-curated dataset

Transform

Let’s now consider a dataset with less well-curated features:

pbcm68k = ln.dev.datasets.anndata_pbmc68k_reduced()

We see that this dataset is indexed by gene symbols:

pbcm68k.var.index
Index(['HES4', 'TNFRSF4', 'SSU72', 'PARK7', 'RBP7', 'SRM', 'MAD2L2', 'AGTRAP',
       'TNFRSF1B', 'EFHD2',
       ...
       'ATP5O', 'MRPS6', 'TTC3', 'U2AF1', 'CSTB', 'SUMO3', 'ITGB2', 'S100B',
       'PRMT2', 'MT-ND3'],
      dtype='object', name='index', length=765)

Validate

validated = lb.Gene.validate(pbcm68k.var.index, lb.Gene.symbol)
💡 using global setting species = human
✅ 695 terms (90.80%) are validated for symbol
❗ 70 terms (9.20%) are not validated for symbol: ATPIF1, C1orf228, CCBL2, RP11-782C8.1, RP11-277L2.3, RP11-156E8.1, AC079767.4, GPX1, H1FX, SELT, ATP5I, IGJ, CCDC109B, FYB, H2AFY, FAM65B, HIST1H4C, HIST1H1E, ZNRD1, C6orf48, ...

In this case, we only want to register data with validated genes:

pbcm68k_validated = pbcm68k[:, validated].copy()
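
Indexing with `[:, validated]` keeps only the columns (genes) whose mask entry is true; here that removes the 70 non-validated symbols and leaves 765 - 70 = 695 genes. The same operation on a plain list-of-rows matrix looks roughly like this. The toy values and the assumption about which symbols validate are illustrative only, and the helper is hypothetical.

```python
def subset_columns(matrix, column_names, mask):
    """Keep only the columns whose mask entry is True."""
    kept_names = [name for name, keep in zip(column_names, mask) if keep]
    kept_rows = [
        [value for value, keep in zip(row, mask) if keep]
        for row in matrix
    ]
    return kept_rows, kept_names


# Toy count matrix: 2 cells x 3 genes. We assume here that only
# ATPIF1 fails validation (it appears in the non-validated list above).
toy_names = ["HES4", "ATPIF1", "SSU72"]
toy_mask = [True, False, True]
toy_counts = [
    [1, 0, 2],
    [0, 3, 4],
]

filtered_counts, filtered_names = subset_columns(toy_counts, toy_names, toy_mask)
print(filtered_names, filtered_counts)
```

The trailing `.copy()` in the notebook cell matters because slicing an AnnData object returns a view; copying gives an independent object that can be modified safely.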

Validate cell types:

# inspect shows none of the terms are mappable
lb.CellType.inspect(pbcm68k_validated.obs["cell_type"])

# here we search the cell type names from the public ontology and grab the top match
# then add the cell type names from the pbcm68k as synonyms
celltype_bt = lb.CellType.bionty()
ontology_ids = []
mapper = {}
for ct in pbcm68k_validated.obs["cell_type"].unique():
    ontology_id = celltype_bt.search(ct).iloc[0].ontology_id
    record = lb.CellType.from_bionty(ontology_id=ontology_id)
    mapper[ct] = record.name
    record.save()
    record.add_synonym(ct)

# standardize cell type names in the dataset
pbcm68k_validated.obs["cell_type"] = pbcm68k_validated.obs["cell_type"].map(mapper)
❗ received 9 unique terms, 61 empty/duplicated terms are ignored
❗ 9 terms (100.00%) are not validated for name: Dendritic cells, CD19+ B, CD4+/CD45RO+ Memory, CD8+ Cytotoxic T, CD4+/CD25 T Reg, CD14+ Monocytes, CD56+ NK, CD8+/CD45RA+ Naive Cytotoxic, CD34+
💡    couldn't validate 9 terms: CD8+/CD45RA+ Naive Cytotoxic, CD8+ Cytotoxic T, CD56+ NK, CD4+/CD45RO+ Memory, CD19+ B, Dendritic cells, CD34+, CD14+ Monocytes, CD4+/CD25 T Reg
💡 →  if you are sure, add records to your registry via .from_values()
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000451
💡 also saving parents of CellType(id='9JGbXeUA', name='dendritic cell', ontology_id='CL:0000451', description='A Cell Of Hematopoietic Origin, Typically Resident In Particular Tissues, Specialized In The Uptake, Processing, And Transport Of Antigens To Lymph Nodes For The Purpose Of Stimulating An Immune Response Via T Cell Activation. These Cells Are Lineage Negative (Cd3-Negative, Cd19-Negative, Cd34-Negative, And Cd56-Negative).', updated_at=2023-08-28 14:18:16, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000738
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='MkrH0gsX', name='leukocyte', ontology_id='CL:0000738', synonyms='white blood cell|leucocyte', description='An Achromatic Cell Of The Myeloid Or Lymphoid Lineages Capable Of Ameboid Movement, Found In Blood Or Other Tissue.', updated_at=2023-08-28 14:18:17, bionty_source_id='YNLz', created_by_id='DzTjkKse')
💡 also saving parents of CellType(id='9JGbXeUA', name='dendritic cell', ontology_id='CL:0000451', synonyms='Dendritic cells', description='A Cell Of Hematopoietic Origin, Typically Resident In Particular Tissues, Specialized In The Uptake, Processing, And Transport Of Antigens To Lymph Nodes For The Purpose Of Stimulating An Immune Response Via T Cell Activation. These Cells Are Lineage Negative (Cd3-Negative, Cd19-Negative, Cd34-Negative, And Cd56-Negative).', updated_at=2023-08-28 14:18:17, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0001087
💡 also saving parents of CellType(id='6VQXlWS7', name='effector memory CD4-positive, alpha-beta T cell, terminally differentiated', ontology_id='CL:0001087', synonyms='CD4-positive TEMRA|CD4+ TEMRA', description='A Cd4-Positive, Alpha Beta Memory T Cell With The Phenotype Cd45Ra-Positive, Cd45Ro-Negative, And Ccr7-Negative.', updated_at=2023-08-28 14:18:18, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 2 CellType records from Bionty matching ontology_id: CL:4030002, CL:0000897
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='ylUbqlrS', name='effector memory CD45RA-positive, alpha-beta T cell, terminally differentiated', ontology_id='CL:4030002', synonyms='terminally differentiated effector memory cells re-expressing CD45RA|terminally differentiated effector memory CD45RA+ T cells|TEMRA cell', description='An Alpha-Beta Memory T Cell With The Phenotype Cd45Ra-Positive.', updated_at=2023-08-28 14:18:19, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000791
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='WKpZjuYS', name='mature alpha-beta T cell', ontology_id='CL:0000791', synonyms='mature alpha-beta T-lymphocyte|mature alpha-beta T lymphocyte|mature alpha-beta T-cell', description='A Alpha-Beta T Cell That Has A Mature Phenotype.', updated_at=2023-08-28 14:18:19, bionty_source_id='YNLz', created_by_id='DzTjkKse')
💡 also saving parents of CellType(id='s6Ag7R5U', name='CD4-positive, alpha-beta memory T cell', ontology_id='CL:0000897', synonyms='CD4-positive, alpha-beta memory T-cell|CD4-positive, alpha-beta memory T-lymphocyte|CD4-positive, alpha-beta memory T lymphocyte', description='A Cd4-Positive, Alpha-Beta T Cell That Has Differentiated Into A Memory T Cell.', updated_at=2023-08-28 14:18:19, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000624
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='05vQoepH', name='CD4-positive, alpha-beta T cell', ontology_id='CL:0000624', synonyms='CD4-positive, alpha-beta T lymphocyte|CD4-positive, alpha-beta T-cell|CD4-positive, alpha-beta T-lymphocyte', description='A Mature Alpha-Beta T Cell That Expresses An Alpha-Beta T Cell Receptor And The Cd4 Coreceptor.', updated_at=2023-08-28 14:18:21, bionty_source_id='YNLz', created_by_id='DzTjkKse')
💡 also saving parents of CellType(id='6VQXlWS7', name='effector memory CD4-positive, alpha-beta T cell, terminally differentiated', ontology_id='CL:0001087', synonyms='CD4+ TEMRA|CD4-positive TEMRA|CD4+/CD45RO+ Memory', description='A Cd4-Positive, Alpha Beta Memory T Cell With The Phenotype Cd45Ra-Positive, Cd45Ro-Negative, And Ccr7-Negative.', updated_at=2023-08-28 14:18:21, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000910
💡 also saving parents of CellType(id='OxsmyL44', name='cytotoxic T cell', ontology_id='CL:0000910', synonyms='cytotoxic T lymphocyte|cytotoxic T-lymphocyte|cytotoxic T-cell', description='A Mature T Cell That Differentiated And Acquired Cytotoxic Function With The Phenotype Perforin-Positive And Granzyme-B Positive.', updated_at=2023-08-28 14:18:22, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000911
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='yvHkIrVI', name='effector T cell', ontology_id='CL:0000911', synonyms='effector T-lymphocyte|effector T-cell|effector T lymphocyte', description='A Differentiated T Cell With Ability To Traffic To Peripheral Tissues And Is Capable Of Mounting A Specific Immune Response.', updated_at=2023-08-28 14:18:23, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0002419
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='2C5PhwrW', name='mature T cell', ontology_id='CL:0002419', synonyms='mature T-cell|CD3e-positive T cell', description='A T Cell That Expresses A T Cell Receptor Complex And Has Completed T Cell Selection.', updated_at=2023-08-28 14:18:23, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000084
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='BxNjby0x', name='T cell', ontology_id='CL:0000084', synonyms='T-lymphocyte|T-cell|T lymphocyte', description='A Type Of Lymphocyte Whose Defining Characteristic Is The Expression Of A T Cell Receptor Complex.', updated_at=2023-08-28 14:18:25, bionty_source_id='YNLz', created_by_id='DzTjkKse')
💡 also saving parents of CellType(id='OxsmyL44', name='cytotoxic T cell', ontology_id='CL:0000910', synonyms='cytotoxic T-cell|CD8+ Cytotoxic T|cytotoxic T lymphocyte|cytotoxic T-lymphocyte', description='A Mature T Cell That Differentiated And Acquired Cytotoxic Function With The Phenotype Perforin-Positive And Granzyme-B Positive.', updated_at=2023-08-28 14:18:25, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000919
💡 also saving parents of CellType(id='ORD0dMdt', name='CD8-positive, CD25-positive, alpha-beta regulatory T cell', ontology_id='CL:0000919', synonyms='CD8+CD25+ Treg|CD8+CD25+ T-lymphocyte|CD8+CD25+ T(reg)|CD8+CD25+ T lymphocyte|CD8+CD25+ T cell|CD8-positive, CD25-positive Treg|CD8-positive, CD25-positive, alpha-beta regulatory T-lymphocyte|CD8-positive, CD25-positive, alpha-beta regulatory T-cell|CD8+CD25+ T-cell|CD8-positive, CD25-positive, alpha-beta regulatory T lymphocyte', description='A Cd8-Positive Alpha Beta-Positive T Cell With The Phenotype Foxp3-Positive And Having Suppressor Function.', updated_at=2023-08-28 14:18:25, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000795
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='oTsFrhYW', name='CD8-positive, alpha-beta regulatory T cell', ontology_id='CL:0000795', synonyms='CD8-positive, alpha-beta regulatory T-cell|CD8-positive, alpha-beta Treg|CD8-positive T(reg)|CD8-positive, alpha-beta regulatory T lymphocyte|CD8+ Treg|CD8+ T(reg)|CD8+ regulatory T cell|CD8-positive, alpha-beta regulatory T-lymphocyte|CD8-positive Treg', description='A Cd8-Positive, Alpha-Beta T Cell That Regulates Overall Immune Responses As Well As The Responses Of Other T Cell Subsets Through Direct Cell-Cell Contact And Cytokine Release.', updated_at=2023-08-28 14:18:26, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0000625
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 you can switch this off via: lb.settings.auto_save_parents = False
💡 also saving parents of CellType(id='VnKkQsME', name='CD8-positive, alpha-beta T cell', ontology_id='CL:0000625', synonyms='CD8-positive, alpha-beta T lymphocyte|CD8-positive, alpha-beta T-lymphocyte|CD8-positive, alpha-beta T-cell', description='A T Cell Expressing An Alpha-Beta T Cell Receptor And The Cd8 Coreceptor.', updated_at=2023-08-28 14:18:27, bionty_source_id='YNLz', created_by_id='DzTjkKse')
💡 also saving parents of CellType(id='ORD0dMdt', name='CD8-positive, CD25-positive, alpha-beta regulatory T cell', ontology_id='CL:0000919', synonyms='CD8+CD25+ T(reg)|CD8+CD25+ T-cell|CD8+CD25+ T lymphocyte|CD8-positive, CD25-positive Treg|CD8-positive, CD25-positive, alpha-beta regulatory T-cell|CD8-positive, CD25-positive, alpha-beta regulatory T lymphocyte|CD8+CD25+ T cell|CD8+CD25+ Treg|CD8-positive, CD25-positive, alpha-beta regulatory T-lymphocyte|CD8+CD25+ T-lymphocyte|CD4+/CD25 T Reg', description='A Cd8-Positive Alpha Beta-Positive T Cell With The Phenotype Foxp3-Positive And Having Suppressor Function.', updated_at=2023-08-28 14:18:27, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0002057
💡 also saving parents of CellType(id='O0AQiAuv', name='CD14-positive, CD16-negative classical monocyte', ontology_id='CL:0002057', synonyms='CD16-negative monocyte|CD16- monocyte', description='A Classical Monocyte That Is Cd14-Positive, Cd16-Negative, Cd64-Positive, Cd163-Positive.', updated_at=2023-08-28 14:18:28, bionty_source_id='YNLz', created_by_id='DzTjkKse')
💡 also saving parents of CellType(id='O0AQiAuv', name='CD14-positive, CD16-negative classical monocyte', ontology_id='CL:0002057', synonyms='CD16-negative monocyte|CD14+ Monocytes|CD16- monocyte', description='A Classical Monocyte That Is Cd14-Positive, Cd16-Negative, Cd64-Positive, Cd163-Positive.', updated_at=2023-08-28 14:18:28, bionty_source_id='YNLz', created_by_id='DzTjkKse')
✅ created 1 CellType record from Bionty matching ontology_id: CL:0002102
💡 also saving parents of CellType(id='Xkw89opD', name='CD38-negative naive B cell', ontology_id='CL:0002102', synonyms='CD38-negative naive B lymphocyte|CD38-negative naive B-cell|CD38- naive B-cell|CD38-negative naive B-lymphocyte|CD38- naive B lymphocyte|CD38- naive B-lymphocyte|CD38- naive B cell', description='A Cd38-Negative Naive B Cell Is A Mature B Cell That Has The Phenotype Cd38-Negative, Surface Igd-Positive, Surface Igm-Positive, And Cd27-Negative, That Has Not Yet Been Activated By Antigen In The Periphery.', updated_at=2023-08-28 14:18:30, bionty_source_id='YNLz', created_by_id='DzTjkKse')
💡 also saving parents of CellType(id='Xkw89opD', name='CD38-negative naive B cell', ontology_id='CL:0002102', synonyms='CD8+/CD45RA+ Naive Cytotoxic|CD38-negative naive B lymphocyte|CD38-negative naive B-cell|CD38- naive B cell|CD38-negative naive B-lymphocyte|CD38- naive B-lymphocyte|CD38- naive B lymphocyte|CD38- naive B-cell', description='A Cd38-Negative Naive B Cell Is A Mature B Cell That Has The Phenotype Cd38-Negative, Surface Igd-Positive, Surface Igm-Positive, And Cd27-Negative, That Has Not Yet Been Activated By Antigen In The Periphery.', updated_at=2023-08-28 14:18:30, bionty_source_id='YNLz', created_by_id='DzTjkKse')
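
The last step of the cell above replaces each raw label with the name of the record it was mapped to. A minimal sketch of that standardization on a plain list, assuming a raw-label-to-canonical-name dictionary like `mapper`. The two entries below are taken from the synonyms visible in the log; the helper function is hypothetical.

```python
def standardize(values, mapper):
    """Map raw labels to canonical names, failing loudly on unmapped labels."""
    missing = sorted({value for value in values if value not in mapper})
    if missing:
        raise KeyError(f"no mapping for: {missing}")
    return [mapper[value] for value in values]


# Two raw label -> canonical name pairs, as recorded in the log above.
toy_mapper = {
    "Dendritic cells": "dendritic cell",
    "CD14+ Monocytes": "CD14-positive, CD16-negative classical monocyte",
}
toy_labels = ["Dendritic cells", "CD14+ Monocytes", "Dendritic cells"]

standardized = standardize(toy_labels, toy_mapper)
print(standardized)
```

The explicit check is a design choice: pandas `Series.map` with a dictionary silently yields NaN for labels that are missing from it. Taking the top search hit is also only a heuristic. In the log above, for instance, "CD8+/CD45RA+ Naive Cytotoxic" became a synonym of "CD38-negative naive B cell" and "CD4+/CD25 T Reg" a synonym of "CD8-positive, CD25-positive, alpha-beta regulatory T cell", so reviewing the mapping before relying on it seems worthwhile.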

Now, all cell types are validated:

lb.CellType.validate(pbcm68k_validated.obs["cell_type"]);
✅ 9 terms (100.00%) are validated for name
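
To make the percentage in this message concrete, here is a minimal plain-Python sketch of what validating terms against a reference vocabulary amounts to: deduplicate the values, check each unique term for membership, and report the share that matched. This is not the lamindb implementation; the helpers `validate_terms` and `percent_validated` and the toy vocabulary are hypothetical and exist only for illustration.

```python
from __future__ import annotations

from typing import Iterable


def validate_terms(values: Iterable[str], vocabulary: set[str]) -> dict[str, bool]:
    """Map each unique term to whether it appears in the reference vocabulary."""
    # dict.fromkeys drops duplicates while keeping first-seen order
    unique_terms = list(dict.fromkeys(values))
    return {term: term in vocabulary for term in unique_terms}


def percent_validated(result: dict[str, bool]) -> float:
    """Share of unique terms that passed validation, in percent."""
    if not result:
        return 0.0
    return 100.0 * sum(result.values()) / len(result)


# Toy data (hypothetical): registered names and per-cell annotations
registry_names = {"cytotoxic T cell", "CD38-negative naive B cell"}
obs_cell_type = [
    "cytotoxic T cell",
    "cytotoxic T cell",
    "CD38-negative naive B cell",
]

result = validate_terms(obs_cell_type, registry_names)
n_valid = sum(result.values())
print(f"{n_valid} terms ({percent_validated(result):.2f}%) are validated")
```

The count refers to unique terms rather than cells, so a column with tens of thousands of rows can still report only nine terms.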

Register #

file = ln.File.from_anndata(
    pbcm68k_validated, description="10x reference pbmc68k", var_ref=lb.Gene.symbol
)
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/RAdyFo8MWTzVkqkFCK8T.h5ad')
💡 parsing feature names of X stored in slot 'var'
💡    using global setting species = human
✅    695 terms (100.00%) are validated for symbol
💡    using global setting species = human
✅    linked: FeatureSet(id='XMlfEarupsd9OFd0NGV1', n=695, type='float', registry='bionty.Gene', hash='W4ps_86b5dxk2Wd1gWTo', created_by_id='DzTjkKse')
💡 parsing feature names of slot 'obs'
✅    1 term (25.00%) is validated for name
❗    3 terms (75.00%) are not validated for name: n_genes, percent_mito, louvain
✅    linked: FeatureSet(id='y7r7SCPwKPLeDhv49YZg', n=1, registry='core.Feature', hash='FL9XYnterVB4xhb6qqdq', modality_id='JZIEMJb5', created_by_id='DzTjkKse')
file.save()
✅ saved 2 feature sets for slots: 'var','obs'
✅ storing file 'RAdyFo8MWTzVkqkFCK8T' at '.lamindb/RAdyFo8MWTzVkqkFCK8T.h5ad'
var_feature_set = file.features.get_feature_set("var")
var_feature_set.modality = modalities.rna
var_feature_set.save()
cell_types = lb.CellType.from_values(pbcm68k_validated.obs["cell_type"], "name")
file.add_labels(cell_types, "cell_type")
file.add_labels(lb.settings.species, feature="species")
file.add_labels(scrna, feature="assay")
✅ loaded: FeatureSet(id='Bhy6FLtgn8b0lidPS6jq', n=1, registry='core.Feature', hash='-LCX9BhJFpMxaaKt2TFF', updated_at=2023-08-28 14:18:13, modality_id='JZIEMJb5', created_by_id='DzTjkKse')
✅ linked new feature 'species' together with new feature set FeatureSet(id='Bhy6FLtgn8b0lidPS6jq', n=1, registry='core.Feature', hash='-LCX9BhJFpMxaaKt2TFF', updated_at=2023-08-28 14:18:31, modality_id='JZIEMJb5', created_by_id='DzTjkKse')
💡 no file links to it anymore, deleting feature set FeatureSet(id='Bhy6FLtgn8b0lidPS6jq', n=1, registry='core.Feature', hash='-LCX9BhJFpMxaaKt2TFF', updated_at=2023-08-28 14:18:31, modality_id='JZIEMJb5', created_by_id='DzTjkKse')
✅ linked new feature 'assay' together with new feature set FeatureSet(id='Sj4jZVKWvHH2gFOY2J34', n=2, registry='core.Feature', hash='7dJ8cWRsVYr3yWCJuKhJ', updated_at=2023-08-28 14:18:31, modality_id='JZIEMJb5', created_by_id='DzTjkKse')
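
The messages above show the external feature set being swapped when a second label feature is added: the set with n=1 is dropped and a set with n=2 and a different hash takes its place. One way to read this is that a feature set is identified by its members, so changing the members yields a different set. The following is a sketch under that assumption only. The hashing scheme and the names `feature_set_hash`, `get_or_create`, and `registry` are hypothetical and are not taken from lamindb.

```python
import base64
import hashlib


def feature_set_hash(feature_names):
    """Order-independent fingerprint of a collection of feature names.

    Assumption for illustration: names are deduplicated, sorted, joined by
    newlines, hashed with SHA-256, and encoded as URL-safe base64.
    """
    canonical = "\n".join(sorted(set(feature_names)))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")[:20]


# Hypothetical in-memory registry: hash -> sorted member names
registry = {}


def get_or_create(feature_names):
    """Return (hash, created); created is False if an equal set already exists."""
    key = feature_set_hash(feature_names)
    created = key not in registry
    if created:
        registry[key] = sorted(set(feature_names))
    return key, created


h1, created1 = get_or_create(["species"])           # first set, one member
h2, created2 = get_or_create(["species", "assay"])  # adding a member gives a new hash
h3, created3 = get_or_create(["assay", "species"])  # same members, same hash

print(h1 != h2, h2 == h3, created3)
```

Under such a scheme, two files annotated with the same features would share one feature set record, which is what makes feature sets useful for finding comparable datasets.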
file.features
'var': FeatureSet(id='XMlfEarupsd9OFd0NGV1', n=695, type='float', registry='bionty.Gene', hash='W4ps_86b5dxk2Wd1gWTo', updated_at=2023-08-28 14:18:31, modality_id='bYfmTzpe', created_by_id='DzTjkKse')
'obs': FeatureSet(id='y7r7SCPwKPLeDhv49YZg', n=1, registry='core.Feature', hash='FL9XYnterVB4xhb6qqdq', updated_at=2023-08-28 14:18:31, modality_id='JZIEMJb5', created_by_id='DzTjkKse')
'external': FeatureSet(id='Sj4jZVKWvHH2gFOY2J34', n=2, registry='core.Feature', hash='7dJ8cWRsVYr3yWCJuKhJ', updated_at=2023-08-28 14:18:31, modality_id='JZIEMJb5', created_by_id='DzTjkKse')
file.describe()
💡 File(id='RAdyFo8MWTzVkqkFCK8T', key=None, suffix='.h5ad', accessor='AnnData', description='10x reference pbmc68k', version=None, size=589484, hash='eKVXV5okt5YRYjySMTKGEw', hash_type='md5', created_at=2023-08-28 14:18:31, updated_at=2023-08-28 14:18:31)

Provenance:
    🗃️ storage: Storage(id='ljWPEsjj', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-08-28 14:17:23, created_by_id='DzTjkKse')
    💫 transform: Transform(id='Nv48yAceNSh8z8', name='Validate & register scRNA-seq datasets', short_name='scrna', version='0', type=notebook, updated_at=2023-08-28 14:18:31, created_by_id='DzTjkKse')
    👣 run: Run(id='p3vdxLrlEIijgVQLdUd0', run_at=2023-08-28 14:17:25, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
    👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 14:17:22)
Features:
  var (X):
    🔗 index (695, bionty.Gene.id): ['asa6P3SWGqBF', 'sOu1hW4id709', 'mLZxpATriwGh', 'yo4j3UPxzM21', 'z4HRihQZPQ11'...]
  external:
    🔗 assay (1, bionty.ExperimentalFactor): ['single-cell RNA sequencing']
    🔗 species (1, bionty.Species): ['human']
  obs (metadata):
    🔗 cell_type (9, bionty.CellType): ['cytotoxic T cell', 'CD38-negative naive B cell', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'CD16-positive, CD56-dim natural killer cell, human', 'B cell, CD19-positive']
file.view_lineage()
https://d33wubrfki0l68.cloudfront.net/1d11a6b62481e4ee24a8869191134c50975838e4/309c6/_images/16dd9e25d96d0da4d6d5bd10f60e0ecd757287247e375ea1820d09b86fa4a003.svg

🎉 Now let’s continue with data integration: Integrate scRNA-seq datasets