Validate & register multi-modal data
!lamin init --storage ./test-multimodal --schema bionty
💡 creating schemas: core==0.46.1 bionty==0.30.0
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 14:20:31)
✅ saved: Storage(id='A2Of0PD5', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal', type='local', updated_at=2023-08-28 14:20:31, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/test-multimodal
💡 did not register local instance on hub (if you want, call `lamin register`)
import lamindb as ln
import lnschema_bionty as lb
lb.settings.species = "human"
ln.settings.verbosity = 3
✅ loaded instance: testuser1/test-multimodal (lamindb 0.51.0)
✅ set species: Species(id='uHJU', name='human', taxon_id=9606, scientific_name='homo_sapiens', updated_at=2023-08-28 14:20:33, bionty_source_id='eG0H', created_by_id='DzTjkKse')
ln.track()
💡 notebook imports: lamindb==0.51.0 lnschema_bionty==0.30.0
✅ saved: Transform(id='yMWSFirS6qv2z8', name='Validate & register multi-modal data', short_name='multimodal', version='0', type=notebook, updated_at=2023-08-28 14:20:33, created_by_id='DzTjkKse')
✅ saved: Run(id='O2eoUcQNFBJUGnIoHl2r', run_at=2023-08-28 14:20:33, transform_id='yMWSFirS6qv2z8', created_by_id='DzTjkKse')
MuData object
Let’s use a MuData object:
mdata = ln.dev.datasets.mudata_papalexi21_subset()
mdata
MuData object with n_obs × n_vars = 200 × 300
  var: 'name'
  4 modalities
    rna: 200 x 173
      obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var: 'name'
    adt: 200 x 4
      obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var: 'name'
    hto: 200 x 12
      obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var: 'name'
    gdo: 200 x 111
      obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var: 'name'
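As a quick sanity check on the summary above, the per-modality variable counts add up to the global n_vars. Below is a minimal sketch in plain Python, with the shapes copied by hand from the printed representation. The `modality_shapes` dict is an illustrative name, not a MuData attribute.

```python
# Shapes (n_obs, n_vars) copied from the MuData summary above.
modality_shapes = {
    "rna": (200, 173),
    "adt": (200, 4),
    "hto": (200, 12),
    "gdo": (200, 111),
}

# Collect the distinct observation counts across modalities.
n_obs = {shape[0] for shape in modality_shapes.values()}
# Sum the per-modality variable counts.
n_vars_total = sum(shape[1] for shape in modality_shapes.values())

print(n_obs, n_vars_total)
```

All four modalities report the same 200 observations, and the variable counts sum to the 300 shown in the first line of the summary.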
First we register the file:
file = ln.File(
    "papalexi21_subset.h5mu", description="Sub-sampled MuData from Papalexi21"
)
file.save()
✅ storing file '0FolQmW4RAVl0GKD7OcG' at '.lamindb/0FolQmW4RAVl0GKD7OcG.h5mu'
Register features
Now let’s register the 3 feature sets this data contains:
rna
adt
obs (metadata)
modalities
For the two modalities rna and adt, we use bionty tables as the reference:
mdata["rna"].var_names[:5]
Index(['RP5-827C21.6', 'XX-CR54.1', 'SH2D6', 'RP11-379B18.5', 'RP11-778D9.12'], dtype='object', name='index')
lb.Gene.validate(mdata["rna"].var_names, lb.Gene.symbol);
💡 using global setting species = human
❗ 173 terms (100.00%) are not validated for symbol: RP5-827C21.6, XX-CR54.1, SH2D6, RP11-379B18.5, RP11-778D9.12, RP11-703G6.1, AC005150.1, RP11-717H13.1, CTC-498J12.1, CTC-467M3.1, ARHGAP26-AS1, GABRA1, HIST1H4K, HLA-DQB1-AS1, RP11-524H19.2, SPACA1, VNN1, AC006042.7, AC002066.1, AC073934.6, ...
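All 173 symbols fail validation at this point. That is consistent with validation checking values against Gene records already saved in this instance, which was only just initialized. The sketch below mimics that behavior in plain Python. `toy_validate` and `registry` are illustrative stand-ins and not part of lamindb.

```python
def toy_validate(values, registry):
    """Return one boolean per value: True if the value is already registered."""
    return [value in registry for value in values]


var_names = ["SH2D6", "GABRA1", "RP5-827C21.6"]

# A freshly initialized registry is empty, so nothing validates.
registry = set()
before = toy_validate(var_names, registry)

# After saving records for two of the symbols, those two validate.
registry.update({"SH2D6", "GABRA1"})
after = toy_validate(var_names, registry)

# Inverting the mask selects the values that still need attention.
not_validated = [name for name, ok in zip(var_names, after) if not ok]
print(before, after, not_validated)
```

The same pattern, a boolean mask that is inverted to pick out the remaining names, is what the notebook relies on further down when it creates records for the non-validated symbols.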
genes = lb.Gene.from_values(mdata["rna"].var_names, lb.Gene.symbol)
ln.save(genes)
💡 using global setting species = human
✅ created 77 Gene records from Bionty matching symbol: SH2D6, ARHGAP26-AS1, GABRA1, HLA-DQB1-AS1, SPACA1, VNN1, CTAGE15, PFKFB1, TRPC5, RBPMS-AS1, CA8, CSMD3, ZNF483, AK8, TMEM72-AS1, ARAP1-AS2, CRYAB, HOXC-AS2, LRRIQ1, TUBA3C, ...
✅ created 12 Gene records from Bionty matching synonyms: CTC-467M3.1, HIST1H4K, CASC1, LARGE, NBPF16, C1orf65, IBA57-AS1, KIAA1239, TMEM75, AP003419.16, FAM65C, C14orf177
❗ ambiguous validation in Bionty for 6 records: HLA-DQB1-AS1, CTAGE15, CTRB2, LGALS9C, PCDHB11, TBC1D3G
❗ did not create Gene records for 84 non-validated symbols: AC002066.1, AC004019.13, AC005150.1, AC006042.7, AC011558.5, AC026471.6, AC073934.6, AC091132.1, AC092295.4, AC092687.5, AE000662.93, AL132989.1, AP000442.4, CTA-373H7.7, CTB-134F13.1, CTB-31O20.9, CTC-498J12.1, CTD-2562J17.2, CTD-3012A18.1, CTD-3065B20.2, ...
mdata["rna"].var_names = lb.Gene.standardize(mdata["rna"].var_names, lb.Gene.symbol)
💡 using global setting species = human
💡 standardized 89/173 terms
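Standardizing replaces a synonym with the symbol under which the record is registered. Here is a toy sketch of that idea, assuming a simple synonym lookup table. `toy_standardize` is a hypothetical helper, and the single MARCH8 to MARCHF8 entry is taken from the gene_target output further down this page.

```python
def toy_standardize(values, synonym_to_symbol):
    """Map known synonyms to their canonical symbol; pass other values through."""
    return [synonym_to_symbol.get(value, value) for value in values]


# Illustrative synonym table with a single entry.
synonym_to_symbol = {"MARCH8": "MARCHF8"}

names = ["MARCH8", "STAT1", "RP5-827C21.6"]
standardized = toy_standardize(names, synonym_to_symbol)

# Count how many names were actually rewritten.
n_changed = sum(old != new for old, new in zip(names, standardized))
print(standardized, n_changed)
```

Names without a known synonym pass through unchanged, so they still have to be handled separately.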
validated = lb.Gene.validate(mdata["rna"].var_names, lb.Gene.symbol)
💡 using global setting species = human
✅ 89 terms (51.40%) are validated for symbol
❗ 84 terms (48.60%) are not validated for symbol: RP5-827C21.6, XX-CR54.1, RP11-379B18.5, RP11-778D9.12, RP11-703G6.1, AC005150.1, RP11-717H13.1, CTC-498J12.1, RP11-524H19.2, AC006042.7, AC002066.1, AC073934.6, RP11-268G12.1, U52111.14, RP11-235C23.5, RP11-12J10.3, RP11-324E6.9, RP11-187A9.3, RP11-365N19.2, RP11-346D14.1, ...
new_genes = [
    lb.Gene(symbol=symbol, species=lb.settings.species)
    for symbol in mdata["rna"].var_names[~validated]
]
ln.save(new_genes)
lb.Gene.validate(mdata["rna"].var_names, lb.Gene.symbol);
💡 using global setting species = human
✅ 173 terms (100.00%) are validated for symbol
feature_set_rna = ln.FeatureSet.from_values(
    mdata["rna"].var_names, field=lb.Gene.symbol
)
💡 using global setting species = human
✅ 173 terms (100.00%) are validated for symbol
💡 using global setting species = human
mdata["adt"].var_names
Index(['CD86', 'PDL1', 'PDL2', 'CD366'], dtype='object', name='index')
lb.CellMarker.validate(mdata["adt"].var_names, field=lb.CellMarker.name);
💡 using global setting species = human
❗ 4 terms (100.00%) are not validated for name: CD86, PDL1, PDL2, CD366
markers = lb.CellMarker.from_values(mdata["adt"].var_names, field=lb.CellMarker.name)
ln.save(markers)
💡 using global setting species = human
✅ created 4 CellMarker records from Bionty matching name: CD86, PDL1, PDL2, CD366
lb.CellMarker.validate(mdata["adt"].var_names, field=lb.CellMarker.name);
💡 using global setting species = human
✅ 4 terms (100.00%) are validated for name
feature_set_adt = ln.FeatureSet.from_values(
    mdata["adt"].var_names, field=lb.CellMarker.name
)
💡 using global setting species = human
✅ 4 terms (100.00%) are validated for name
💡 using global setting species = human
Link both feature sets to the file:
file.features.add_feature_set(feature_set_rna, slot="rna")
file.features.add_feature_set(feature_set_adt, slot="adt")
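A slot names the part of the object that a feature set describes, so one file can carry several feature sets side by side. Below is a toy model of that bookkeeping, which assumes nothing about lamindb internals. The `ToyFile` class is hypothetical.

```python
class ToyFile:
    """Hypothetical stand-in for a file record that holds feature sets by slot."""

    def __init__(self, description):
        self.description = description
        self.feature_sets = {}

    def add_feature_set(self, features, slot):
        # Store an immutable copy so later edits to the input do not leak in.
        self.feature_sets[slot] = frozenset(features)


toy_file = ToyFile(description="Sub-sampled MuData from Papalexi21")
toy_file.add_feature_set(["CD86", "PDL1", "PDL2", "CD366"], slot="adt")
toy_file.add_feature_set(["SH2D6", "GABRA1"], slot="rna")  # truncated for brevity

print(sorted(toy_file.feature_sets))
print(len(toy_file.feature_sets["adt"]))
```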
metadata
The third feature set is the obs metadata:
obs = mdata["rna"].obs
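We’re only interested in a single metadata column, gene_target, so we register it as a categorical feature before registering features for the remaining obs columns: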
ln.Feature(name="gene_target", type="category").save()
features = ln.Feature.from_df(obs)
ln.save(features)
feature_set_obs = ln.FeatureSet.from_df(obs)
✅ 19 terms (100.00%) are validated for name
file.features.add_feature_set(feature_set_obs, slot="obs")
gene_targets = lb.Gene.from_values(obs["gene_target"], lb.Gene.symbol)
ln.save(gene_targets)
file.add_labels(gene_targets, feature="gene_target")
💡 using global setting species = human
✅ created 23 Gene records from Bionty matching symbol: IFNGR1, CAV1, IRF7, ATF2, NFKBIA, STAT1, SPI1, JAK2, STAT2, IFNGR2, CD86, STAT5A, SMAD4, ETV7, IRF1, UBE2L6, PDCD1LG2, BRD4, POU2F2, STAT3, ...
✅ created 1 Gene record from Bionty matching synonyms: MARCH8
❗ ambiguous validation in Bionty for 4 records: MARCHF8, IRF7, IFNGR2, TNFRSF14
❗ did not create Gene record for 1 non-validated symbol: NT
✅ linked feature 'gene_target' to registry 'bionty.Gene'
nt = ln.Label(name="NT", description="Non-targeting control of perturbations")
nt.save()
file.add_labels(nt, feature="gene_target")
✅ linked feature 'gene_target' to registry 'core.Label'
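The gene_target column mixes two kinds of values: gene symbols and the NT control. That is why the feature ends up linked to both bionty.Gene and core.Label. Here is a plain-Python sketch of that split, using a short hypothetical excerpt of the column rather than the real 200 rows.

```python
# Hypothetical excerpt of obs["gene_target"], for illustration only.
gene_target = ["IFNGR1", "NT", "STAT1", "NT", "CAV1", "STAT1"]

# Symbols assumed to be present in the gene registry for this illustration.
gene_symbols = {"IFNGR1", "STAT1", "CAV1"}

unique_values = sorted(set(gene_target))
# Values found in the gene registry become gene labels.
gene_labels = [v for v in unique_values if v in gene_symbols]
# Everything else needs a generic label, here only the NT control.
other_labels = [v for v in unique_values if v not in gene_symbols]

print(gene_labels)
print(other_labels)
```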
for col in ["orig.ident", "perturbation", "replicate", "Phase", "guide_ID"]:
    labels = [ln.Label(name=name) for name in obs[col].unique()]
    ln.save(labels)
✅ loaded record with exact same name
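The message above suggests that a label with the same name already existed and was loaded instead of being created again. One plausible reading is a get-or-create lookup by name, sketched below. The `ToyLabelRegistry` class and the column values are hypothetical and were not checked against the dataset.

```python
class ToyLabelRegistry:
    """Hypothetical registry that stores at most one label per name."""

    def __init__(self):
        self.labels = {}
        self.loaded = []  # names that were found instead of created

    def get_or_create(self, name):
        if name in self.labels:
            self.loaded.append(name)
        else:
            self.labels[name] = {"name": name}
        return self.labels[name]


registry = ToyLabelRegistry()
registry.get_or_create("NT")  # saved earlier as the non-targeting control

# Hypothetical unique values per column, for illustration only.
columns = {
    "perturbation": ["NT", "Perturbed"],
    "Phase": ["G1", "S", "G2M"],
}
for col, values in columns.items():
    for name in values:
        registry.get_or_create(name)

print(sorted(registry.labels))
print(registry.loaded)
```

In this toy version the second request for "NT" returns the existing entry, so the registry never holds two labels with the same name.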
These labels are saved to the Label registry, but because none of them seem like something we’d want to validate, we don’t link them to the file.
file.features
'rna': FeatureSet(id='jwvaklwWosUsIWQdW9GF', n=184, type='float', registry='bionty.Gene', hash='Y8lsRtXCZKyPPberKAF0', updated_at=2023-08-28 14:20:40, created_by_id='DzTjkKse')
'adt': FeatureSet(id='51308AA42BvtqxKriJ9D', n=4, type='float', registry='bionty.CellMarker', hash='b-CtyjgPRO0WN27lTOqC', updated_at=2023-08-28 14:20:40, created_by_id='DzTjkKse')
'obs': FeatureSet(id='nOhwfDdSBAhmnFtJ2Foe', n=19, registry='core.Feature', hash='pITesqNBIdm5N31moXeX', updated_at=2023-08-28 14:20:41, created_by_id='DzTjkKse')
file.describe()
💡 File(id='0FolQmW4RAVl0GKD7OcG', key=None, suffix='.h5mu', accessor='MuData', description='Sub-sampled MuData from Papalexi21', version=None, size=606320, hash='RaivS3NesDOP-6kNIuaC3g', hash_type='md5', created_at=2023-08-28 14:20:34, updated_at=2023-08-28 14:20:34)
Provenance:
🗃️ storage: Storage(id='A2Of0PD5', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal', type='local', updated_at=2023-08-28 14:20:31, created_by_id='DzTjkKse')
💫 transform: Transform(id='yMWSFirS6qv2z8', name='Validate & register multi-modal data', short_name='multimodal', version='0', type=notebook, updated_at=2023-08-28 14:20:34, created_by_id='DzTjkKse')
👣 run: Run(id='O2eoUcQNFBJUGnIoHl2r', run_at=2023-08-28 14:20:33, transform_id='yMWSFirS6qv2z8', created_by_id='DzTjkKse')
👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 14:20:31)
Features:
adt:
🔗 index (4, bionty.CellMarker.id): ['82nG0xqSuEQD', 'kbrA7wdDuqDK', 'BK30rjK34sZd', 'L0m6f7FPiDeg'...]
rna:
🔗 index (184, bionty.Gene.id): ['3rBXeqBZjKfa', 'T7cV1WJ0B5w3', 'Aw7gaRitzAXN', 'Icltcl1hHDi2', 'nsBNCWQPmYZq'...]
obs (metadata):
🔗 gene_target (bionty.Gene|core.Label)
🔗 gene_target (28, bionty.Gene): ['ETV7', 'STAT3', 'CUL3', 'CAV1', 'MARCHF8']
🔗 gene_target (1, core.Label): ['NT']
file.view_lineage()
!lamin delete --force test-multimodal
!rm -r test-multimodal
💡 deleting instance testuser1/test-multimodal
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-multimodal.env
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
❗ consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal