Chaining Search Terms

Before we do any work, we need to import several functions from cdapython:

Q and query which power the search
columns which lets us view entity field names
unique_terms which lets us view entity field contents

We're also asking cdapython to report its version so we can be sure we're using the one we mean to.

In [1]:

            
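# Q and query power the search; columns and unique_terms inspect fields and their values
from cdapython import Q, columns, unique_terms, query
import cdapython
# Report the installed version so we can confirm it is the one we expect
print(cdapython.__version__)
# Point Q at the CDA server used in this tutorial
Q.set_host_url("http://35.192.60.10:8080/")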

2022.6.22

The CDA provides a custom Python tool for searching CDA data. Q (short for Query) offers several ways to search and filter data, and several input modes:

Q.run() returns data for the specified search
Q.count() returns summary information (counts) for data that fit the specified search

Endpoint Chaining

We're going to build on our previous search to see what information exists about cancers that were first diagnosed in the brain.

In [2]:

            
myquery = Q('ResearchSubject.primary_diagnosis_site = "brain"')

Previously we looked at subject, research_subject, specimen and file results separately, but we can also combine these.

Let's say what we're really interested in is finding analyses done on specimens, so we're looking for files that belong to specimens that match our search. If we search the file endpoint directly, we'll get back all files that meet our search criteria, regardless of whether they belong to specimens specifically. Instead, we can chain our query to the specimen endpoint and then to the file endpoint and get the combined result:

In [3]:

            
myqueryspecimenfiles =  myquery.specimen.file.run()
myqueryspecimenfiles

Getting results from database

Total execution time: 4028 ms

Out[3]:

            QueryID: bbbaa1a9-ead3-41b2-83c9-2538464ced59
            
            Offset: 0
            Count: 100
            Total Row Count: 405031
            More pages: True
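
The summary above reports only the first page of results. As a rough sketch of what "More pages: True" implies, the arithmetic below estimates how many pages the full result spans. It assumes every page except possibly the last holds the same number of rows as this one, which the output does not state explicitly.

```python
import math

# Values copied from the Out[3] summary above.
count_on_page = 100        # "Count": rows returned on this page
total_row_count = 405031   # "Total Row Count": rows matching the chained query

# Assumption: every page except possibly the last holds `count_on_page` rows.
pages = math.ceil(total_row_count / count_on_page)
rows_on_last_page = total_row_count - (pages - 1) * count_on_page

print(f"pages: {pages}, rows on last page: {rows_on_last_page}")
```

Under that assumption, the result spans 4051 pages, with 31 rows on the final one.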

The Total Row Count shows 405031 files that belong to specimens that meet our search criteria; the first 100 are returned on this page. As before, we can preview the results by using the .to_dataframe() function:

In [4]:

            
myqueryspecimenfiles.to_dataframe()

Out[4]:

	id	identifier	label	data_category	data_type	file_format	associated_project	drs_uri	byte_size	checksum	data_modality	imaging_modality	dbgap_accession_number	subject_id	researchsubject_id	researchsubject_specimen_id
0	06668371-2079-4471-a06a-7ad6f987a96c	[{'system': 'GDC', 'value': '06668371-2079-447...	5861e7d6-2066-40d7-8868-c14a69255304.genie.ali...	Simple Nucleotide Variation	Masked Annotated Somatic Mutation	MAF	GENIE-NKI	drs://dg.4DFC:06668371-2079-4471-a06a-7ad6f987...	1052	e8d1a8534b9fe7a813f1b10a8a706daf	Genomic	None	None	GENIE-NKI-8IN3	54ada9c7-e8f7-4ecc-b829-903be7e3ad82	922c4731-aa49-4d6b-bbea-9a98de73438a
1	094daeed-db10-41fc-90be-ece59d235896	[{'system': 'GDC', 'value': '094daeed-db10-41f...	5a983470-47a1-4944-8530-c96ca794b266.genie.ali...	Simple Nucleotide Variation	Masked Annotated Somatic Mutation	MAF	GENIE-GRCC	drs://dg.4DFC:094daeed-db10-41fc-90be-ece59d23...	1843	40ad96e30309e253fb23f2449a70b6b8	Genomic	None	None	GENIE-GRCC-9b65f377	b96be1f4-f89c-418a-8dcf-5e3b6d64f50f	dd06356c-a84b-533a-b684-1b8e3f4915ae
2	0e08b693-1cc5-4b91-a970-f3f588a15cb8	[{'system': 'GDC', 'value': '0e08b693-1cc5-4b9...	c046dcd7-30ac-4d93-bfda-ee5f947b2652.genie.ali...	Simple Nucleotide Variation	Masked Annotated Somatic Mutation	MAF	GENIE-MDA	drs://dg.4DFC:0e08b693-1cc5-4b91-a970-f3f588a1...	3145	482f386bf108ab7cb6aff9afd96cc142	Genomic	None	None	GENIE-MDA-4319	531c5404-92c5-471b-9c7e-e3bd62dafa63	baacb261-4c7e-4a0d-bdd4-635eb20d4bb8
3	10e9c90a-bfd6-44ca-b6da-56203af61747	[{'system': 'GDC', 'value': '10e9c90a-bfd6-44c...	7ced15a9-4508-4c92-8c76-e20e56d1ea07.genie.ali...	Copy Number Variation	Gene Level Copy Number Scores	TSV	GENIE-MSK	drs://dg.4DFC:10e9c90a-bfd6-44ca-b6da-56203af6...	22932	352d5b9a3ea0b7edd466a6f60053b99e	Genomic	None	None	GENIE-MSK-P-0003830	2f6299e4-48b2-4c53-a6d3-2653abc4ca32	c9129afe-4202-547d-afca-5f9dab738d9b
4	177aefb8-32f4-404b-b811-dba5c83e0e5b	[{'system': 'GDC', 'value': '177aefb8-32f4-404...	TCGA-HT-7474-01A-21-A29Y-20_RPPA_data.tsv	Proteome Profiling	Protein Expression Quantification	TSV	TCGA-LGG	drs://dg.4DFC:177aefb8-32f4-404b-b811-dba5c83e...	21974	b48345a41d455fb506f2a9c74fd914a1	Genomic	None	None	TCGA-HT-7474	72aa812d-5daa-4bd7-9028-ec541b1f25bd	335f4b6b-c4b2-4624-81f2-8fc568d10f16
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
95	74e4f676-380d-4d13-a8eb-4faa006a6ce0	[{'system': 'GDC', 'value': '74e4f676-380d-4d1...	seurat.analysis.tsv	Transcriptome Profiling	Single Cell Analysis	TSV	CPTAC-3	drs://dg.4DFC:74e4f676-380d-4d13-a8eb-4faa006a...	2907801	64db788508e60bb12e417bf8b2bc5163	Genomic	None	phs001287	C3N-02190	805792e9-858e-4e9e-84cf-3578f7470889	c6d4eec2-5d13-53ec-91eb-fd52ada341f8
96	7e3ec592-40b9-4134-a5a6-5fa1385d488f	[{'system': 'GDC', 'value': '7e3ec592-40b9-413...	TCGA-DU-A5TU-01A-21-A29Z-20_RPPA_data.tsv	Proteome Profiling	Protein Expression Quantification	TSV	TCGA-LGG	drs://dg.4DFC:7e3ec592-40b9-4134-a5a6-5fa1385d...	22034	2e7141d1166c24c9a9d11a4b35d34c51	Genomic	None	None	TCGA-DU-A5TU	1c095c4a-df97-402e-902c-fc83bcf8ef88	c20e4e8b-af67-40db-81b5-b8a8976ab9a1
97	9148502d-cb91-41c6-b382-62639010ff4e	[{'system': 'GDC', 'value': '9148502d-cb91-41c...	e1c60464-3f4f-400b-bbfa-0e96f826d901.genie.ali...	Simple Nucleotide Variation	Masked Annotated Somatic Mutation	MAF	GENIE-MSK	drs://dg.4DFC:9148502d-cb91-41c6-b382-62639010...	3825	20d837abab0ffa04e06e5de5f35108e8	Genomic	None	None	GENIE-MSK-P-0002799	b018ac46-9fbb-4f68-8abb-0ec1116903b4	8c3b2317-5b60-5871-aba4-7eddd1bc727a
98	94438748-a273-4673-aec4-687a6c1051e3	[{'system': 'GDC', 'value': '94438748-a273-467...	TCGA-76-6656-01A-13-1900-20_RPPA_data.tsv	Proteome Profiling	Protein Expression Quantification	TSV	TCGA-GBM	drs://dg.4DFC:94438748-a273-4673-aec4-687a6c10...	24025	1aedf8b68a4ac55c6f736acaaa74ee9d	Genomic	None	None	TCGA-76-6656	770aa1ee-aed9-4219-900e-63523cdf312f	ed2a500e-f6db-48d2-ade2-9e8f8ed004dc
99	a17f7c1a-da9e-4876-b5fc-63d683ccbb7e	[{'system': 'GDC', 'value': 'a17f7c1a-da9e-487...	seurat.deg.tsv	Transcriptome Profiling	Differential Gene Expression	TSV	CPTAC-3	drs://dg.4DFC:a17f7c1a-da9e-4876-b5fc-63d683cc...	278571	a986dfb0a6f9d5ef35b6f3c05208677f	Genomic	None	phs001287	C3N-02190	805792e9-858e-4e9e-84cf-3578f7470889	950a9f78-2d18-47fa-bb63-fef7197e08c9

100 rows × 16 columns
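
A quick way to get a feel for a preview like this is to tally one of its columns. The sketch below is illustrative only: it hard-codes the data_category values of the ten rows visible above (rows 0-4 and 95-99) rather than reading the real DataFrame, and uses only the standard library.

```python
from collections import Counter

# data_category values of the ten rows visible in the preview above.
visible_categories = [
    "Simple Nucleotide Variation",  # row 0
    "Simple Nucleotide Variation",  # row 1
    "Simple Nucleotide Variation",  # row 2
    "Copy Number Variation",        # row 3
    "Proteome Profiling",           # row 4
    "Transcriptome Profiling",      # row 95
    "Proteome Profiling",           # row 96
    "Simple Nucleotide Variation",  # row 97
    "Proteome Profiling",           # row 98
    "Transcriptome Profiling",      # row 99
]

# Count how often each category appears, most frequent first.
category_counts = Counter(visible_categories)
for category, n in category_counts.most_common():
    print(f"{category}: {n}")
```

On the full DataFrame, calling pandas' value_counts() on the data_category column should give the same kind of tally across all 100 rows.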

Valid Endpoint Chains

Not all endpoints can be chained together. This restriction comes from the data itself: `diagnosis` and `treatment` records do not have files directly attached to them; their files are instead associated with the `researchsubject`. As a result, both `myquery.treatment.file.run()` and `myquery.diagnosis.file.run()` will fail, as there are no files to retrieve. Valid chains are:

myquery.subject.file.run(): This will return all the files that meet the query and that are directly tied to subject
myquery.researchsubject.file.run(): This will return all the files that meet the query and that are directly tied to researchsubject
myquery.specimen.file.run(): This will return all the files that meet the query and that are directly tied to specimen
myquery.subject.file.count.run(): This will return the count of files that meet the query and that are directly tied to subject
myquery.researchsubject.file.count.run(): This will return the count of files that meet the query and that are directly tied to researchsubject
myquery.specimen.file.count.run(): This will return the count of files that meet the query and that are directly tied to specimen
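
The rule behind the list above can be written down as a small lookup. This is a minimal sketch, not part of cdapython: `VALID_FILE_PARENTS` and `is_valid_file_chain` are hypothetical names that only encode the chains listed in this section, so a chain can be checked before it is run against the server.

```python
# Hypothetical helper, not part of cdapython: it encodes only the
# endpoints this section lists as valid parents of the file endpoint.
VALID_FILE_PARENTS = {"subject", "researchsubject", "specimen"}


def is_valid_file_chain(parent: str) -> bool:
    """Return True if `<parent>.file` is one of the chains listed above."""
    return parent in VALID_FILE_PARENTS


# diagnosis and treatment have no files attached directly, so they are rejected.
for endpoint in ("subject", "researchsubject", "specimen", "diagnosis", "treatment"):
    status = "valid" if is_valid_file_chain(endpoint) else "not valid"
    print(f"myquery.{endpoint}.file.run() is {status}")
```

A check like this fails fast on the client side instead of waiting for the server to reject the chain.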

Last update: 2022-06-22