Chaining Search Terms
Before we do any work, we need to import several functions from cdapython:
- `Q` and `query`, which power the search
- `columns`, which lets us view entity field names
- `unique_terms`, which lets us view entity field contents
We're also asking cdapython to report its version so we can be sure we're using the one we mean to.
from cdapython import Q, columns, unique_terms, query
import cdapython
print(cdapython.__version__)
Q.set_host_url("http://35.192.60.10:8080/")
2022.6.22
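One way to make that version check explicit is to compare the reported string against the version you expect. The sketch below assumes the dotted, date-style format shown in the output above. `parse_version` and `EXPECTED_VERSION` are hypothetical names used for illustration and are not part of cdapython.

```python
def parse_version(version: str) -> tuple:
    """Split a dotted version string such as '2022.6.22' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


# Hypothetical: the version this walkthrough was written against.
EXPECTED_VERSION = "2022.6.22"

# In a live session this would be cdapython.__version__.
reported = "2022.6.22"

if parse_version(reported) != parse_version(EXPECTED_VERSION):
    print(f"Warning: expected {EXPECTED_VERSION}, found {reported}")
else:
    print(f"cdapython version {reported} matches")
```

Comparing integer tuples rather than raw strings avoids ordering surprises, since as plain text "2022.10.1" would sort before "2022.6.22".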
Endpoint Chaining
We're going to build on our previous search to see what information exists about cancers that were first diagnosed in the brain.
myquery = Q('ResearchSubject.primary_diagnosis_site = "brain"')
Previously we looked at subject, research_subject, specimen and file results separately, but we can also combine these.
Let's say what we're really interested in is finding analysis done on specimens, so we're looking for files that belong to specimens that match our search. If we search the files endpoint directly, we'll get back all files that meet our search criteria, regardless of whether they are tied to specimens specifically. Instead, we can chain our query to the specimen endpoint and then to the files endpoint and get the combined result:
myqueryspecimenfiles = myquery.specimen.file.run()
myqueryspecimenfiles
Getting results from database
Total execution time: 4028 ms
QueryID: bbbaa1a9-ead3-41b2-83c9-2538464ced59 Offset: 0 Count: 100 Total Row Count: 405031 More pages: True
The Total Row Count in the output shows that we get back 405031 files that belong to specimens that meet our search criteria. As before, we can preview the results by using the `.to_dataframe()` function:
myqueryspecimenfiles.to_dataframe()
| | id | identifier | label | data_category | data_type | file_format | associated_project | drs_uri | byte_size | checksum | data_modality | imaging_modality | dbgap_accession_number | subject_id | researchsubject_id | researchsubject_specimen_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 06668371-2079-4471-a06a-7ad6f987a96c | [{'system': 'GDC', 'value': '06668371-2079-447... | 5861e7d6-2066-40d7-8868-c14a69255304.genie.ali... | Simple Nucleotide Variation | Masked Annotated Somatic Mutation | MAF | GENIE-NKI | drs://dg.4DFC:06668371-2079-4471-a06a-7ad6f987... | 1052 | e8d1a8534b9fe7a813f1b10a8a706daf | Genomic | None | None | GENIE-NKI-8IN3 | 54ada9c7-e8f7-4ecc-b829-903be7e3ad82 | 922c4731-aa49-4d6b-bbea-9a98de73438a |
1 | 094daeed-db10-41fc-90be-ece59d235896 | [{'system': 'GDC', 'value': '094daeed-db10-41f... | 5a983470-47a1-4944-8530-c96ca794b266.genie.ali... | Simple Nucleotide Variation | Masked Annotated Somatic Mutation | MAF | GENIE-GRCC | drs://dg.4DFC:094daeed-db10-41fc-90be-ece59d23... | 1843 | 40ad96e30309e253fb23f2449a70b6b8 | Genomic | None | None | GENIE-GRCC-9b65f377 | b96be1f4-f89c-418a-8dcf-5e3b6d64f50f | dd06356c-a84b-533a-b684-1b8e3f4915ae |
2 | 0e08b693-1cc5-4b91-a970-f3f588a15cb8 | [{'system': 'GDC', 'value': '0e08b693-1cc5-4b9... | c046dcd7-30ac-4d93-bfda-ee5f947b2652.genie.ali... | Simple Nucleotide Variation | Masked Annotated Somatic Mutation | MAF | GENIE-MDA | drs://dg.4DFC:0e08b693-1cc5-4b91-a970-f3f588a1... | 3145 | 482f386bf108ab7cb6aff9afd96cc142 | Genomic | None | None | GENIE-MDA-4319 | 531c5404-92c5-471b-9c7e-e3bd62dafa63 | baacb261-4c7e-4a0d-bdd4-635eb20d4bb8 |
3 | 10e9c90a-bfd6-44ca-b6da-56203af61747 | [{'system': 'GDC', 'value': '10e9c90a-bfd6-44c... | 7ced15a9-4508-4c92-8c76-e20e56d1ea07.genie.ali... | Copy Number Variation | Gene Level Copy Number Scores | TSV | GENIE-MSK | drs://dg.4DFC:10e9c90a-bfd6-44ca-b6da-56203af6... | 22932 | 352d5b9a3ea0b7edd466a6f60053b99e | Genomic | None | None | GENIE-MSK-P-0003830 | 2f6299e4-48b2-4c53-a6d3-2653abc4ca32 | c9129afe-4202-547d-afca-5f9dab738d9b |
4 | 177aefb8-32f4-404b-b811-dba5c83e0e5b | [{'system': 'GDC', 'value': '177aefb8-32f4-404... | TCGA-HT-7474-01A-21-A29Y-20_RPPA_data.tsv | Proteome Profiling | Protein Expression Quantification | TSV | TCGA-LGG | drs://dg.4DFC:177aefb8-32f4-404b-b811-dba5c83e... | 21974 | b48345a41d455fb506f2a9c74fd914a1 | Genomic | None | None | TCGA-HT-7474 | 72aa812d-5daa-4bd7-9028-ec541b1f25bd | 335f4b6b-c4b2-4624-81f2-8fc568d10f16 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | 74e4f676-380d-4d13-a8eb-4faa006a6ce0 | [{'system': 'GDC', 'value': '74e4f676-380d-4d1... | seurat.analysis.tsv | Transcriptome Profiling | Single Cell Analysis | TSV | CPTAC-3 | drs://dg.4DFC:74e4f676-380d-4d13-a8eb-4faa006a... | 2907801 | 64db788508e60bb12e417bf8b2bc5163 | Genomic | None | phs001287 | C3N-02190 | 805792e9-858e-4e9e-84cf-3578f7470889 | c6d4eec2-5d13-53ec-91eb-fd52ada341f8 |
96 | 7e3ec592-40b9-4134-a5a6-5fa1385d488f | [{'system': 'GDC', 'value': '7e3ec592-40b9-413... | TCGA-DU-A5TU-01A-21-A29Z-20_RPPA_data.tsv | Proteome Profiling | Protein Expression Quantification | TSV | TCGA-LGG | drs://dg.4DFC:7e3ec592-40b9-4134-a5a6-5fa1385d... | 22034 | 2e7141d1166c24c9a9d11a4b35d34c51 | Genomic | None | None | TCGA-DU-A5TU | 1c095c4a-df97-402e-902c-fc83bcf8ef88 | c20e4e8b-af67-40db-81b5-b8a8976ab9a1 |
97 | 9148502d-cb91-41c6-b382-62639010ff4e | [{'system': 'GDC', 'value': '9148502d-cb91-41c... | e1c60464-3f4f-400b-bbfa-0e96f826d901.genie.ali... | Simple Nucleotide Variation | Masked Annotated Somatic Mutation | MAF | GENIE-MSK | drs://dg.4DFC:9148502d-cb91-41c6-b382-62639010... | 3825 | 20d837abab0ffa04e06e5de5f35108e8 | Genomic | None | None | GENIE-MSK-P-0002799 | b018ac46-9fbb-4f68-8abb-0ec1116903b4 | 8c3b2317-5b60-5871-aba4-7eddd1bc727a |
98 | 94438748-a273-4673-aec4-687a6c1051e3 | [{'system': 'GDC', 'value': '94438748-a273-467... | TCGA-76-6656-01A-13-1900-20_RPPA_data.tsv | Proteome Profiling | Protein Expression Quantification | TSV | TCGA-GBM | drs://dg.4DFC:94438748-a273-4673-aec4-687a6c10... | 24025 | 1aedf8b68a4ac55c6f736acaaa74ee9d | Genomic | None | None | TCGA-76-6656 | 770aa1ee-aed9-4219-900e-63523cdf312f | ed2a500e-f6db-48d2-ade2-9e8f8ed004dc |
99 | a17f7c1a-da9e-4876-b5fc-63d683ccbb7e | [{'system': 'GDC', 'value': 'a17f7c1a-da9e-487... | seurat.deg.tsv | Transcriptome Profiling | Differential Gene Expression | TSV | CPTAC-3 | drs://dg.4DFC:a17f7c1a-da9e-4876-b5fc-63d683cc... | 278571 | a986dfb0a6f9d5ef35b6f3c05208677f | Genomic | None | phs001287 | C3N-02190 | 805792e9-858e-4e9e-84cf-3578f7470889 | 950a9f78-2d18-47fa-bb63-fef7197e08c9 |
100 rows × 16 columns
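The preview holds only 100 rows because the result comes back one page at a time (`Count: 100`, `More pages: True` in the query output). As a rough worked example, using only the numbers reported in that output, the full result would span the following number of pages. This is plain arithmetic and does not show how to fetch later pages, since that depends on the cdapython API.

```python
import math

total_row_count = 405031  # "Total Row Count" from the query output above
page_size = 100           # "Count" from the query output above

# Number of pages needed to cover every row, rounding the last partial page up.
pages = math.ceil(total_row_count / page_size)

# Rows on the final page (a full page if the total divides evenly).
last_page_rows = total_row_count % page_size or page_size

print(f"{pages} pages, with {last_page_rows} rows on the last page")
```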
Valid Endpoint Chains
Not all endpoints can be chained together. This is a restriction caused by the data itself: `diagnosis` and `treatment` information does not have files directly attached to it. Instead, these files are associated with the `researchsubject`. As a result, both `myquery.treatment.file.run()` and `myquery.diagnosis.file.run()` will fail, as there are no files to retrieve. Valid chains are:
- `myquery.subject.file.run()`: This will return all the files that meet the query and that are directly tied to subject
- `myquery.researchsubject.file.run()`: This will return all the files that meet the query and that are directly tied to researchsubject
- `myquery.specimen.file.run()`: This will return all the files that meet the query and that are directly tied to specimen
- `myquery.subject.file.count.run()`: This will return the count of files that meet the query and that are directly tied to subject
- `myquery.researchsubject.file.count.run()`: This will return the count of files that meet the query and that are directly tied to researchsubject
- `myquery.specimen.file.count.run()`: This will return the count of files that meet the query and that are directly tied to specimen
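The rule behind this list can be written down as a small check. The sketch below is a hypothetical helper, not part of cdapython. It only inspects a dotted chain string against the list above and never contacts the server, so it could serve as a guard before building a query.

```python
# Endpoints that have files directly attached, per the list of valid chains above.
FILE_PARENTS = {"subject", "researchsubject", "specimen"}


def is_valid_file_chain(chain: str) -> bool:
    """Return True if a dotted chain such as 'specimen.file' or
    'specimen.file.count' matches one of the valid chains listed above."""
    parts = chain.split(".")
    if len(parts) == 2:
        parent, endpoint = parts
        return parent in FILE_PARENTS and endpoint == "file"
    if len(parts) == 3:
        parent, endpoint, suffix = parts
        return parent in FILE_PARENTS and endpoint == "file" and suffix == "count"
    return False


for chain in ["specimen.file", "researchsubject.file.count", "diagnosis.file", "treatment.file"]:
    status = "valid" if is_valid_file_chain(chain) else "not valid"
    print(f"{chain}: {status}")
```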