Available Search Terms
Before we do any work, we need to import several functions from cdapython:
- Q and query, which power the search
- columns, which lets us view entity field names
- unique_terms, which lets us view entity field contents
We're also asking cdapython to report its version so we can be sure we're using the one we mean to.
from cdapython import Q, columns, unique_terms, query
print(Q.get_version())
Q.set_host_url("http://35.192.60.10:8080/")
2022.6.22
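If a notebook depends on a particular release, the reported version can also be compared in code rather than by eye. The sketch below is plain Python and does not call cdapython. It assumes the version is a string of dot-separated integers like the one printed above; `parse_version` and `MINIMUM` are hypothetical names introduced only for illustration.

```python
def parse_version(text):
    """Turn a dotted version string such as '2022.6.22' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Assumption: the oldest release this notebook was written against.
MINIMUM = (2022, 6, 22)

# Hard-coded here so the sketch runs offline; in a live session this would be
# the value reported by Q.get_version().
reported = "2022.6.22"

is_recent_enough = parse_version(reported) >= MINIMUM
print(is_recent_enough)
```

Comparing tuples of integers avoids the trap of comparing the raw strings, where "2022.10.1" would sort before "2022.6.22".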
CDA data comes from three sources:
- The Proteomic Data Commons (PDC)
- The Genomic Data Commons (GDC)
- The Imaging Data Commons (IDC)
The CDA makes this data searchable in four main endpoints:
- subject: a specific, unique individual, for example a single human. When consent allows, a given entity will have a single subject ID that can be connected to all their studies and data across all datasets
- researchsubject: a person/plant/animal/microbe within a given study. An individual who participates in 3 studies will have 3 researchsubject IDs
- specimen: a tissue sample taken from a given subject, or a portion of the original sample. A given specimen will have only a single subject ID and a single researchsubject ID
- file: a unit of data about subjects, researchsubjects, specimens, or their associated information
and two endpoints that offer deeper information about data in the researchsubject endpoint:
- diagnosis: information about what medical diagnosis a researchsubject has
- treatment: information about what medical treatment(s) were performed for a given diagnosis
You can search any metadata field from any endpoint; the only difference between search types is what type of data is returned by default. This means that you can think of the CDA as a really, really enormous spreadsheet full of data. To search this enormous spreadsheet, you'd want to select columns and then filter rows.
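To make the spreadsheet picture concrete, here is a minimal sketch of "select columns, then filter rows" in plain Python. It does not use cdapython: the rows are invented for illustration and are not real CDA records, and `select_columns` and `filter_rows` are hypothetical helper names. Only the field names are taken from the CDA.

```python
# Toy rows standing in for the "enormous spreadsheet". The values are made up
# for illustration; only the column names come from the CDA field list.
rows = [
    {"id": "subject-1", "sex": "female", "ResearchSubject.primary_diagnosis_site": "Lung"},
    {"id": "subject-2", "sex": "male", "ResearchSubject.primary_diagnosis_site": "Kidney"},
    {"id": "subject-3", "sex": "female", "ResearchSubject.primary_diagnosis_site": "Kidney"},
]


def filter_rows(table, column, value):
    """Keep only the rows whose `column` equals `value`."""
    return [row for row in table if row[column] == value]


def select_columns(table, names):
    """Keep only the named columns in every row."""
    return [{name: row[name] for name in names} for row in table]


kidney_rows = filter_rows(rows, "ResearchSubject.primary_diagnosis_site", "Kidney")
result = select_columns(kidney_rows, ["id", "sex"])
print(result)
```

The row filter uses a field that is not in the final selection, which mirrors the point above: any field can be searched, whatever kind of data is returned.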
Accordingly, to see what search fields are available, we use the command columns:
columns()
QueryID: cb45e354-514e-43ec-a2a6-8f3eca1696a0 Offset: 0 Count: 64 Total Row Count: 64 More pages: False
This output tells us that there are 64 searchable fields, but it doesn't output them directly. Running CDA commands like this first gives you an overall summary of the data you're going to get, which is useful as a gut check. However, if we want to see the data on our screen, we can have columns() print out its contents to a list instead:
columns().to_list()
['File.id', 'File.identifier.system', 'File.identifier.value', 'File.label', 'File.data_category', 'File.data_type', 'File.file_format', 'File.associated_project', 'File.drs_uri', 'File.byte_size', 'File.checksum', 'File.data_modality', 'File.imaging_modality', 'File.dbgap_accession_number', 'id', 'identifier.system', 'identifier.value', 'species', 'sex', 'race', 'ethnicity', 'days_to_birth', 'subject_associated_project', 'vital_status', 'age_at_death', 'cause_of_death', 'ResearchSubject.id', 'ResearchSubject.identifier.system', 'ResearchSubject.identifier.value', 'ResearchSubject.member_of_research_project', 'ResearchSubject.primary_diagnosis_condition', 'ResearchSubject.primary_diagnosis_site', 'ResearchSubject.Diagnosis.id', 'ResearchSubject.Diagnosis.identifier.system', 'ResearchSubject.Diagnosis.identifier.value', 'ResearchSubject.Diagnosis.primary_diagnosis', 'ResearchSubject.Diagnosis.age_at_diagnosis', 'ResearchSubject.Diagnosis.morphology', 'ResearchSubject.Diagnosis.stage', 'ResearchSubject.Diagnosis.grade', 'ResearchSubject.Diagnosis.method_of_diagnosis', 'ResearchSubject.Diagnosis.Treatment.id', 'ResearchSubject.Diagnosis.Treatment.identifier.system', 'ResearchSubject.Diagnosis.Treatment.identifier.value', 'ResearchSubject.Diagnosis.Treatment.treatment_type', 'ResearchSubject.Diagnosis.Treatment.treatment_outcome', 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_start', 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_end', 'ResearchSubject.Diagnosis.Treatment.therapeutic_agent', 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site', 'ResearchSubject.Diagnosis.Treatment.treatment_effect', 'ResearchSubject.Diagnosis.Treatment.treatment_end_reason', 'ResearchSubject.Diagnosis.Treatment.number_of_cycles', 'ResearchSubject.Specimen.id', 'ResearchSubject.Specimen.identifier.system', 'ResearchSubject.Specimen.identifier.value', 'ResearchSubject.Specimen.associated_project', 'ResearchSubject.Specimen.age_at_collection', 
'ResearchSubject.Specimen.primary_disease_type', 'ResearchSubject.Specimen.anatomical_site', 'ResearchSubject.Specimen.source_material_type', 'ResearchSubject.Specimen.specimen_type', 'ResearchSubject.Specimen.derived_from_specimen', 'ResearchSubject.Specimen.derived_from_subject']
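A flat list of 64 names is hard to scan, so it can help to group the fields by the entity they belong to. The sketch below is plain Python working on a few names copied from the list above. It assumes that entity names start with an uppercase letter and field names do not, which holds for this output but is not something cdapython guarantees; `entity_prefix` and `group_fields` are hypothetical helpers.

```python
from collections import defaultdict

# A handful of field names copied from the columns() output above.
fields = [
    "File.id",
    "File.identifier.system",
    "id",
    "species",
    "ResearchSubject.id",
    "ResearchSubject.Diagnosis.stage",
    "ResearchSubject.Diagnosis.Treatment.treatment_type",
    "ResearchSubject.Specimen.specimen_type",
]


def entity_prefix(field):
    """Return the leading capitalized segments of a dotted field name.

    Assumption: entity names start with an uppercase letter and plain
    field names do not, as in the list above.
    """
    entity_parts = []
    for part in field.split("."):
        if not part[:1].isupper():
            break
        entity_parts.append(part)
    return ".".join(entity_parts) or "(no prefix)"


def group_fields(names):
    """Group field names under their entity prefix, keeping input order."""
    groups = defaultdict(list)
    for name in names:
        groups[entity_prefix(name)].append(name)
    return dict(groups)


grouped = group_fields(fields)
for prefix, members in grouped.items():
    print(prefix, len(members))
```

In a live session, the same helper could be given the full result of columns().to_list() instead of the short sample.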
By default, columns() returns the first 100 items. If that is too many, you can limit your search to only a specified number:
columns(limit=10).to_list()
['File.id', 'File.identifier.system', 'File.identifier.value', 'File.label', 'File.data_category', 'File.data_type', 'File.file_format', 'File.associated_project', 'File.drs_uri', 'File.byte_size']
Or you can filter the list for terms that match your interests:
columns().to_list(filters="diagnosis")
['ResearchSubject.primary_diagnosis_condition', 'ResearchSubject.primary_diagnosis_site', 'ResearchSubject.Diagnosis.id', 'ResearchSubject.Diagnosis.identifier.system', 'ResearchSubject.Diagnosis.identifier.value', 'ResearchSubject.Diagnosis.primary_diagnosis', 'ResearchSubject.Diagnosis.age_at_diagnosis', 'ResearchSubject.Diagnosis.morphology', 'ResearchSubject.Diagnosis.stage', 'ResearchSubject.Diagnosis.grade', 'ResearchSubject.Diagnosis.method_of_diagnosis', 'ResearchSubject.Diagnosis.Treatment.id', 'ResearchSubject.Diagnosis.Treatment.identifier.system', 'ResearchSubject.Diagnosis.Treatment.identifier.value', 'ResearchSubject.Diagnosis.Treatment.treatment_type', 'ResearchSubject.Diagnosis.Treatment.treatment_outcome', 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_start', 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_end', 'ResearchSubject.Diagnosis.Treatment.therapeutic_agent', 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site', 'ResearchSubject.Diagnosis.Treatment.treatment_effect', 'ResearchSubject.Diagnosis.Treatment.treatment_end_reason', 'ResearchSubject.Diagnosis.Treatment.number_of_cycles']
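Notice that the filter matched both "primary_diagnosis_condition" and "Diagnosis.id", so it appears to behave like a case-insensitive substring match. The sketch below reproduces that behavior in plain Python. It is an inference from the output above, not cdapython's actual implementation, and `filter_terms` is a hypothetical name.

```python
def filter_terms(terms, text):
    """Keep the terms that contain `text` anywhere, ignoring case."""
    needle = text.lower()
    return [term for term in terms if needle in term.lower()]


# A few field names copied from the columns() output above.
sample = [
    "File.id",
    "sex",
    "ResearchSubject.primary_diagnosis_site",
    "ResearchSubject.Diagnosis.stage",
    "ResearchSubject.Specimen.id",
]

matches = filter_terms(sample, "diagnosis")
print(matches)
```

Lower-casing both sides is what lets a single search word catch the lowercase field name and the capitalized entity name at once.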
We can directly get information about what data populates any of these fields using the unique_terms() function. Like columns, unique_terms defaults to giving us an overview of the results, and we view them the same way:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list()
['Abdomen', 'Abdomen, Mediastinum', 'Adrenal Glands', 'Adrenal gland', 'Anus and anal canal', 'Base of tongue', 'Bile Duct', 'Bladder', 'Bones, joints and articular cartilage of limbs', 'Bones, joints and articular cartilage of other and unspecified sites', 'Brain', 'Breast', 'Bronchus and lung', 'Cervix', 'Cervix uteri', 'Chest', 'Chest-Abdomen-Pelvis, Leg, TSpine', 'Colon', 'Connective, subcutaneous and other soft tissues', 'Corpus uteri', 'Ear', 'Esophagus', 'Extremities', 'Eye and adnexa', 'Floor of mouth', 'Gallbladder', 'Gum', 'Head', 'Head and Neck', 'Head-Neck', 'Heart, mediastinum, and pleura', 'Hematopoietic and reticuloendothelial systems', 'Hypopharynx', 'Intraocular', 'Kidney', 'Larynx', 'Lip', 'Liver', 'Liver and intrahepatic bile ducts', 'Lung', 'Lung Phantom', 'Lymph nodes', 'Marrow, Blood', 'Meninges', 'Mesothelium', 'Nasal cavity and middle ear', 'Nasopharynx', 'Not Reported', 'Oropharynx', 'Other and ill-defined digestive organs', 'Other and ill-defined sites', 'Other and ill-defined sites in lip, oral cavity and pharynx', 'Other and ill-defined sites within respiratory system and intrathoracic organs', 'Other and unspecified female genital organs', 'Other and unspecified major salivary glands', 'Other and unspecified male genital organs', 'Other and unspecified parts of biliary tract', 'Other and unspecified parts of mouth', 'Other and unspecified parts of tongue', 'Other and unspecified urinary organs', 'Other endocrine glands and related structures', 'Ovary', 'Palate', 'Pancreas', 'Pancreas ', 'Pelvis, Prostate, Anus', 'Penis', 'Peripheral nerves and autonomic nervous system', 'Phantom', 'Prostate', 'Prostate gland', 'Rectosigmoid junction', 'Rectum', 'Renal pelvis', 'Retroperitoneum and peritoneum', 'Skin', 'Small intestine', 'Spinal cord, cranial nerves, and other parts of central nervous system', 'Stomach', 'Testicles', 'Testis', 'Thymus', 'Thyroid', 'Thyroid gland', 'Tonsil', 'Trachea', 'Unknown', 'Ureter', 'Uterus', 'Uterus, NOS', 
'Vagina', 'Various', 'Various (11 locations)', 'Vulva']
We can use the same trick here to search for only diagnosis sites that we're interested in:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="lung")
['Bronchus and lung', 'Lung', 'Lung Phantom']
We can use this same logic to look for partial matches. For instance, if I'm not sure whether the data I'm interested in would be labeled as "uterine" or "uterus", I might search for just "uter":
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="uter")
['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']
Success! Not only are there multiple ways that "Uterus" is specified in the CDA data, I now also know that there are data for specific uterine tissues.
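The same partial-match idea extends to checking several word stems in one pass, which is a quick way to spot sites that are recorded under more than one label. The sketch below is plain Python over a few values copied from the unique_terms() output above; `matches_by_stem` is a hypothetical helper and it assumes the same case-insensitive substring matching seen in the earlier results.

```python
# Site labels copied from the unique_terms() output above.
sites = [
    "Cervix uteri",
    "Corpus uteri",
    "Kidney",
    "Lung",
    "Prostate",
    "Prostate gland",
    "Thyroid",
    "Thyroid gland",
    "Uterus",
    "Uterus, NOS",
]


def matches_by_stem(terms, stems):
    """Map each stem to the terms that contain it, ignoring case."""
    found = {}
    for stem in stems:
        needle = stem.lower()
        found[stem] = [term for term in terms if needle in term.lower()]
    return found


found = matches_by_stem(sites, ["uter", "thyroid", "prostat"])
for stem, labels in found.items():
    print(stem, labels)
```

Any stem that comes back with more than one label is a hint that a later search should include all of those spellings.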
Explore the available terms by changing which table, how many results, and which unique terms you request. Once you have found terms you're interested in, head to Basic Search to build simple queries.