Available Search Terms

Before we do any work, we need to import several functions from cdapython:

Q and query, which power the search
columns, which lets us view entity field names
unique_terms, which lets us view entity field contents

We're also asking cdapython to report its version so we can be sure we're using the one we mean to.

In [1]:

from cdapython import Q, columns, unique_terms, query
print(Q.get_version())
Q.set_host_url("http://35.192.60.10:8080/")

2022.6.22
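
If a notebook depends on a particular release, the printed version can also be checked in code. The sketch below is a minimal example and not part of cdapython: `parse_version`, `is_recent_enough`, and `MINIMUM_VERSION` are hypothetical names, and it assumes version strings are dot-separated integers like the `2022.6.22` shown above.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as '2022.6.22' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


# Hypothetical minimum release this notebook was written against.
MINIMUM_VERSION = "2022.6.22"


def is_recent_enough(installed: str, minimum: str = MINIMUM_VERSION) -> bool:
    """Compare versions numerically, so '2022.10.1' counts as newer than '2022.6.22'."""
    return parse_version(installed) >= parse_version(minimum)


# In a live session the installed version would come from Q.get_version().
print(is_recent_enough("2022.6.22"))
print(is_recent_enough("2022.10.1"))
print(is_recent_enough("2022.5.30"))
```

Comparing tuples of integers rather than raw strings avoids the pitfall where `"2022.10.1"` sorts before `"2022.6.22"` alphabetically.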

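CDA data comes from three sources: the Proteomic Data Commons (PDC), the Genomic Data Commons (GDC), and the Imaging Data Commons (IDC).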

The CDA makes this data searchable in four main endpoints:

subject: A specific, unique individual: for example, a single human. When consent allows, a given entity will have a single subject ID that can be connected to all their studies and data across all datasets
researchsubject: A person/plant/animal/microbe within a given study. An individual who participates in 3 studies will have 3 researchsubject IDs
specimen: A tissue sample taken from a given subject, or a portion of the original sample. A given specimen will have only a single subject ID and a single researchsubject ID
file: A unit of data about subjects, researchsubjects, specimens, or their associated information

and two endpoints that offer deeper information about data in the researchsubject endpoint:

diagnosis: Information about what medical diagnosis a researchsubject has
treatment: Information about what medical treatment(s) were performed for a given diagnosis

You can search any metadata field from any endpoint; the only difference between search types is what type of data is returned by default. This means that you can think of the CDA as a really, really enormous spreadsheet full of data. To search this enormous spreadsheet, you'd want to select columns, and then filter rows.
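
To make the spreadsheet picture concrete, here is a minimal sketch in plain Python that filters rows and then selects columns. It does not call cdapython: the helper names `filter_rows` and `select_columns` are hypothetical, the field names are taken from the CDA field list, and the row values are invented for illustration.

```python
# A tiny stand-in for the "enormous spreadsheet": one dict per row.
# Field names come from the CDA field list; the values are made up.
rows = [
    {"id": "subject-1", "sex": "female", "ResearchSubject.primary_diagnosis_site": "Lung"},
    {"id": "subject-2", "sex": "male", "ResearchSubject.primary_diagnosis_site": "Kidney"},
    {"id": "subject-3", "sex": "male", "ResearchSubject.primary_diagnosis_site": "Lung"},
]


def filter_rows(table, column, value):
    """Keep only the rows whose `column` equals `value`."""
    return [row for row in table if row[column] == value]


def select_columns(table, wanted):
    """Keep only the `wanted` columns of every row."""
    return [{name: row[name] for name in wanted} for row in table]


lung_rows = filter_rows(rows, "ResearchSubject.primary_diagnosis_site", "Lung")
result = select_columns(lung_rows, ["id", "sex"])
print(result)
```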

Accordingly, to see what search fields are available, we use the command columns:

In [2]:

columns()

Out[2]:

            QueryID: cb45e354-514e-43ec-a2a6-8f3eca1696a0
            
            Offset: 0
            Count: 64
            Total Row Count: 64
            More pages: False

This output tells us that there are 64 searchable fields, but it doesn't output them directly. Running CDA commands like this first gives you an overall summary of the data you're going to get, and so is nice for doing a gut check. However, if we want to see the data on our screen, we can have columns() print out its contents as a list instead:

In [3]:

columns().to_list()

Out[3]:

['File.id',
 'File.identifier.system',
 'File.identifier.value',
 'File.label',
 'File.data_category',
 'File.data_type',
 'File.file_format',
 'File.associated_project',
 'File.drs_uri',
 'File.byte_size',
 'File.checksum',
 'File.data_modality',
 'File.imaging_modality',
 'File.dbgap_accession_number',
 'id',
 'identifier.system',
 'identifier.value',
 'species',
 'sex',
 'race',
 'ethnicity',
 'days_to_birth',
 'subject_associated_project',
 'vital_status',
 'age_at_death',
 'cause_of_death',
 'ResearchSubject.id',
 'ResearchSubject.identifier.system',
 'ResearchSubject.identifier.value',
 'ResearchSubject.member_of_research_project',
 'ResearchSubject.primary_diagnosis_condition',
 'ResearchSubject.primary_diagnosis_site',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.identifier.system',
 'ResearchSubject.Diagnosis.identifier.value',
 'ResearchSubject.Diagnosis.primary_diagnosis',
 'ResearchSubject.Diagnosis.age_at_diagnosis',
 'ResearchSubject.Diagnosis.morphology',
 'ResearchSubject.Diagnosis.stage',
 'ResearchSubject.Diagnosis.grade',
 'ResearchSubject.Diagnosis.method_of_diagnosis',
 'ResearchSubject.Diagnosis.Treatment.id',
 'ResearchSubject.Diagnosis.Treatment.identifier.system',
 'ResearchSubject.Diagnosis.Treatment.identifier.value',
 'ResearchSubject.Diagnosis.Treatment.treatment_type',
 'ResearchSubject.Diagnosis.Treatment.treatment_outcome',
 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_start',
 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_end',
 'ResearchSubject.Diagnosis.Treatment.therapeutic_agent',
 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site',
 'ResearchSubject.Diagnosis.Treatment.treatment_effect',
 'ResearchSubject.Diagnosis.Treatment.treatment_end_reason',
 'ResearchSubject.Diagnosis.Treatment.number_of_cycles',
 'ResearchSubject.Specimen.id',
 'ResearchSubject.Specimen.identifier.system',
 'ResearchSubject.Specimen.identifier.value',
 'ResearchSubject.Specimen.associated_project',
 'ResearchSubject.Specimen.age_at_collection',
 'ResearchSubject.Specimen.primary_disease_type',
 'ResearchSubject.Specimen.anatomical_site',
 'ResearchSubject.Specimen.source_material_type',
 'ResearchSubject.Specimen.specimen_type',
 'ResearchSubject.Specimen.derived_from_specimen',
 'ResearchSubject.Specimen.derived_from_subject']
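
The dotted prefixes in this list show which entity each field belongs to. As a rough sketch in plain Python, the names can be grouped by their capitalized path segments. The helper `entity_of` is hypothetical, and the sketch assumes that entity names are capitalized while field names are lowercase, and that unprefixed fields such as `sex` belong to the subject.

```python
from collections import defaultdict

# A few field names copied from the list above.
field_names = [
    "File.id",
    "File.label",
    "species",
    "sex",
    "ResearchSubject.id",
    "ResearchSubject.Diagnosis.stage",
    "ResearchSubject.Diagnosis.Treatment.treatment_type",
    "ResearchSubject.Specimen.specimen_type",
    "File.identifier.system",
]


def entity_of(field_name: str) -> str:
    """Join the capitalized path segments, or return 'Subject' if there are none.

    Assumption: entity names start with an uppercase letter, field names do not.
    """
    segments = [part for part in field_name.split(".") if part[:1].isupper()]
    return ".".join(segments) if segments else "Subject"


grouped = defaultdict(list)
for name in field_names:
    grouped[entity_of(name)].append(name)

for entity, names in grouped.items():
    print(entity, names)
```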

By default, columns() returns the first 100 items. If that is too many, you can limit the result to a specified number of items:

In [4]:

columns(limit=10).to_list()

Out[4]:

['File.id',
 'File.identifier.system',
 'File.identifier.value',
 'File.label',
 'File.data_category',
 'File.data_type',
 'File.file_format',
 'File.associated_project',
 'File.drs_uri',
 'File.byte_size']

Or you can filter the list for terms that match your interests:

In [5]:

columns().to_list(filters="diagnosis")

Out[5]:

['ResearchSubject.primary_diagnosis_condition',
 'ResearchSubject.primary_diagnosis_site',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.identifier.system',
 'ResearchSubject.Diagnosis.identifier.value',
 'ResearchSubject.Diagnosis.primary_diagnosis',
 'ResearchSubject.Diagnosis.age_at_diagnosis',
 'ResearchSubject.Diagnosis.morphology',
 'ResearchSubject.Diagnosis.stage',
 'ResearchSubject.Diagnosis.grade',
 'ResearchSubject.Diagnosis.method_of_diagnosis',
 'ResearchSubject.Diagnosis.Treatment.id',
 'ResearchSubject.Diagnosis.Treatment.identifier.system',
 'ResearchSubject.Diagnosis.Treatment.identifier.value',
 'ResearchSubject.Diagnosis.Treatment.treatment_type',
 'ResearchSubject.Diagnosis.Treatment.treatment_outcome',
 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_start',
 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_end',
 'ResearchSubject.Diagnosis.Treatment.therapeutic_agent',
 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site',
 'ResearchSubject.Diagnosis.Treatment.treatment_effect',
 'ResearchSubject.Diagnosis.Treatment.treatment_end_reason',
 'ResearchSubject.Diagnosis.Treatment.number_of_cycles']
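
Note that the Treatment fields appear in this result because their full path contains `Diagnosis`. The output is consistent with a case-insensitive substring match. The sketch below reproduces that behavior in plain Python under that assumption; `filter_terms` is a hypothetical helper, not cdapython's implementation.

```python
def filter_terms(terms, text):
    """Keep the terms that contain `text`, ignoring upper/lower case."""
    needle = text.lower()
    return [term for term in terms if needle in term.lower()]


# A few field names copied from the full list of columns.
fields = [
    "File.data_type",
    "ResearchSubject.primary_diagnosis_site",
    "ResearchSubject.Diagnosis.stage",
    "ResearchSubject.Diagnosis.Treatment.treatment_type",
    "ResearchSubject.Specimen.specimen_type",
]

matches = filter_terms(fields, "diagnosis")
print(matches)
```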

Check your search criteria! While available search fields may look like ones you've seen in PDC, GDC or IDC, that does not mean they will contain exactly the same information; several are renamed or restructured in the CDA model. The field name mappings are described in CDA Schema Field Mapping.

We can directly get information about what data populates any of these fields using the unique_terms() function. Like columns, unique_terms defaults to giving us an overview of the results, and we view them the same way:

In [6]:

unique_terms("ResearchSubject.primary_diagnosis_site").to_list()

Out[6]:

['Abdomen',
 'Abdomen, Mediastinum',
 'Adrenal Glands',
 'Adrenal gland',
 'Anus and anal canal',
 'Base of tongue',
 'Bile Duct',
 'Bladder',
 'Bones, joints and articular cartilage of limbs',
 'Bones, joints and articular cartilage of other and unspecified sites',
 'Brain',
 'Breast',
 'Bronchus and lung',
 'Cervix',
 'Cervix uteri',
 'Chest',
 'Chest-Abdomen-Pelvis, Leg, TSpine',
 'Colon',
 'Connective, subcutaneous and other soft tissues',
 'Corpus uteri',
 'Ear',
 'Esophagus',
 'Extremities',
 'Eye and adnexa',
 'Floor of mouth',
 'Gallbladder',
 'Gum',
 'Head',
 'Head and Neck',
 'Head-Neck',
 'Heart, mediastinum, and pleura',
 'Hematopoietic and reticuloendothelial systems',
 'Hypopharynx',
 'Intraocular',
 'Kidney',
 'Larynx',
 'Lip',
 'Liver',
 'Liver and intrahepatic bile ducts',
 'Lung',
 'Lung Phantom',
 'Lymph nodes',
 'Marrow, Blood',
 'Meninges',
 'Mesothelium',
 'Nasal cavity and middle ear',
 'Nasopharynx',
 'Not Reported',
 'Oropharynx',
 'Other and ill-defined digestive organs',
 'Other and ill-defined sites',
 'Other and ill-defined sites in lip, oral cavity and pharynx',
 'Other and ill-defined sites within respiratory system and intrathoracic organs',
 'Other and unspecified female genital organs',
 'Other and unspecified major salivary glands',
 'Other and unspecified male genital organs',
 'Other and unspecified parts of biliary tract',
 'Other and unspecified parts of mouth',
 'Other and unspecified parts of tongue',
 'Other and unspecified urinary organs',
 'Other endocrine glands and related structures',
 'Ovary',
 'Palate',
 'Pancreas',
 'Pancreas ',
 'Pelvis, Prostate, Anus',
 'Penis',
 'Peripheral nerves and autonomic nervous system',
 'Phantom',
 'Prostate',
 'Prostate gland',
 'Rectosigmoid junction',
 'Rectum',
 'Renal pelvis',
 'Retroperitoneum and peritoneum',
 'Skin',
 'Small intestine',
 'Spinal cord, cranial nerves, and other parts of central nervous system',
 'Stomach',
 'Testicles',
 'Testis',
 'Thymus',
 'Thyroid',
 'Thyroid gland',
 'Tonsil',
 'Trachea',
 'Unknown',
 'Ureter',
 'Uterus',
 'Uterus, NOS',
 'Vagina',
 'Various',
 'Various (11 locations)',
 'Vulva']

We can use the same trick here to search for only diagnosis sites that we're interested in:

In [7]:

unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="lung")

Out[7]:

['Bronchus and lung', 'Lung', 'Lung Phantom']

We can use this same logic to look for partial matches. For instance, if I'm not sure whether the data I'm interested in would be labeled as "uterine" or "uterus", I might search for just "uter":

In [8]:

unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="uter")

Out[8]:

['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']

Success! Not only are there multiple ways that "Uterus" is specified in the CDA data, I now also know that there are data for specific uterine tissues.

Check your search terms! If you run into unexpected results when running a search, be sure that you're searching all the terms you want. CDA data is not yet harmonized across centers, so there are many cases where a single-term search will not return all the information you need. However, the CDA provides tools that make it easy to search all forms of a term to enable cross-dataset search.
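
One quick way to spot unharmonized spellings in a list of unique terms is to group values that differ only in case, spacing, or punctuation. The following is a plain-Python sketch, not the CDA tooling mentioned above; `normalize` is a hypothetical helper, and the sample values are copied from the primary_diagnosis_site output.

```python
import re
from collections import defaultdict

# Values copied from the primary_diagnosis_site output; note the
# trailing space in the second 'Pancreas ' entry.
sites = [
    "Adrenal Glands",
    "Adrenal gland",
    "Pancreas",
    "Pancreas ",
    "Thyroid",
    "Thyroid gland",
]


def normalize(term: str) -> str:
    """Lowercase the term and collapse punctuation/whitespace runs to one space."""
    return re.sub(r"[^a-z0-9]+", " ", term.lower()).strip()


grouped = defaultdict(list)
for site in sites:
    grouped[normalize(site)].append(site)

# Keep only the normalized keys that have more than one raw spelling.
variants = {key: spellings for key, spellings in grouped.items() if len(spellings) > 1}
print(variants)
```

This only catches superficial differences. Pairs such as 'Adrenal Glands' and 'Adrenal gland', or 'Thyroid' and 'Thyroid gland', differ in wording and would still need to be collected by hand or with a partial-match filter.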

Explore the available terms by changing which table, how many results, and which unique terms you request. Once you have found terms you're interested in, head to Basic Search to build simple queries.

Last update: 2022-06-15