Advanced Search with Operators¶
Before we do any work, we need to import several functions from cdapython:
Q
andquery
which power the searchcolumns
which lets us view entity field namesunique_terms
which lets view entity field contents
We're also asking cdapython to report it's version so we can be sure we're using the one we mean to.
from cdapython import Q, columns, unique_terms, query
print(Q.get_version())
Q.set_host_url("http://35.192.60.10:8080/")
2022.6.22
The CDA provides a custom python tool for searching CDA data. Q
(short for Query) offers several ways to search and filter data, and several input modes:
- Q.() builds a query that can be used by
run()
orcount()
- Q.run() returns data for the specified search
- Q.count() returns summary information (counts) data that fit the specified search
Operators allow us to make more complex queries by adding, subtracting, or filtering data.
Q
uses the following operators:
=
: Equals!=
: Not EqualOR
AND
IN
andNOT IN
%
: pattern matching a wildcardIS
andIS NOT
>
,>
,>=
,<=
: Greater and Less than
We use these operators to build more and more complex Q statements before sending our query to run()
or count()
.
Equals: =
¶
In the other tutorials, we have always used the same query, which uses the =
operator.
Q('ResearchSubject.primary_diagnosis_site = "brain"')
This operator will only return data where the primary_diagnosis_site is exactly "brain". Here let's to a similar search, but for "uterus". We'll look at the researchsubject summary:
Q('ResearchSubject.primary_diagnosis_site = "uterus"').researchsubject.count.run()
Getting results from database
Total execution time: 4404 ms
total : 867
files : 242362
system | count |
---|---|
IDC | 867 |
primary_diagnosis_condition | count |
---|---|
None | 867 |
primary_diagnosis_site | count |
---|---|
Uterus | 867 |
Not Equal: !=
¶
The !=
operator does the opposite of the =
operator, it returns everything that is not exactly the term you give it:
Q('ResearchSubject.primary_diagnosis_site != "uterus"').researchsubject.count.run()
Getting results from database
Total execution time: 3605 ms
total : 148831
files : 39151844
system | count |
---|---|
GDC | 85416 |
IDC | 61081 |
PDC | 2334 |
primary_diagnosis_condition | count |
---|---|
Cystic, Mucinous and Serous Neoplasms | 3723 |
Gliomas | 4772 |
Ductal and Lobular Neoplasms | 7870 |
Adenomas and Adenocarcinomas | 32730 |
Complex Mixed and Stromal Neoplasms | 1826 |
Breast Invasive Carcinoma | 251 |
Nevi and Melanomas | 3155 |
Squamous Cell Neoplasms | 5076 |
Transitional Cell Papillomas and Carcinomas | 1885 |
Plasma Cell Tumors | 1066 |
Neuroepitheliomatous Neoplasms | 1332 |
None | 61083 |
Thymic Epithelial Neoplasms | 262 |
Myeloid Leukemias | 3965 |
Synovial-like Neoplasms | 98 |
Epithelial Neoplasms, NOS | 5694 |
Glioblastoma | 100 |
Other | 206 |
Lymphoid Leukemias | 2072 |
Paragangliomas and Glomus Tumors | 241 |
Colon Adenocarcinoma | 164 |
Lung Squamous Cell Carcinoma | 118 |
Fibromatous Neoplasms | 322 |
Germ Cell Neoplasms | 703 |
Osseous and Chondromatous Neoplasms | 615 |
Mature B-Cell Lymphomas | 1019 |
Lung Adenocarcinoma | 216 |
Specialized Gonadal Neoplasms | 124 |
Mesothelial Neoplasms | 647 |
Myomatous Neoplasms | 632 |
Complex Epithelial Neoplasms | 254 |
Pancreatic Ductal Adenocarcinoma | 144 |
Nerve Sheath Tumors | 115 |
Myelodysplastic Syndromes | 386 |
Ovarian Serous Cystadenocarcinoma | 283 |
Uterine Corpus Endometrial Carcinoma | 104 |
Hepatocellular Carcinoma | 170 |
Clear Cell Renal Cell Carcinoma | 116 |
Soft Tissue Tumors and Sarcomas, NOS | 315 |
Rectum Adenocarcinoma | 30 |
Neoplasms, NOS | 1251 |
Mature T- and NK-Cell Lymphomas | 94 |
Acinar Cell Neoplasms | 300 |
Oral Squamous Cell Carcinoma | 38 |
Not Applicable | 440 |
Pediatric/AYA Brain Tumors | 199 |
Not Reported | 271 |
Head and Neck Squamous Cell Carcinoma | 110 |
Early Onset Gastric Cancer | 80 |
Lipomatous Neoplasms | 343 |
Leukemias, NOS | 118 |
Adnexal and Skin Appendage Neoplasms | 58 |
Miscellaneous Tumors | 89 |
Blood Vessel Tumors | 156 |
Mucoepidermoid Neoplasms | 60 |
Chronic Myeloproliferative Disorders | 476 |
Meningiomas | 289 |
Unknown | 63 |
Miscellaneous Bone Tumors | 130 |
Mesonephromas | 5 |
Basal Cell Neoplasms | 45 |
Myxomatous Neoplasms | 18 |
Other Hematologic Disorders | 20 |
Other Leukemias | 68 |
Immunoproliferative Diseases | 4 |
Fibroepithelial Neoplasms | 25 |
Precursor Cell Lymphoblastic Lymphoma | 12 |
Malignant Lymphomas, NOS or Diffuse | 42 |
Neoplasms of Histiocytes and Accessory Lymphoid Cells | 66 |
Trophoblastic neoplasms | 21 |
Mast Cell Tumors | 10 |
Giant Cell Tumors | 3 |
Hodgkin Lymphoma | 11 |
Papillary Renal Cell Carcinoma | 2 |
Granular Cell Tumors and Alveolar Soft Part Sarcomas | 23 |
Odontogenic Tumors | 3 |
Lymphatic Vessel Tumors | 1 |
Acute Myeloid Leukemia | 2 |
Chromophobe Renal Cell Carcinoma | 1 |
primary_diagnosis_site | count |
---|---|
Breast | 21945 |
Hematopoietic and reticuloendothelial systems | 9007 |
Not Reported | 506 |
Prostate gland | 2354 |
Kidney | 4788 |
Bronchus and lung | 12256 |
Adrenal gland | 851 |
Thyroid gland | 1880 |
Brain | 2923 |
Head and Neck | 148 |
Retroperitoneum and peritoneum | 384 |
Pancreas | 3352 |
Connective, subcutaneous and other soft tissues | 1573 |
Skin | 3497 |
Head-Neck | 2704 |
Colon | 8559 |
Corpus uteri | 780 |
Cervix uteri | 915 |
Ovary | 4346 |
Liver and intrahepatic bile ducts | 1609 |
Lymph nodes | 538 |
Bladder | 2155 |
Other and unspecified major salivary glands | 615 |
Other and ill-defined sites | 1186 |
Other and unspecified parts of tongue | 133 |
Chest | 28221 |
Lung | 4848 |
Rectum | 1308 |
Uterus, NOS | 1998 |
Stomach | 1870 |
Unknown | 3232 |
Other and ill-defined sites in lip, oral cavity and pharynx | 361 |
Larynx | 169 |
Other and unspecified urinary organs | 217 |
Spinal cord, cranial nerves, and other parts of central nervous system | 3703 |
Oropharynx | 194 |
Bones, joints and articular cartilage of limbs | 268 |
Heart, mediastinum, and pleura | 706 |
Esophagus | 1580 |
Rectosigmoid junction | 81 |
Eye and adnexa | 222 |
Abdomen | 92 |
Penis | 33 |
Other and unspecified parts of mouth | 43 |
Thymus | 431 |
Other and ill-defined digestive organs | 719 |
Bones, joints and articular cartilage of other and unspecified sites | 455 |
Other and unspecified female genital organs | 160 |
Various | 449 |
Testis | 542 |
Liver | 622 |
Floor of mouth | 56 |
Gum | 11 |
Abdomen, Mediastinum | 176 |
Various (11 locations) | 89 |
Prostate | 2139 |
Pelvis, Prostate, Anus | 58 |
Phantom | 33 |
Extremities | 51 |
Testicles | 150 |
Thyroid | 507 |
Chest-Abdomen-Pelvis, Leg, TSpine | 261 |
Adrenal Glands | 271 |
Mesothelium | 87 |
Cervix | 307 |
Head | 105 |
Ear | 242 |
Base of tongue | 24 |
Other and unspecified parts of biliary tract | 224 |
Hypopharynx | 25 |
Small intestine | 269 |
Bile Duct | 51 |
Tonsil | 46 |
None | 340 |
Peripheral nerves and autonomic nervous system | 418 |
Trachea | 7 |
Anus and anal canal | 233 |
Marrow, Blood | 89 |
Meninges | 243 |
Other endocrine glands and related structures | 181 |
Gallbladder | 265 |
Intraocular | 80 |
Nasopharynx | 101 |
Vagina | 72 |
Nasal cavity and middle ear | 40 |
Vulva | 10 |
Lung Phantom | 8 |
Lip | 9 |
Renal pelvis | 1 |
Palate | 5 |
Ureter | 15 |
Pancreas | 1 |
Other and ill-defined sites within respiratory system and intrathoracic organs | 2 |
Other and unspecified male genital organs | 1 |
Note that in our !=
results, there are 1998 "Uterus, NOS" samples. These don't appear in our =
search because "Uterus, NOS" is not exactly "Uterus".
There are several ways to change our search to get both "Uterus" and "Uterus, NOS", and which we choose will depend on both our interests, and on how different the terms are that we care about.
OR¶
If we have a small enough number of search criteria to reliably type them out, we can use the OR operator to combine results. In an OR
query, each data point only needs to meet a single piece of criteria to be returned, this makes OR
good for early, broad searches. It increases the amount of data returned.
OR
can be used both inside a Q statement:
Q('ResearchSubject.primary_diagnosis_site = "uterus" OR ResearchSubject.primary_diagnosis_site = "uterus, NOS"').researchsubject.count.run()
Getting results from database
Total execution time: 3414 ms
total : 2865
files : 257140
system | count |
---|---|
PDC | 104 |
GDC | 1894 |
IDC | 867 |
primary_diagnosis_condition | count |
---|---|
Complex Mixed and Stromal Neoplasms | 294 |
Uterine Corpus Endometrial Carcinoma | 104 |
Adenomas and Adenocarcinomas | 1037 |
None | 867 |
Myomatous Neoplasms | 183 |
Cystic, Mucinous and Serous Neoplasms | 313 |
Not Reported | 12 |
Soft Tissue Tumors and Sarcomas, NOS | 14 |
Complex Epithelial Neoplasms | 2 |
Epithelial Neoplasms, NOS | 20 |
Mesonephromas | 2 |
Trophoblastic neoplasms | 13 |
Neoplasms, NOS | 3 |
Neuroepitheliomatous Neoplasms | 1 |
primary_diagnosis_site | count |
---|---|
Uterus, NOS | 1998 |
Uterus | 867 |
and to combine 2 or more Q statements:
Query1 = Q('ResearchSubject.primary_diagnosis_site = "uterus, NOS"')
Query2 = Q('ResearchSubject.primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"')
Query1.OR(Query2).researchsubject.count.run()
Getting results from database
Total execution time: 3572 ms
total : 1998
files : 14778
system | count |
---|---|
PDC | 104 |
GDC | 1894 |
primary_diagnosis_condition | count |
---|---|
Myomatous Neoplasms | 183 |
Uterine Corpus Endometrial Carcinoma | 104 |
Cystic, Mucinous and Serous Neoplasms | 313 |
Adenomas and Adenocarcinomas | 1037 |
Complex Mixed and Stromal Neoplasms | 294 |
Not Reported | 12 |
Epithelial Neoplasms, NOS | 20 |
Soft Tissue Tumors and Sarcomas, NOS | 14 |
Trophoblastic neoplasms | 13 |
Neoplasms, NOS | 3 |
Complex Epithelial Neoplasms | 2 |
Mesonephromas | 2 |
Neuroepitheliomatous Neoplasms | 1 |
primary_diagnosis_site | count |
---|---|
Uterus, NOS | 1998 |
For each OR
you must specify both the search term ("uterus") and where to find the term ("ResearchSubject.primary_diagnosis_site"). This means that the OR
operator is flexible enough to run searches across columns, or even across endpoints.
AND¶
Like OR
, AND
can be used both inside a Q statement, and to join multiple Q statements. AND
requires that both statements be true simultanously for each returned bit of data. This makes AND
good for filtering down results. It decreases the amount of data returned.
If we reuse the OR
examples above, the first one will have no results, because primary_diagnosis_site can have only one value, so it can never be both "uterus" and "uterus, NOS":
Q('ResearchSubject.primary_diagnosis_site = "uterus" AND ResearchSubject.primary_diagnosis_site = "uterus, NOS"').researchsubject.count.run()
Getting results from database
Total execution time: 3602 ms
total : 0
files : 0
However, for searches where you are interested in subsetting multiple columns, AND
can help you to quickly filter to only the set you want. Note that AND
can be used both inside a Q
statement, and to add multiple Q
statements together:
Q('ResearchSubject.primary_diagnosis_site = "uterus, NOS" AND ResearchSubject.primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"').researchsubject.count.run()
Getting results from database
Total execution time: 3538 ms
total : 104
files : 2560
system | count |
---|---|
PDC | 104 |
primary_diagnosis_condition | count |
---|---|
Uterine Corpus Endometrial Carcinoma | 104 |
primary_diagnosis_site | count |
---|---|
Uterus, NOS | 104 |
Query1 = Q('ResearchSubject.primary_diagnosis_site = "uterus, NOS"')
Query2 = Q('ResearchSubject.primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"')
Query1.AND(Query2).researchsubject.count.run()
Getting results from database
Total execution time: 3401 ms
total : 104
files : 2560
system | count |
---|---|
PDC | 104 |
primary_diagnosis_condition | count |
---|---|
Uterine Corpus Endometrial Carcinoma | 104 |
primary_diagnosis_site | count |
---|---|
Uterus, NOS | 104 |
IN
and NOT IN
¶
For instances where you have many search terms, it may be easier (and more readable) to use IN
. With IN
you make a list of all the terms you are interested in, and ask whether they are IN
a given field:
Q('ResearchSubject.primary_diagnosis_site IN ("uterus, NOS", "uterus", "Cervix", "Cervix uteri")').researchsubject.count.run()
Getting results from database
Total execution time: 3463 ms
total : 4087
files : 299493
system | count |
---|---|
PDC | 104 |
GDC | 2809 |
IDC | 1174 |
primary_diagnosis_condition | count |
---|---|
Uterine Corpus Endometrial Carcinoma | 104 |
Myomatous Neoplasms | 183 |
Adenomas and Adenocarcinomas | 1264 |
Squamous Cell Neoplasms | 609 |
Cystic, Mucinous and Serous Neoplasms | 348 |
Complex Mixed and Stromal Neoplasms | 294 |
Complex Epithelial Neoplasms | 27 |
None | 1175 |
Not Reported | 12 |
Epithelial Neoplasms, NOS | 26 |
Soft Tissue Tumors and Sarcomas, NOS | 14 |
Neoplasms, NOS | 12 |
Trophoblastic neoplasms | 13 |
Mesonephromas | 5 |
Neuroepitheliomatous Neoplasms | 1 |
primary_diagnosis_site | count |
---|---|
Cervix uteri | 915 |
Uterus, NOS | 1998 |
Uterus | 867 |
Cervix | 307 |
The equivilent request without IN
would require a large number of OR
statements. (The triple quotes surrounding this example are to allow a multi-line Q statement):
Q("""ResearchSubject.primary_diagnosis_site = "uterus, NOS" OR
ResearchSubject.primary_diagnosis_site = "uterus" OR
ResearchSubject.primary_diagnosis_site = "Cervix" OR
ResearchSubject.primary_diagnosis_site = "Cervix uteri" """).researchsubject.count.run()
NOT IN
is the opposite of IN
, and so gives the inverse results. If we add NOT
to our above query, we get all the researchsubjecst who's primary_diagnosis_site was not in our list:
Q('ResearchSubject.primary_diagnosis_site NOT IN ("uterus, NOS", "uterus", "Cervix", "Cervix uteri")').researchsubject.count.run()
Getting results from database
Total execution time: 3551 ms
total : 145611
files : 39097313
system | count |
---|---|
GDC | 82607 |
IDC | 60774 |
PDC | 2230 |
primary_diagnosis_condition | count |
---|---|
Adenomas and Adenocarcinomas | 31466 |
Other | 206 |
Ductal and Lobular Neoplasms | 7870 |
None | 60775 |
Plasma Cell Tumors | 1066 |
Lymphoid Leukemias | 2072 |
Myeloid Leukemias | 3965 |
Nevi and Melanomas | 3155 |
Pancreatic Ductal Adenocarcinoma | 144 |
Neuroepitheliomatous Neoplasms | 1331 |
Myomatous Neoplasms | 449 |
Squamous Cell Neoplasms | 4467 |
Epithelial Neoplasms, NOS | 5668 |
Cystic, Mucinous and Serous Neoplasms | 3375 |
Transitional Cell Papillomas and Carcinomas | 1885 |
Miscellaneous Tumors | 89 |
Gliomas | 4772 |
Mesothelial Neoplasms | 647 |
Glioblastoma | 100 |
Fibromatous Neoplasms | 322 |
Meningiomas | 289 |
Thymic Epithelial Neoplasms | 262 |
Clear Cell Renal Cell Carcinoma | 116 |
Complex Mixed and Stromal Neoplasms | 1532 |
Mature B-Cell Lymphomas | 1019 |
Germ Cell Neoplasms | 703 |
Lung Squamous Cell Carcinoma | 118 |
Lung Adenocarcinoma | 216 |
Colon Adenocarcinoma | 164 |
Nerve Sheath Tumors | 115 |
Complex Epithelial Neoplasms | 227 |
Lipomatous Neoplasms | 343 |
Breast Invasive Carcinoma | 251 |
Osseous and Chondromatous Neoplasms | 615 |
Oral Squamous Cell Carcinoma | 38 |
Paragangliomas and Glomus Tumors | 241 |
Ovarian Serous Cystadenocarcinoma | 283 |
Soft Tissue Tumors and Sarcomas, NOS | 301 |
Hepatocellular Carcinoma | 170 |
Acinar Cell Neoplasms | 300 |
Myelodysplastic Syndromes | 386 |
Not Applicable | 440 |
Rectum Adenocarcinoma | 30 |
Leukemias, NOS | 118 |
Pediatric/AYA Brain Tumors | 199 |
Adnexal and Skin Appendage Neoplasms | 58 |
Not Reported | 259 |
Head and Neck Squamous Cell Carcinoma | 110 |
Early Onset Gastric Cancer | 80 |
Blood Vessel Tumors | 156 |
Chronic Myeloproliferative Disorders | 476 |
Neoplasms, NOS | 1239 |
Miscellaneous Bone Tumors | 130 |
Basal Cell Neoplasms | 45 |
Specialized Gonadal Neoplasms | 124 |
Unknown | 63 |
Synovial-like Neoplasms | 98 |
Neoplasms of Histiocytes and Accessory Lymphoid Cells | 66 |
Mucoepidermoid Neoplasms | 60 |
Malignant Lymphomas, NOS or Diffuse | 42 |
Other Leukemias | 68 |
Fibroepithelial Neoplasms | 25 |
Mature T- and NK-Cell Lymphomas | 94 |
Other Hematologic Disorders | 20 |
Granular Cell Tumors and Alveolar Soft Part Sarcomas | 23 |
Hodgkin Lymphoma | 11 |
Myxomatous Neoplasms | 18 |
Acute Myeloid Leukemia | 2 |
Chromophobe Renal Cell Carcinoma | 1 |
Trophoblastic neoplasms | 8 |
Mast Cell Tumors | 10 |
Giant Cell Tumors | 3 |
Immunoproliferative Diseases | 4 |
Odontogenic Tumors | 3 |
Precursor Cell Lymphoblastic Lymphoma | 12 |
Lymphatic Vessel Tumors | 1 |
Papillary Renal Cell Carcinoma | 2 |
primary_diagnosis_site | count |
---|---|
Ovary | 4346 |
Breast | 21945 |
Hematopoietic and reticuloendothelial systems | 9007 |
Adrenal gland | 851 |
Eye and adnexa | 222 |
Brain | 2923 |
Retroperitoneum and peritoneum | 384 |
Liver and intrahepatic bile ducts | 1609 |
Corpus uteri | 780 |
Bronchus and lung | 12256 |
Kidney | 4788 |
Spinal cord, cranial nerves, and other parts of central nervous system | 3703 |
Heart, mediastinum, and pleura | 706 |
Head-Neck | 2704 |
Prostate gland | 2354 |
Colon | 8559 |
Other and unspecified parts of mouth | 43 |
Other and unspecified female genital organs | 160 |
Larynx | 169 |
Not Reported | 506 |
Rectosigmoid junction | 81 |
Pancreas | 3352 |
Peripheral nerves and autonomic nervous system | 418 |
Thyroid gland | 1880 |
Esophagus | 1580 |
Chest | 28221 |
Lung | 4848 |
Connective, subcutaneous and other soft tissues | 1573 |
Unknown | 3232 |
Thymus | 431 |
Bladder | 2155 |
Head and Neck | 148 |
Skin | 3497 |
Tonsil | 46 |
Testis | 542 |
Other and ill-defined digestive organs | 719 |
Floor of mouth | 56 |
Abdomen | 92 |
Other and ill-defined sites | 1186 |
Stomach | 1870 |
Rectum | 1308 |
Other and unspecified parts of tongue | 133 |
Palate | 5 |
Bones, joints and articular cartilage of other and unspecified sites | 455 |
Bones, joints and articular cartilage of limbs | 268 |
Base of tongue | 24 |
Meninges | 243 |
Lymph nodes | 538 |
Liver | 622 |
Abdomen, Mediastinum | 176 |
Various (11 locations) | 89 |
Marrow, Blood | 89 |
Head | 105 |
Prostate | 2139 |
Lung Phantom | 8 |
Various | 449 |
Pelvis, Prostate, Anus | 58 |
Phantom | 33 |
Extremities | 51 |
Testicles | 150 |
Thyroid | 507 |
Chest-Abdomen-Pelvis, Leg, TSpine | 261 |
Adrenal Glands | 271 |
Bile Duct | 51 |
Ear | 242 |
Other and ill-defined sites in lip, oral cavity and pharynx | 361 |
Nasopharynx | 101 |
Other and unspecified major salivary glands | 615 |
Anus and anal canal | 233 |
Other and unspecified parts of biliary tract | 224 |
Other endocrine glands and related structures | 181 |
Oropharynx | 194 |
Small intestine | 269 |
Gum | 11 |
Mesothelium | 87 |
None | 340 |
Gallbladder | 265 |
Other and unspecified urinary organs | 217 |
Intraocular | 80 |
Vagina | 72 |
Ureter | 15 |
Penis | 33 |
Hypopharynx | 25 |
Trachea | 7 |
Nasal cavity and middle ear | 40 |
Lip | 9 |
Other and unspecified male genital organs | 1 |
Vulva | 10 |
Pancreas | 1 |
Renal pelvis | 1 |
Other and ill-defined sites within respiratory system and intrathoracic organs | 2 |
%
pattern matching¶
While OR
is useful for situations with only a few options, in some cases there are many terms that all have similar names, and it would be error prone to type out every variant. For instance, if we filter the unique terms in "ResearchSubject.primary_diagnosis_site" to everything with "uter" we get:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="uter")
['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']
The %
operator acts as a wildcard, and lets you run a query similar to the filter function in unique_terms:
Q('ResearchSubject.primary_diagnosis_site = "uter%"').researchsubject.count.run()
Getting results from database
Total execution time: 3471 ms
total : 2865
files : 257140
system | count |
---|---|
GDC | 1894 |
PDC | 104 |
IDC | 867 |
primary_diagnosis_condition | count |
---|---|
Adenomas and Adenocarcinomas | 1037 |
Uterine Corpus Endometrial Carcinoma | 104 |
Not Reported | 12 |
Complex Mixed and Stromal Neoplasms | 294 |
Myomatous Neoplasms | 183 |
None | 867 |
Cystic, Mucinous and Serous Neoplasms | 313 |
Trophoblastic neoplasms | 13 |
Epithelial Neoplasms, NOS | 20 |
Mesonephromas | 2 |
Soft Tissue Tumors and Sarcomas, NOS | 14 |
Complex Epithelial Neoplasms | 2 |
Neoplasms, NOS | 3 |
Neuroepitheliomatous Neoplasms | 1 |
primary_diagnosis_site | count |
---|---|
Uterus, NOS | 1998 |
Uterus | 867 |
Because the %
is at the end of "uter" this query returns anything that starts with "uter", depending on your question, you may want to move the %
, or add more of them:
Q('ResearchSubject.primary_diagnosis_site = "%uter"').researchsubject.count.run()
Getting results from database
Total execution time: 3495 ms
total : 0
files : 0
Q('ResearchSubject.primary_diagnosis_site = "%uter%"').researchsubject.count.run()
Getting results from database
Total execution time: 3465 ms
total : 4560
files : 302149
system | count |
---|---|
GDC | 3589 |
PDC | 104 |
IDC | 867 |
primary_diagnosis_condition | count |
---|---|
Adenomas and Adenocarcinomas | 1671 |
Cystic, Mucinous and Serous Neoplasms | 487 |
Squamous Cell Neoplasms | 609 |
Uterine Corpus Endometrial Carcinoma | 104 |
Myomatous Neoplasms | 187 |
Complex Mixed and Stromal Neoplasms | 320 |
None | 868 |
Complex Epithelial Neoplasms | 27 |
Epithelial Neoplasms, NOS | 230 |
Not Reported | 12 |
Trophoblastic neoplasms | 13 |
Neoplasms, NOS | 12 |
Soft Tissue Tumors and Sarcomas, NOS | 14 |
Mesonephromas | 5 |
Neuroepitheliomatous Neoplasms | 1 |
primary_diagnosis_site | count |
---|---|
Corpus uteri | 780 |
Cervix uteri | 915 |
Uterus, NOS | 1998 |
Uterus | 867 |
There may be cases in which you want to filter out all of the data with some partial word in it, in which case, you can combine %
with !=
:
Q('sex != "f%"').subject.count.run()
Getting results from database
Total execution time: 3538 ms
total : 91361
files : 37815169
system | count |
---|---|
IDC | 56038 |
GDC | 40024 |
PDC | 1089 |
sex | count |
---|---|
not reported | 266 |
None | 51216 |
male | 39793 |
unknown | 81 |
unspecified | 5 |
race | count |
---|---|
None | 51216 |
white | 23406 |
chinese | 65 |
asian | 1348 |
black or african american | 1815 |
not reported | 9881 |
not allowed to collect | 1106 |
other | 415 |
Unknown | 2027 |
american indian or alaska native | 56 |
native hawaiian or other pacific islander | 26 |
ethnicity | count |
---|---|
not reported | 11796 |
None | 51216 |
not hispanic or latino | 23020 |
Unknown | 2293 |
hispanic or latino | 1450 |
not allowed to collect | 1586 |
cause_of_death | count |
---|---|
None | 90714 |
Not Reported | 335 |
Metastasis | 1 |
Infection | 3 |
Cancer Related | 198 |
Unknown | 22 |
Not Cancer Related | 76 |
HCC recurrence | 5 |
Surgical Complications | 3 |
Cardiovascular Disorder, NOS | 3 |
Cerebral Hemorrhage | 1 |
IS and IS NOT¶
In computing, lack of data is often treated as a special case. In the CDA, values listed as "None" are actually null
, that is, the field is empty. In order to search for emptiness, you need to use the special function IS
:
Q('ResearchSubject.primary_diagnosis_condition IS null').researchsubject.count.run()
Getting results from database
Total execution time: 3468 ms
total : 61950
files : 38475829
system | count |
---|---|
IDC | 61948 |
GDC | 2 |
primary_diagnosis_condition | count |
---|---|
None | 61950 |
primary_diagnosis_site | count |
---|---|
Breast | 12587 |
Head-Neck | 2704 |
Colon | 1491 |
Chest | 28221 |
Lung | 4728 |
Abdomen | 92 |
Various | 449 |
Brain | 1165 |
Abdomen, Mediastinum | 176 |
Thymus | 125 |
Bladder | 431 |
Kidney | 1373 |
Uterus | 867 |
Various (11 locations) | 89 |
Pancreas | 481 |
Skin | 612 |
Head | 105 |
Liver | 452 |
Prostate | 2139 |
Pelvis, Prostate, Anus | 58 |
Phantom | 33 |
Extremities | 51 |
Ovary | 664 |
Testicles | 150 |
Bile Duct | 51 |
Rectum | 171 |
Thyroid | 507 |
Stomach | 443 |
Chest-Abdomen-Pelvis, Leg, TSpine | 261 |
Mesothelium | 87 |
Adrenal Glands | 271 |
Esophagus | 187 |
Cervix | 307 |
Ear | 242 |
Marrow, Blood | 89 |
Intraocular | 80 |
Cervix uteri | 1 |
Lung Phantom | 8 |
Pancreas | 1 |
None | 1 |
Probably more common, is to want to filter out the empty fields, in which case you use its companion function IS_NOT
:
Q('sex IS NOT null').subject.count.run()
Getting results from database
Total execution time: 3546 ms
total : 85654
files : 3487816
system | count |
---|---|
GDC | 84979 |
IDC | 11004 |
PDC | 2231 |
sex | count |
---|---|
female | 45509 |
not reported | 266 |
male | 39793 |
unknown | 81 |
unspecified | 5 |
race | count |
---|---|
black or african american | 4567 |
white | 49069 |
not reported | 21816 |
chinese | 90 |
asian | 2951 |
Unknown | 3985 |
native hawaiian or other pacific islander | 55 |
other | 947 |
american indian or alaska native | 116 |
not allowed to collect | 2058 |
ethnicity | count |
---|---|
not hispanic or latino | 48382 |
not reported | 26034 |
Unknown | 4455 |
hispanic or latino | 3131 |
not allowed to collect | 3652 |
cause_of_death | count |
---|---|
Not Reported | 797 |
None | 84257 |
HCC recurrence | 7 |
Cancer Related | 336 |
Unknown | 131 |
Infection | 7 |
Not Cancer Related | 107 |
Surgical Complications | 4 |
Cardiovascular Disorder, NOS | 4 |
Cancer cell proliferation | 1 |
Metastasis | 2 |
Cerebral Hemorrhage | 1 |
Greater and Less than¶
While all of the above can also be used to search for numbers, there are four operators that only work for numerical values:
>
: Greater than<
: Less than>=
: Greater than or Equal to<=
: Less than or Equal to
These can all be used in place of the =
sign in queries where you are filtering by a numeric value. In this search, we find all the subjects who were over 50 years old when they entered the study. As the study entry date is day 0, days_to_birth
is reported as a negative number:
Q('days_to_birth < 50*-365 ').subject.run().to_dataframe()
Getting results from database
Total execution time: 3562 ms
id | identifier | species | sex | race | ethnicity | days_to_birth | subject_associated_project | vital_status | age_at_death | cause_of_death | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | C3L-02170 | [{'system': 'GDC', 'value': 'C3L-02170'}, {'sy... | Homo sapiens | male | not reported | not reported | -26717 | [CPTAC3-Discovery, CPTAC-3, cptac_lscc] | Alive | NaN | None |
1 | C3L-02704 | [{'system': 'GDC', 'value': 'C3L-02704'}, {'sy... | Homo sapiens | male | white | not hispanic or latino | -19879 | [cptac_gbm, CPTAC3-Discovery, CPTAC-3] | Alive | NaN | None |
2 | C3N-01214 | [{'system': 'GDC', 'value': 'C3N-01214'}, {'sy... | Homo sapiens | male | not reported | not reported | -22241 | [CPTAC3-Discovery, CPTAC-3, cptac_ccrcc] | Alive | NaN | Not Reported |
3 | C3N-01998 | [{'system': 'GDC', 'value': 'C3N-01998'}, {'sy... | Homo sapiens | male | not reported | not reported | -25293 | [cptac_pda, CPTAC3-Discovery, CPTAC-3] | Dead | 186.0 | None |
4 | C3N-02339 | [{'system': 'GDC', 'value': 'C3N-02339'}, {'sy... | Homo sapiens | male | not reported | not reported | -28848 | [CPTAC3-Discovery, CPTAC-3, cptac_lscc] | Alive | NaN | None |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | GENIE-DFCI-027361 | [{'system': 'GDC', 'value': 'GENIE-DFCI-027361'}] | Homo sapiens | male | white | not hispanic or latino | -20819 | [GENIE-DFCI] | Not Reported | NaN | None |
96 | GENIE-DFCI-033908 | [{'system': 'GDC', 'value': 'GENIE-DFCI-033908'}] | Homo sapiens | female | white | not hispanic or latino | -19723 | [GENIE-DFCI] | Not Reported | NaN | None |
97 | GENIE-DFCI-035602 | [{'system': 'GDC', 'value': 'GENIE-DFCI-035602'}] | Homo sapiens | female | white | not hispanic or latino | -19358 | [GENIE-DFCI] | Not Reported | NaN | None |
98 | GENIE-DFCI-035797 | [{'system': 'GDC', 'value': 'GENIE-DFCI-035797'}] | Homo sapiens | male | white | not hispanic or latino | -20088 | [GENIE-DFCI] | Not Reported | NaN | None |
99 | GENIE-DFCI-036506 | [{'system': 'GDC', 'value': 'GENIE-DFCI-036506'}] | Homo sapiens | male | asian | not hispanic or latino | -19723 | [GENIE-DFCI] | Not Reported | NaN | None |
100 rows × 11 columns