Data
The data module gives access to a set of publicly available whole-slide images (WSIs) stained with different techniques (H&E and IHC). The slides in the data module are retrieved from the following repositories:
The Cancer Genome Atlas (TCGA): as detailed in each function's docstring, every WSI is retrieved from the URL pointing to its location within the portal, e.g. https://portal.gdc.cancer.gov/files/9c960533-2e58-4e54-97b2-8454dfb4b8c8;
OpenSlide, a repository of freely-distributed test slides from different scanner vendors;
Image Data Resource (IDR): the WSIs are selected from the data collection provided by Schaadt et al. [1] and available at IDR under the accession number idr0073.
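The TCGA entries documented in this module list two URLs that share the same file identifier: a portal page and an API download endpoint. The sketch below only illustrates that relationship. The helper name `gdc_urls` is hypothetical and not part of histolab; the two URL prefixes are copied from the docstrings on this page.

```python
import uuid

# URL prefixes as they appear in the docstrings of the TCGA loaders.
GDC_PORTAL_PREFIX = "https://portal.gdc.cancer.gov/files/"
GDC_API_PREFIX = "https://api.gdc.cancer.gov/data/"


def gdc_urls(file_uuid: str) -> dict:
    """Build the portal page and API download URL for a file UUID.

    Hypothetical helper for illustration; raises ValueError if the
    identifier is not a well-formed UUID.
    """
    normalized = str(uuid.UUID(file_uuid))  # validates and lower-cases the UUID
    return {
        "portal": GDC_PORTAL_PREFIX + normalized,
        "api": GDC_API_PREFIX + normalized,
    }


# UUID taken from the breast_tissue() entry on this page.
urls = gdc_urls("ad9ed74a-2725-49e6-bf7a-ef100e299989")
print(urls["portal"])
print(urls["api"])
```

The portal URL is meant for browsing the file's metadata, while the API URL is the one a download tool would fetch.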
Note
We use Pooch under the hood, which is an optional requirement for histolab and needs to be installed separately with:
pip install pooch
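Because Pooch is optional, it can be useful to check for it before calling any loader in this module. The following is a minimal sketch using only the standard library; the function name `pooch_available` is an assumption made for illustration and is not provided by histolab.

```python
import importlib.util


def pooch_available() -> bool:
    """Return True if the optional `pooch` package can be found on the import path."""
    return importlib.util.find_spec("pooch") is not None


if pooch_available():
    print("pooch is installed")
else:
    print("pooch is missing; install it with: pip install pooch")
```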
| Tissue | Dimensions (wxh) | Size (MB) | Repository | Staining |
|---|---|---|---|---|
| | 15374x17497 | 63.8 | OpenSlide | H&E |
| | 2220x2967 | 1.8 | OpenSlide | H&E |
| | 96972x30682 | 299.1 | TCGA-BRCA | H&E |
| | 121856x94697 | 1740.8 | TCGA-BRCA | H&E |
| | 98874x64427 | 719.6 | TCGA-BRCA | H&E |
| | 60928x75840 | 510.9 | TCGA-BRCA | H&E |
| | 99606x7121 | 218.3 | IDR | IHC |
| | 32672x47076 | 289.3 | OpenSlide | H&E |
| | 5179x4192 | 66.1 | IDR | IHC |
| | 30001x33987 | 389.1 | TCGA-OV | H&E |
| | 16000x15316 | 46.1 | TCGA-PRAD | H&E |
TCGA-BRCA: TCGA Breast Invasive Carcinoma dataset; TCGA-PRAD: TCGA Prostate Adenocarcinoma dataset; TCGA-OV: TCGA Ovarian Serous Cystadenocarcinoma dataset.
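The sizes in the table vary by three orders of magnitude, so it can help to pick a sample that fits a download budget. The sketch below transcribes the table rows into records and filters them by size. The `SampleSlide` type and the helper names are hypothetical, introduced only for this illustration, and the 100 MB budget is an arbitrary choice.

```python
from typing import List, NamedTuple


class SampleSlide(NamedTuple):
    dimensions: str   # "<width>x<height>" in pixels, as listed in the table
    size_mb: float    # file size in MB
    repository: str
    staining: str


# Rows transcribed from the table above, in the same order.
SAMPLES = [
    SampleSlide("15374x17497", 63.8, "OpenSlide", "H&E"),
    SampleSlide("2220x2967", 1.8, "OpenSlide", "H&E"),
    SampleSlide("96972x30682", 299.1, "TCGA-BRCA", "H&E"),
    SampleSlide("121856x94697", 1740.8, "TCGA-BRCA", "H&E"),
    SampleSlide("98874x64427", 719.6, "TCGA-BRCA", "H&E"),
    SampleSlide("60928x75840", 510.9, "TCGA-BRCA", "H&E"),
    SampleSlide("99606x7121", 218.3, "IDR", "IHC"),
    SampleSlide("32672x47076", 289.3, "OpenSlide", "H&E"),
    SampleSlide("5179x4192", 66.1, "IDR", "IHC"),
    SampleSlide("30001x33987", 389.1, "TCGA-OV", "H&E"),
    SampleSlide("16000x15316", 46.1, "TCGA-PRAD", "H&E"),
]


def pixel_count(dimensions: str) -> int:
    """Parse a 'WIDTHxHEIGHT' string and return width * height."""
    width, height = (int(value) for value in dimensions.split("x"))
    return width * height


def under_budget(samples: List[SampleSlide], max_mb: float) -> List[SampleSlide]:
    """Return the samples whose file size does not exceed max_mb."""
    return [sample for sample in samples if sample.size_mb <= max_mb]


small = under_budget(SAMPLES, 100.0)
total_mb = sum(sample.size_mb for sample in SAMPLES)

for sample in small:
    print(sample.dimensions, sample.size_mb, sample.repository, pixel_count(sample.dimensions))
print(f"all samples together: {total_mb:.1f} MB")
```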

- aorta_tissue() → Tuple[openslide.OpenSlide, str]
Aorta tissue, brightfield, JPEG 2000, YCbCr
This image is available here http://openslide.cs.cmu.edu/download/openslide-testdata/Aperio/
Free to use and distribute, with or without modification
- Returns
aorta_tissue (openslide.OpenSlide) – H&E-stained Whole-Slide-Image of aortic tissue.
path (str) – Path where the slide is saved
- Return type
Tuple[openslide.OpenSlide, str]
- breast_tissue() → Tuple[openslide.OpenSlide, str]
Breast tissue, TCGA-BRCA dataset.
This image is available here https://portal.gdc.cancer.gov/files/ad9ed74a-2725-49e6-bf7a-ef100e299989 or through the API https://api.gdc.cancer.gov/data/ad9ed74a-2725-49e6-bf7a-ef100e299989
It corresponds to TCGA file TCGA-A8-A082-01A-01-TS1.3cad4a77-47a6-4658-becf-d8cffa161d3a.svs
Access: open
- Returns
breast_tissue (openslide.OpenSlide) – H&E-stained Whole-Slide-Image of breast tissue.
path (str) – Path where the slide is saved
- Return type
Tuple[openslide.OpenSlide, str]
- breast_tissue_diagnostic_black_pen() → Tuple[openslide.OpenSlide, str]
Breast tissue, TCGA-BRCA dataset. Diagnostic slide with black pen marks.
This image is available here https://portal.gdc.cancer.gov/files/e70c89a5-1c2f-43f8-b6be-589beea55338 or through the API https://api.gdc.cancer.gov/data/e70c89a5-1c2f-43f8-b6be-589beea55338
It corresponds to TCGA file TCGA-BH-A201-01Z-00-DX1.6D6E3224-50A0-45A2-B231-EEF27CA7EFD2.svs
Access: open
- Returns
breast_tissue (openslide.OpenSlide) – H&E-stained Whole-Slide-Image of breast tissue with black pen marks.
path (str) – Path where the slide is saved
- Return type
Tuple[openslide.OpenSlide, str]
- breast_tissue_diagnostic_green_pen() → Tuple[openslide.OpenSlide, str]
Breast tissue, TCGA-BRCA dataset. Diagnostic slide with green pen marks.
This image is available here https://portal.gdc.cancer.gov/files/3845b8bd-cbe0-49cf-a418-a8120f6c23db or through the API https://api.gdc.cancer.gov/data/3845b8bd-cbe0-49cf-a418-a8120f6c23db
It corresponds to TCGA file TCGA-A1-A0SH-01Z-00-DX1.90E71B08-E1D9-4FC2-85AC-062E56DDF17C.svs
Access: open
- Returns
breast_tissue (openslide.OpenSlide) – H&E-stained Whole-Slide-Image of breast tissue with green pen marks.
path (str) – Path where the slide is saved
- Return type
Tuple[openslide.OpenSlide, str]
- breast_tissue_diagnostic_red_pen() → Tuple[openslide.OpenSlide, str]
Breast tissue, TCGA-BRCA dataset. Diagnostic slide with red pen marks.
This image is available here https://portal.gdc.cancer.gov/files/682e4d74-2200-4f34-9e96-8dee968b1568 or through the API https://api.gdc.cancer.gov/data/682e4d74-2200-4f34-9e96-8dee968b1568
It corresponds to TCGA file TCGA-E9-A24A-01Z-00-DX1.F0342837-5750-4172-B60D-5F902E2A02FD.svs
Access: open
- Returns
breast_tissue (openslide.OpenSlide) – H&E-stained Whole-Slide-Image of breast tissue with red pen marks.
path (str) – Path where the slide is saved
- Return type
Tuple[openslide.OpenSlide, str]
- cmu_small_region() → Tuple[openslide.OpenSlide, str]
Carnegie Mellon University MRXS sample tissue
This image is available here http://openslide.cs.cmu.edu/download/openslide-testdata/Aperio/
Licensed under a CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.
- Returns
cmu_mrxs_tissue (openslide.OpenSlide) – H&E-stained Whole-Slide-Image of small tissue region.
path (str) – Path where the slide is saved
- Return type
Tuple[openslide.OpenSlide, str]
- file_hash(fname, alg='sha256')
Calculate the hash of a given file.
Useful for checking if a file has changed or been corrupted.
- Parameters
fname (str) – The name of the file.
alg (str) – The hashing algorithm to use (default: 'sha256').
- Returns
hash – The hash of the file.
- Return type
str
Examples
>>> fname = "test-file-for-hash.txt"
>>> with open(fname, "w") as f:
...     __ = f.write("content of the file")
>>> print(file_hash(fname))
0fc74468e6a9a829f103d069aeb2bb4f8646bad58bf146bb0e3379b759ec4a00
>>> import os
>>> os.remove(fname)
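To make the idea behind the hash check concrete, here is a minimal standard-library sketch that computes a SHA-256 digest of a file in chunks. It is not the implementation used by histolab or Pooch; the function name `sha256_of_file` and the chunk size are assumptions chosen for illustration.

```python
import hashlib
import os
import tempfile


def sha256_of_file(fname: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, reading it in fixed-size chunks."""
    hasher = hashlib.sha256()
    with open(fname, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Write a small temporary file, hash it, then clean up.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"content of the file")
    tmp_path = tmp.name

digest = sha256_of_file(tmp_path)
os.remove(tmp_path)
print(digest)
```

Reading in chunks keeps memory use constant, which matters for slides that are hundreds of megabytes in size.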
- heart_tissue() → Tuple[openslide.OpenSlide, str]
Heart tissue, brightfield, JPEG 2000, YCbCr
This image is available here http://openslide.cs.cmu.edu/download/openslide-testdata/Aperio/
Free to use and distribute, with or without modification
- Returns
heart_tissue (openslide.OpenSlide) – H&E-stained Whole-Slide-Image of heart tissue.
path (str) – Path where the slide is saved
- Return type
Tuple[openslide.OpenSlide, str]
- ihc_breast() → Tuple[openslide.OpenSlide, str]
Breast cancer resection, staining CD3 (brown) and CD20 (red).
This image is available here https://idr.openmicroscopy.org/ under accession number idr0073, ID breastCancer12.
- Returns
ihc_breast (openslide.OpenSlide) – IHC-stained Whole-Slide-Image of breast tissue.
path (str) – Path where the slide is saved
- Return type
Tuple[openslide.OpenSlide, str]
- ihc_kidney() → Tuple[openslide.OpenSlide, str]
Renal allograft, staining CD3 (brown) and CD20 (red).
This image is available here https://idr.openmicroscopy.org/ under accession number idr0073, ID kidney_46_4.
- Returns
ihc_kidney (openslide.OpenSlide) – IHC-stained Whole-Slide-Image of kidney tissue.
path (str) – Path where the slide is saved
- Return type
Tuple[openslide.OpenSlide, str]
- ovarian_tissue() → Tuple[openslide.OpenSlide, str]
Tissue of Ovarian Serous Cystadenocarcinoma, TCGA-OV dataset.
This image is available here https://portal.gdc.cancer.gov/files/e968375e-ef58-4607-b457-e6818b2e8431 or through the API https://api.gdc.cancer.gov/data/e968375e-ef58-4607-b457-e6818b2e8431
It corresponds to TCGA file TCGA-13-1404-01A-01-TS1.cecf7044-1d29-4d14-b137-821f8d48881e.svs
Access: open
- Returns
ovarian_tissue (openslide.OpenSlide) – H&E-stained Whole-Slide-Image of ovarian tissue.
path (str) – Path where the slide is saved
- Return type
Tuple[openslide.OpenSlide, str]
- prostate_tissue() → Tuple[openslide.OpenSlide, str]
Tissue of Prostate Adenocarcinoma, TCGA-PRAD dataset.
This image is available here https://portal.gdc.cancer.gov/files/5a8ce04a-0178-49e2-904c-30e21fb4e41e or through the API https://api.gdc.cancer.gov/data/5a8ce04a-0178-49e2-904c-30e21fb4e41e
It corresponds to TCGA file TCGA-CH-5753-01A-01-BS1.4311c533-f9c1-4c6f-8b10-922daa3c2e3e.svs
Access: open
- Returns
prostate_tissue (openslide.OpenSlide) – H&E-stained Whole-Slide-Image of prostate tissue.
path (str) – Path where the slide is saved
- Return type
Tuple[openslide.OpenSlide, str]
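All slide loaders above share the same return shape: a tuple of an `openslide.OpenSlide` object and the local path of the saved file. The sketch below shows one way to unpack that tuple. `describe_sample` is a hypothetical helper, and the stand-in slide class exists only so the example runs without histolab, Pooch, or network access; its `level_count` value is a placeholder.

```python
from typing import Callable, Tuple


def describe_sample(loader: Callable[[], Tuple[object, str]]) -> dict:
    """Call a loader returning (slide, path) and summarise the result.

    Relies on two OpenSlide attributes: `dimensions`, the level-0
    (width, height) in pixels, and `level_count`.
    """
    slide, path = loader()
    width, height = slide.dimensions
    return {
        "path": path,
        "width": width,
        "height": height,
        "levels": slide.level_count,
    }


class _FakeSlide:
    """Stand-in exposing the two OpenSlide attributes used above."""

    dimensions = (2220, 2967)
    level_count = 1  # placeholder value, not taken from a real slide


def _fake_loader() -> Tuple[_FakeSlide, str]:
    return _FakeSlide(), "/tmp/fake-slide.svs"


info = describe_sample(_fake_loader)
print(info)

# With histolab, pooch and network access available, the same helper could
# be applied to a real loader from this module, for example:
#   from histolab.data import cmu_small_region
#   print(describe_sample(cmu_small_region))
```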
References
[1] Schaadt NS, Schönmeyer R, Forestier G, et al. "Graph-based description of tertiary lymphoid organs at single-cell level." PLoS Comput Biol. (2020)