Data

The data module gives access to a set of publicly available whole-slide images (WSIs) stained with different techniques: hematoxylin and eosin (H&E) and immunohistochemistry (IHC). The slides are retrieved from the following repositories:

  • The Cancer Genome Atlas (TCGA): each WSI is retrieved from the URL of its location within the portal, e.g. https://portal.gdc.cancer.gov/files/9c960533-2e58-4e54-97b2-8454dfb4b8c8, as detailed in the docstring of the corresponding function;

  • OpenSlide, a repository of freely-distributed test slides from different scanner vendors;

  • Image Data Resource (IDR): the WSIs are selected from the data collection provided by Schaadt et al. [1] and available at IDR under the accession number idr0073.
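The TCGA entries further down this page list two URLs for each slide: a portal page under https://portal.gdc.cancer.gov/files/ and a download endpoint under https://api.gdc.cancer.gov/data/, both ending in the same file UUID. A minimal sketch of that pattern follows. The helper name `gdc_urls` is hypothetical and is not part of histolab.

```python
import uuid

# URL prefixes as they appear in the TCGA function entries on this page.
GDC_PORTAL_FILES = "https://portal.gdc.cancer.gov/files/"
GDC_API_DATA = "https://api.gdc.cancer.gov/data/"


def gdc_urls(file_id: str) -> dict:
    """Return the portal page URL and the API download URL for a file UUID.

    Hypothetical helper for illustration only; it is not a histolab function.
    """
    # uuid.UUID raises ValueError for malformed identifiers and
    # str() normalises the value to lowercase, hyphenated form.
    canonical = str(uuid.UUID(file_id))
    return {
        "portal": GDC_PORTAL_FILES + canonical,
        "api": GDC_API_DATA + canonical,
    }


# UUID taken from the breast_tissue() entry on this page.
urls = gdc_urls("ad9ed74a-2725-49e6-bf7a-ef100e299989")
print(urls["portal"])
print(urls["api"])
```

Validating the identifier first means a mistyped UUID fails early, before any download is attempted.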

Note

We use Pooch under the hood, which is an optional requirement for histolab and needs to be installed separately with:

pip install pooch

Set of downloadable WSIs.

Tissue              Dimensions (w x h)  Size (MB)  Repository  Staining
------------------  ------------------  ---------  ----------  --------
Aorta               15374x17497         63.8       OpenSlide   H&E
CMU small sample    2220x2967           1.8        OpenSlide   H&E
Breast              96972x30682         299.1      TCGA-BRCA   H&E
Breast (black pen)  121856x94697        1740.8     TCGA-BRCA   H&E
Breast (green pen)  98874x64427         719.6      TCGA-BRCA   H&E
Breast (red pen)    60928x75840         510.9      TCGA-BRCA   H&E
Breast (IHC)        99606x7121          218.3      IDR         IHC
Heart               32672x47076         289.3      OpenSlide   H&E
Kidney              5179x4192           66.1       IDR         IHC
Ovary               30001x33987         389.1      TCGA-OV     H&E
Prostate            16000x15316         46.1       TCGA-PRAD   H&E

TCGA-BRCA: TCGA Breast Invasive Carcinoma dataset; TCGA-PRAD: TCGA Prostate Adenocarcinoma dataset; TCGA-OV: TCGA Ovarian Serous Cystadenocarcinoma dataset.
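Every slide function on this page returns a tuple of an openslide.OpenSlide object and the path where the slide is saved. The following is a minimal usage sketch, assuming histolab, Pooch and OpenSlide are installed and the network is reachable. The helpers `parse_dimensions`, `matches_table` and `show_prostate_slide` are hypothetical names introduced here for illustration.

```python
from typing import Any, Tuple


def parse_dimensions(text: str) -> Tuple[int, int]:
    """Parse a 'WIDTHxHEIGHT' entry from the Dimensions column of the table."""
    width, height = text.lower().split("x")
    return int(width), int(height)


def matches_table(slide: Any, expected: str) -> bool:
    """Compare slide.dimensions (width, height) with a table entry."""
    return tuple(slide.dimensions) == parse_dimensions(expected)


def show_prostate_slide() -> None:
    """Fetch the prostate slide and print a few basic facts about it."""
    # Imported inside the function so that the helpers above remain usable
    # without histolab installed. Calling this triggers a download.
    from histolab.data import prostate_tissue

    slide, path = prostate_tissue()
    print("saved at:", path)
    print("dimensions (w, h):", slide.dimensions)
    # "16000x15316" is the Prostate row of the table of downloadable WSIs.
    print("matches table:", matches_table(slide, "16000x15316"))
```

Calling `show_prostate_slide()` downloads the slide (46.1 MB according to the table), so it is best tried on the smaller entries before the larger TCGA-BRCA slides.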

Thumbnails of available WSIs

aorta_tissue() -> Tuple[openslide.OpenSlide, str]

Aorta tissue, brightfield, JPEG 2000, YCbCr

This image is available here http://openslide.cs.cmu.edu/download/openslide-testdata/Aperio/

Free to use and distribute, with or without modification

Returns

  • aorta_tissue (openslide.OpenSlide) – H&E-stained Whole-Slide-Image of aortic tissue.

  • path (str) – Path where the slide is saved

Return type

Tuple[openslide.OpenSlide, str]

breast_tissue() -> Tuple[openslide.OpenSlide, str]

Breast tissue, TCGA-BRCA dataset.

This image is available here https://portal.gdc.cancer.gov/files/ad9ed74a-2725-49e6-bf7a-ef100e299989 or through the API https://api.gdc.cancer.gov/data/ad9ed74a-2725-49e6-bf7a-ef100e299989

It corresponds to TCGA file TCGA-A8-A082-01A-01-TS1.3cad4a77-47a6-4658-becf-d8cffa161d3a.svs

Access: open

Returns

  • breast_tissue (openslide.OpenSlide) – H&E-stained Whole-Slide-Image of breast tissue.

  • path (str) – Path where the slide is saved

Return type

Tuple[openslide.OpenSlide, str]

breast_tissue_diagnostic_black_pen() -> Tuple[openslide.OpenSlide, str]

Breast tissue, TCGA-BRCA dataset. Diagnostic slide with black pen marks.

This image is available here https://portal.gdc.cancer.gov/files/e70c89a5-1c2f-43f8-b6be-589beea55338 or through the API https://api.gdc.cancer.gov/data/e70c89a5-1c2f-43f8-b6be-589beea55338

It corresponds to TCGA file TCGA-BH-A201-01Z-00-DX1.6D6E3224-50A0-45A2-B231-EEF27CA7EFD2.svs

Access: open

Returns

  • breast_tissue (openslide.OpenSlide) – H&E-stained Whole-Slide-Image of breast tissue with black pen marks.

  • path (str) – Path where the slide is saved

Return type

Tuple[openslide.OpenSlide, str]

breast_tissue_diagnostic_green_pen() -> Tuple[openslide.OpenSlide, str]

Breast tissue, TCGA-BRCA dataset. Diagnostic slide with green pen marks.

This image is available here https://portal.gdc.cancer.gov/files/3845b8bd-cbe0-49cf-a418-a8120f6c23db or through the API https://api.gdc.cancer.gov/data/3845b8bd-cbe0-49cf-a418-a8120f6c23db

It corresponds to TCGA file TCGA-A1-A0SH-01Z-00-DX1.90E71B08-E1D9-4FC2-85AC-062E56DDF17C.svs

Access: open

Returns

  • breast_tissue (openslide.OpenSlide) – H&E-stained Whole-Slide-Image of breast tissue with green pen marks.

  • path (str) – Path where the slide is saved

Return type

Tuple[openslide.OpenSlide, str]

breast_tissue_diagnostic_red_pen() -> Tuple[openslide.OpenSlide, str]

Breast tissue, TCGA-BRCA dataset. Diagnostic slide with red pen marks.

This image is available here https://portal.gdc.cancer.gov/files/682e4d74-2200-4f34-9e96-8dee968b1568 or through the API https://api.gdc.cancer.gov/data/682e4d74-2200-4f34-9e96-8dee968b1568

It corresponds to TCGA file TCGA-E9-A24A-01Z-00-DX1.F0342837-5750-4172-B60D-5F902E2A02FD.svs

Access: open

Returns

  • breast_tissue (openslide.OpenSlide) – H&E-stained Whole-Slide-Image of breast tissue with red pen marks.

  • path (str) – Path where the slide is saved

Return type

Tuple[openslide.OpenSlide, str]

cmu_small_region() -> Tuple[openslide.OpenSlide, str]

Carnegie Mellon University MRXS sample tissue

This image is available here http://openslide.cs.cmu.edu/download/openslide-testdata/Aperio/

Licensed under a CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.

Returns

  • cmu_mrxs_tissue (openslide.OpenSlide) – H&E-stained Whole-Slide-Image of small tissue region.

  • path (str) – Path where the slide is saved

Return type

Tuple[openslide.OpenSlide, str]

file_hash(fname, alg='sha256')

Calculate the hash of a given file.

Useful for checking if a file has changed or been corrupted.

Parameters
  • fname (str) – The name of the file.

  • alg (str) – The type of the hashing algorithm

Returns

hash – The hash of the file.

Return type

str

Examples

>>> fname = "test-file-for-hash.txt"
>>> with open(fname, "w") as f:
...     __ = f.write("content of the file")
>>> print(file_hash(fname))
0fc74468e6a9a829f103d069aeb2bb4f8646bad58bf146bb0e3379b759ec4a00
>>> import os
>>> os.remove(fname)
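
The idea behind such a check can be sketched with the standard library alone. This is an illustration of chunked SHA-256 hashing and comparison against a known digest, not histolab's or Pooch's actual implementation. The names `sha256_of_file` and `is_intact` are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of_file(fname: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat, which matters for slides
    that are hundreds of megabytes in size.
    """
    hasher = hashlib.sha256()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def is_intact(fname: str, known_hash: str) -> bool:
    """Return True if the file's digest equals the known digest."""
    return sha256_of_file(fname) == known_hash.lower()


# Demo on a small temporary file with the same content as the doctest above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "test-file-for-hash.txt")
    with open(demo_path, "w") as f:
        f.write("content of the file")
    demo_digest = sha256_of_file(demo_path)
    demo_ok = is_intact(demo_path, demo_digest.upper())

print(demo_digest)
print(demo_ok)
```

A mismatch between the computed and the recorded digest indicates that the file changed or that a download was truncated or corrupted.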
heart_tissue() -> Tuple[openslide.OpenSlide, str]

Heart tissue, brightfield, JPEG 2000, YCbCr

This image is available here http://openslide.cs.cmu.edu/download/openslide-testdata/Aperio/

Free to use and distribute, with or without modification

Returns

  • heart_tissue (openslide.OpenSlide) – H&E-stained Whole-Slide-Image of heart tissue.

  • path (str) – Path where the slide is saved

Return type

Tuple[openslide.OpenSlide, str]

ihc_breast() -> Tuple[openslide.OpenSlide, str]

Breast cancer resection, staining CD3 (brown) and CD20 (red).

This image is available here https://idr.openmicroscopy.org/ under accession number idr0073, ID breastCancer12.

Returns

  • ihc_breast (openslide.OpenSlide) – IHC-stained Whole-Slide-Image of breast tissue.

  • path (str) – Path where the slide is saved

Return type

Tuple[openslide.OpenSlide, str]

ihc_kidney() -> Tuple[openslide.OpenSlide, str]

Renal allograft, staining CD3 (brown) and CD20 (red).

This image is available here https://idr.openmicroscopy.org/ under accession number idr0073, ID kidney_46_4.

Returns

  • ihc_kidney (openslide.OpenSlide) – IHC-stained Whole-Slide-Image of kidney tissue.

  • path (str) – Path where the slide is saved

Return type

Tuple[openslide.OpenSlide, str]

ovarian_tissue() -> Tuple[openslide.OpenSlide, str]

Tissue of Ovarian Serous Cystadenocarcinoma, TCGA-OV dataset.

This image is available here https://portal.gdc.cancer.gov/files/e968375e-ef58-4607-b457-e6818b2e8431 or through the API https://api.gdc.cancer.gov/data/e968375e-ef58-4607-b457-e6818b2e8431

It corresponds to TCGA file TCGA-13-1404-01A-01-TS1.cecf7044-1d29-4d14-b137-821f8d48881e.svs

Access: open

Returns

  • ovarian_tissue (openslide.OpenSlide) – H&E-stained Whole-Slide-Image of ovarian tissue.

  • path (str) – Path where the slide is saved

Return type

Tuple[openslide.OpenSlide, str]

prostate_tissue() -> Tuple[openslide.OpenSlide, str]

Tissue of Prostate Adenocarcinoma, TCGA-PRAD dataset.

This image is available here https://portal.gdc.cancer.gov/files/5a8ce04a-0178-49e2-904c-30e21fb4e41e or through the API https://api.gdc.cancer.gov/data/5a8ce04a-0178-49e2-904c-30e21fb4e41e

It corresponds to TCGA file TCGA-CH-5753-01A-01-BS1.4311c533-f9c1-4c6f-8b10-922daa3c2e3e.svs

Access: open

Returns

  • prostate_tissue (openslide.OpenSlide) – H&E-stained Whole-Slide-Image of prostate tissue.

  • path (str) – Path where the slide is saved

Return type

Tuple[openslide.OpenSlide, str]

References

[1] Schaadt NS, Schönmeyer R, Forestier G, et al. “Graph-based description of tertiary lymphoid organs at single-cell level.” PLoS Comput Biol. (2020)