Home > UNDERSTAND THE DATA (2nd Challenge)



Please note that this is the legacy page of the 2nd Challenge, which concluded in Feb 2024.


Study Outline


To study the long-term effects of priming between the acellular pertussis (aP) vs. whole-cellular pertussis (wP) vaccines, we have recruited individuals born in the US prior to 1995 and those born after. Those born before 1995 have been primed with the wP vaccine in infancy while those individuals born later have been primed with the aP vaccine. The recruited individuals are those now eligible for Tdap booster vaccinations which contain tetanus toxoid (T) diphtheria toxoid (d), and acellular pertussis (ap; FHA, Fim2/3, PRN, and pertussis toxin) antigens.


Study specimens will be collected longitudinally from blood at pre- (day 0) and post-vaccination (days 1, 3, 7, and 14) time points. In total, approximately 5000+ specimens will be collected and analyzed during the course of five years. For each challenge, we aim to have 16 aP donors and 16 wP donors. This data would allow us to understand how the immune system responds to pertussis antigenic challenge in vivo as a proxy of infectious challenge and whether this response differs in aP vs. wP-primed individuals years after the original vaccination.


Screen Shot 2023-09-21 at 10.57.27 AM.png

Figure 7: Study timeline


Sample and data collection


In this second phase of the CMI-PB project, we present longitudinal data generated from 500+ blood specimens provided by 118 subjects using four assays: PBMC Cell frequency, PBMC Gene expression, Plasma cytokine concentrations, and Plasma antibody titers.

PBMCs (peripheral blood mononuclear cells) are a heterogeneous mixture of different cell types that vary substantially in frequency between individuals even in the absence of immune perturbations. Differences in cell type frequencies in PBMCs across individuals are expected to impact gene expression patterns.


More on the data:

  • We systematically analyzed variations in cell type frequencies across different time points for N=73 subjects.
  • We performed gene expression analysis on PBMC samples collected longitudinally at baseline and following booster vaccination. After removing samples that did not pass quality controls, complete-time courses were obtained for N = 92 subjects.
  • To determine how Tdap vaccination impacts proteins secreted in the blood, we measured plasma cytokine concentrations to uncover vaccination-related perturbations of 45 unique plasma protein markers. Plasma samples (subjects, N = 74) were profiled and evaluated for expression of proteins using a highly sensitive and quantitative proximity extension assay (PEA).
  • We then profiled the plasma levels at baseline up to 3 months after the boost of antigen-specific IgG, all IgG subclasses (IgG1-4), and IgE.


Outline of recruitment and study design

Figure 8: Study outline


This data has worked for designing the database schema, testing data models, and demonstrating visualization tools that will be available for public access through a web interface. Furthermore, this data has been used for the second prediction challenge, where datasets have been split into training and test datasets.


Our approach to data standardization


Standardization and nomenclature are key aspects of the CMI-PB project to allow for a consistent understanding of the data available. Here, the central database, discussed below, will play an important role by unifying existing scientific nomenclature and classification with a data structure suitable for analysis. For example, each macromolecular entry in experimental data will be linked to its standard nomenclatures and classification resources as well as via CMI-PB terminology.


Approach for the CMI-PB central database schema


CMI-PB central database is implemented as a Relational Database Management System (RDBMS), and as such, consists primarily of a collection of table definitions, each of which corresponds to a high-level category of entity. Above these tables, the CMI-PB central database is divided into three different categories, where each category features distinct domains of the data (Figure 9). These table categories include information, experimental data and metadata/ontology tables. These categories represent the top level in the data hierarchy used to store different sets of information into the CMI-PB database.


CMI-PB Database schema

Figure 9: CMI-PB central database schema


A) Information tables:

  • subject: Each row in the subject table represents an individual given a unique identification number (column name: subject_id). Other fields provide information about infancy vaccination status, TDaP (tetanus, diphtheria, and acellular pertussis) vaccination booster date, and subject-specific information such as date of birth, biological sex, ethnicity, race, etc.
  • specimen: Each row in the subject table represents a clinical sample e.g., blood (specimen_id) for a subject (subject_id) at the particular visit. Additionally, information about the Nth visit, sample type, e.g., PBMC and days count relative to the boost is provided.

B) Experimental data tables:

These tables include data from mass cytometry, gene expression, antibody titer, and protein expression experiments. Each experimental data table has experiment-specific information.

  • PBMC_cell_frequency: Each row in the PBMC_cell_frequency table represents a specimen (specimen_id) along with the names and values for measured analytes, such as cell population name and percent live cell.
  • plasma_gene_expression: Each row in the plasma_gene_expression table represents a specimen (specimen_id) along with the names and values for measured analytes such as gene (versioned_ensembl_gene_id) and tpm and raw count for each gene.
  • plasma_ab_titer: Each row in the plasma_ab_titer table represents a specimen (specimen_id) along with the names and values for measured analytes such as antigen and IgG isotype with corresponding antibody titer. Antibody titer is represented by Mean Fluorescence Intensity (MFI) units.
  • plasma_cytokine_concentrations: Each row in the plasma_cytokine_concentrations table represents a specimen (specimen_id) along with the names and values for measured analytes such as the Olink protein identifier with corresponding protein expression values (cytokine concentration, pg/ml) and quality control values.

C) Metadata/ontology tables:

This category includes metadata about genes, protein identifiers, and supplementary information to experimental data tables. For instance, cell populations are identified using the gating definition, and it is redundant to keep this information within the PBMC_cell_frequency table. Therefore, a separate metadata table is created and linked.

  • gene, transcript, and protein: These three metadata tables include gene and protein mapping information available in public databases such as Ensembl, Entrez, Uniprot, etc. Herein, gene, transcript, and protein ids are mapped to one another.
  • cell_type: It provides additional information about each cell population with their corresponding gating scheme used.

Browse our terminology


CMI-PB terminology browser is a system made for creating, maintaining, and sharing an application ontology that builds on multiple reference resources. It will serve as a reference resource to annotate and systematically explain the terms used for the study design, experimental data, and metadata. We intend for the CMI-PB terminology browser to be general (as it has the potential to be used for other projects), but the primary goal is to serve this CMI-PB project. Users can search, browse, and query the CMI-PB Terminology browser using a uniform web-based interface and API.




Last updated: Aug. 2, 2024, 1:47 a.m.