ProteinWeaver Data Log & Version

This section of the documentation outlines the data sources, processing steps and versions of the ProteinWeaver web interface.

Drosophila melanogaster Data Sources (TXID7227)

2023-09-29 (BETA):

Interaction data:

GO association data:

2024-03-18:

GO association data:

FlyBase IDs from UniProt IDs for mapping:

  • idmapping_2024_03_18.tsv (Source)
  • Downloaded from UniProt and merged with GO data from QuickGO to match the FlyBase naming convention. Renamed columns to "GENE_PRODUCT_ID" and "FB_ID" and merged in scripts/SubColNames.R.

2024-04-01:

  • Added 415,493 inferred ProGo edges using a Cypher command.

2024-04-03:

GO association data:

2024-07-30:

  • Added PubMed identifiers as a property to ProPro edges.

2024-07-31:

Regulatory data:

Current D. melanogaster Network [Updated 2024-08-28]

| Nodes (All) | Nodes (PPI-Only) | Nodes (GRN-Only) | Nodes (Shared) | Interactions (ProPro) | Interactions (Reg) |
| ----------- | ---------------- | ---------------- | -------------- | --------------------- | ------------------ |
| 12823       | 7905             | 1322             | 3596           | 233054                | 17530              |

| GO Terms | Annotations (All) | Annotations (Direct) | Annotations (Inferred) |
| -------- | ----------------- | -------------------- | ---------------------- |
| 11774    | 492331            | 98799                | 393532                 |

Bacillus subtilis Data Sources (TXID224308)

2023-10-18 (BETA):

Interaction data:

GO association data:

wget 'https://golr-aux.geneontology.io/solr/select?defType=edismax&qt=standard&indent=on&wt=csv&rows=100000&start=0&fl=source,bioentity_internal_id,bioentity_label,qualifier,annotation_class,reference,evidence_type,evidence_with,aspect,bioentity_name,synonym,type,taxon,date,assigned_by,annotation_extension_class,bioentity_isoform&facet=true&facet.mincount=1&facet.sort=count&json.nl=arrarr&facet.limit=25&hl=true&hl.simple.pre=%3Cem%20class=%22hilite%22%3E&hl.snippets=1000&csv.encapsulator=&csv.separator=%09&csv.header=false&csv.mv.separator=%7C&fq=document_category:%22annotation%22&fq=taxon_subset_closure_label:%22Bacillus%20subtilis%20subsp.%20subtilis%20str.%20168%22&facet.field=aspect&facet.field=taxon_subset_closure_label&facet.field=type&facet.field=evidence_subset_closure_label&facet.field=regulates_closure_label&facet.field=isa_partof_closure_label&facet.field=annotation_class_label&facet.field=qualifier&facet.field=annotation_extension_class_closure_label&facet.field=assigned_by&facet.field=panther_family_label&q=*:*'
  • Renamed the downloaded file to bsub_go_uniprot.tsv, then processed and merged it into bsub_GO_data.csv according to scripts/JoinBSUtoUniProt.R.

2024-03-18:

GO association data:

BSU IDs from UniProt IDs for mapping:

2024-04-01:

  • Added 39,215 inferred ProGo edges using a Cypher command.

2024-04-03:

  • No "NOT" qualifiers were found in the dataset so there were no changes to the B. subtilis data structure during this update.

2024-06-11:

2024-06-24:

  • Removed "self-edges" from PPI data.

2024-07-30:

2024-07-31:

Regulatory data:

Current B. subtilis Network [Updated 2024-08-28]

| Nodes (All) | Nodes (PPI-Only) | Nodes (GRN-Only) | Nodes (Shared) | Interactions (ProPro) | Interactions (Reg) |
| ----------- | ---------------- | ---------------- | -------------- | --------------------- | ------------------ |
| 3163        | 484              | 1230             | 1449           | 6441                  | 5634               |

| GO Terms | Annotations (All) | Annotations (Direct) | Annotations (Inferred) |
| -------- | ----------------- | -------------------- | ---------------------- |
| 3681     | 78015             | 14384                | 63631                  |

Danio rerio Data Sources (TXID7955)

2024-03-18:

Interaction data:

  • zfish_string_db_results.csv merged into zfish_interactome_Mar12_2024.txt. (Source)
  • Downloaded 7955.protein.physical.links.full.v12.0.txt.gz from STRING-DB and filtered it to experimentally validated, database-curated, and text-mined interactions according to scripts/ZebrafishDataMerging.Rmd (Mar. 12, 2024).

  • 7955.protein.aliases.v12.0.txt merged into zfish_interactome_Mar12_2024.txt (Source)

  • Downloaded the file from STRING-DB to provide UniProt IDs for STRING-DB aliases.

  • zfish_psicquic_results.csv merged into zfish_interactome_Mar12_2024.txt (Source)

  • Used the Python notebook scripts/GetXML.ipynb to scrape all entries for “Danio rerio” from the REST API. Removed every <entrySet> tag between the first and last instance, and every <xml> tag except the first. Extracted the interactions and interactors and converted them from XML to JSON using scripts/get-interactions.js and scripts/get-interactors.js, then converted the resulting JSON files to CSV using a free online converter. Merged interactions.csv and interactors.csv into zfish_psicquic_results.csv using scripts/ZebrafishDataMerging.Rmd. Some UniProt IDs were retrieved from the IntAct entry using the IntAct ID, as documented in the Rmd.

  • zfish_id_mapper.tsv merged into zfish_interactome_Mar12_2024.txt (Source)

  • Retrieved updated UniProt entries and common names for 11,765 entries. 2,781 protein entries were found to be obsolete and therefore had no name available on UniProt. These were removed and separated into their own dataset.
  • The resulting dataset had 6,438 unique proteins.

  • zfish_gene_names.tsv merged into zfish_interactome_Mar12_2024.txt (Source)

  • Retrieved gene names for the 6,438 D. rerio proteins (zfish_unique_protein_ids_Mar12_24.txt) from UniProt's name mapping service. Entries with a "gene name" use the gene name as the node name; entries without one use the first portion of the "protein name". This rule was chosen to ensure unique node names in the user interface.

  • Merged all D. rerio data together into one master file using the instructions in scripts/ZebrafishDataMerging.Rmd.
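
The naming rule described above (gene name when available, otherwise the first portion of the protein name) could be sketched as follows. This is a minimal illustration, not the code in scripts/ZebrafishDataMerging.Rmd: the function name `display_name` is hypothetical, and treating "the first portion" as the text before the first parenthetical is an assumption.

```python
def display_name(gene_name, protein_name):
    """Pick a node label for the user interface.

    Assumption: "first portion of the protein name" means the text
    before the first " (" parenthetical; the real script may differ.
    """
    gene_name = (gene_name or "").strip()
    if gene_name:
        # Prefer the gene name whenever UniProt provides one.
        return gene_name
    # Fall back to the leading part of the protein name.
    return (protein_name or "").split(" (")[0].strip()


# Hypothetical example rows, for illustration only.
print(display_name("tp53", "Cellular tumor antigen p53"))
print(display_name("", "Uncharacterized protein (Fragment)"))
```

Stripping whitespace in both branches also avoids the kind of trailing-whitespace mismatch that was later cleaned up in the 2024-06-24 update.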

GO association data:

2024-04-01:

  • Added 86,304 inferred ProGo edges using a Cypher command.

2024-04-03:

GO association data:

2024-06-11:

  • Added alt_name parameter to Neo4j import statement.

2024-06-24:

  • Removed trailing whitespace from some names according to ZebrafishDataMerging.Rmd.
  • Removed "self-edges" from PPI data.

2024-07-30:

2024-07-31:

Regulatory data:

Current D. rerio Network [Updated 2024-08-28]

| Nodes (All) | Nodes (PPI-Only) | Nodes (GRN-Only) | Nodes (Shared) | Interactions (ProPro) | Interactions (Reg) |
| ----------- | ---------------- | ---------------- | -------------- | --------------------- | ------------------ |
| 4106        | 411              | 1083             | 2612           | 13915                 | 78223              |

| GO Terms | Annotations (All) | Annotations (Direct) | Annotations (Inferred) |
| -------- | ----------------- | -------------------- | ---------------------- |
| 7868     | 202845            | 42898                | 159947                 |

Caenorhabditis elegans Data Sources (TXID6239)

2024-08-28:

Interaction data:

Regulatory data:

GO association data:

Current C. elegans Network [Updated 2024-08-28]

| Nodes (All) | Nodes (PPI-Only) | Nodes (GRN-Only) | Nodes (Shared) | Interactions (ProPro) | Interactions (Reg) |
| ----------- | ---------------- | ---------------- | -------------- | --------------------- | ------------------ |
| 16606       | 2833             | 10168            | 3605           | 45003                 | 17530              |

| GO Terms | Annotations (All) | Annotations (Direct) | Annotations (Inferred) |
| -------- | ----------------- | -------------------- | ---------------------- |
| 8321     | 133619            | 29065                | 104554                 |

Saccharomyces cerevisiae Data Sources (TXID559292)

2024-08-28:

Interaction data:

Added physical interaction data (Source) and processed into interactome-txid559292-2024_08_19.txt according to yeast.R.

Regulatory data:

GO association data:

Added GO association data from QuickGO (Source) and processed into yeast_go_annotation-2024-08-08.tsv according to yeast.R using the UniProt namespace mapper.

Current S. cerevisiae Network [Updated 2024-08-28]

| Nodes (All) | Nodes (PPI-Only) | Nodes (GRN-Only) | Nodes (Shared) | Interactions (ProPro) | Interactions (Reg) |
| ----------- | ---------------- | ---------------- | -------------- | --------------------- | ------------------ |
| 7644        | 1092             | 858              | 5694           | 164432                | 237315             |

| GO Terms | Annotations (All) | Annotations (Direct) | Annotations (Inferred) |
| -------- | ----------------- | -------------------- | ---------------------- |
| 8300     | 328186            | 69760                | 258426                 |

Gene Ontology Hierarchy Data Sources

2023-09-29:

Common name:

Relationships:

2024-03-28:

2024-07-17:

  • [go_2024-07-17.obo] processed into go_2024-07-17.txt, is_a_import_2024-07-17.tsv, and relationship_import_2024-07-17.tsv.
  • [is_a_import_2024-07-17.tsv] created with scripts/ParseOntologyRelationship.ipynb.
  • [relationship_import_2024-07-17.tsv] created with scripts/ParseOntologyRelationship.ipynb.
  • [go_2024-07-17.txt] created with scripts/ParseOBOtoTXT.ipynb and scripts/GeneOntologyNeverAnnotate.R.
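
As a rough illustration of how "is_a" pairs can be pulled out of an OBO file, the sketch below walks the [Term] stanzas and collects (child, parent) IDs. It is not the code in scripts/ParseOntologyRelationship.ipynb; the function name `parse_is_a` and the sample text are assumptions made for this example.

```python
def parse_is_a(obo_text):
    """Return (child_id, parent_id) pairs from the [Term] stanzas of OBO text."""
    pairs = []
    current_id = None
    in_term = False
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            current_id = None
        elif in_term and line.startswith("id: "):
            current_id = line[len("id: "):]
        elif in_term and current_id and line.startswith("is_a: "):
            # "is_a: GO:0048308 ! organelle inheritance" -> keep only the ID.
            parent = line[len("is_a: "):].split("!")[0].strip()
            pairs.append((current_id, parent))
    return pairs


# Small hand-written sample in OBO syntax, for illustration only.
SAMPLE = """format-version: 1.2

[Term]
id: GO:0000001
name: mitochondrion inheritance
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

[Typedef]
id: part_of
name: part of
"""

for child, parent in parse_is_a(SAMPLE):
    print(child, parent, sep="\t")
```

Printing tab-separated pairs mirrors the shape of a two-column TSV import file, although the actual column layout of is_a_import_2024-07-17.tsv is not documented here.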

Gene Ontology Data Structure [Updated 2024-07-17]

| GO Terms | "is_a" Relationships (GoGo) |
| -------- | :-------------------------- |
| 42092    | 66168                       |

Taxon ID source:

NCBI taxonomy browser: looked up each species name to obtain its taxon ID.

Versioning & Dates

2023-09-29 -- 2024-03-17 (BETA):

  • Imported weighted D. melanogaster interactome and FlyBase annotations.
  • Imported raw GO data and "is_a" relationships.

2024-03-18:

  • Added D. rerio protein interactome and GO association data.
  • Updated B. subtilis and D. melanogaster GO association networks with QuickGO data.

2024-03-28:

  • Added blacklist indicator to GO term nodes that should never have an annotation.

2024-04-01:

  • Added inferred ProGo edges from descendant ProGo edges. This means that proteins annotated to a specific GO term, such as Mbs to enzyme inhibitor activity, will also be annotated to that GO term's ancestors, such as molecular function inhibitor activity and molecular_function.

| Species         | Inferred Edges |
| --------------- | :------------- |
| D. melanogaster | 415,493        |
| B. subtilis     | 39,215         |
| D. rerio        | 86,304         |
| Total           | 541,012        |
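
The propagation described in this update was done with a Cypher command that is not reproduced here. The same idea can be sketched in plain Python: for each direct annotation, walk up the "is_a" hierarchy and add an inferred annotation for every ancestor. The function name `infer_annotations` and the two-level toy hierarchy are assumptions for illustration; the real GO hierarchy has more parents per term.

```python
def infer_annotations(direct, parents):
    """Return inferred (protein, go_term) pairs.

    direct  -- set of (protein, go_term) direct annotations
    parents -- dict mapping a GO term to the set of its "is_a" parents
    """
    inferred = set()
    for protein, term in direct:
        stack = list(parents.get(term, ()))
        seen = set()
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            stack.extend(parents.get(ancestor, ()))
            # Do not duplicate an annotation that already exists directly.
            if (protein, ancestor) not in direct:
                inferred.add((protein, ancestor))
    return inferred


# Toy hierarchy built from the Mbs example above (simplified).
parents = {
    "enzyme inhibitor activity": {"molecular function inhibitor activity"},
    "molecular function inhibitor activity": {"molecular_function"},
}
direct = {("Mbs", "enzyme inhibitor activity")}

for protein, term in sorted(infer_annotations(direct, parents)):
    print(protein, "->", term)
```

Tracking visited ancestors keeps the walk finite even when a term is reachable through several parents.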

2024-04-03:

  • Removed annotations carrying a "NOT" qualifier (gene products that, based on experimental or other evidence, should not be annotated to the GO term) from all GO association datasets.
  • Repropagated the "inferred_from_descendant" edges to ensure no false propagation of "NOT" qualifiers.

| Species         | Inferred Edges |
| --------------- | :------------- |
| D. melanogaster | 413,704        |
| B. subtilis     | 39,215         |
| D. rerio        | 86,216         |
| Total           | 539,135        |
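
Filtering out "NOT" annotations before repropagating can be sketched as below. The column name `qualifier` and the pipe-separated qualifier format (for example `NOT|enables`) are assumptions based on common GO annotation exports; the function name `drop_not_annotations` and the sample rows are hypothetical.

```python
def drop_not_annotations(rows):
    """Keep only annotation rows whose qualifier has no NOT component."""
    kept = []
    for row in rows:
        parts = row.get("qualifier", "").split("|")
        if "NOT" not in (p.strip().upper() for p in parts):
            kept.append(row)
    return kept


# Hypothetical rows, for illustration only.
rows = [
    {"protein": "P1", "go_term": "GO:0000001", "qualifier": "enables"},
    {"protein": "P2", "go_term": "GO:0000001", "qualifier": "NOT|enables"},
]

for row in drop_not_annotations(rows):
    print(row["protein"], row["go_term"])
```

Running the filter first and the ancestor propagation second means a negated annotation never seeds inferred edges.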

2024-06-11:

  • Added B. subtilis interaction data from STRING-DB and updated QuickGO annotations.
  • Added alt_name parameters to B. subtilis and D. rerio nodes.

| Species         | Inferred Edges |
| --------------- | :------------- |
| D. melanogaster | 413,704        |
| B. subtilis     | 54,270         |
| D. rerio        | 86,216         |
| Total           | 554,190        |

2024-06-24:

  • Removed trailing whitespace from D. rerio data.
  • Removed "self-edges" (interactions between two copies of the same protein) to improve path algorithm performance.
    • 309 "self-edges" were removed from the B. subtilis and D. rerio data.
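
A minimal sketch of self-edge removal, assuming each interaction is a pair of protein IDs. The function name `remove_self_edges` and the sample IDs are hypothetical.

```python
def remove_self_edges(edges):
    """Drop edges whose two endpoints are the same protein.

    Returns the filtered edge list and the number of edges removed.
    """
    kept = [(a, b) for a, b in edges if a != b]
    return kept, len(edges) - len(kept)


# Hypothetical protein IDs, for illustration only.
edges = [("P1", "P2"), ("P2", "P2"), ("P3", "P1")]
kept, removed = remove_self_edges(edges)
print(kept, removed)
```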

2024-07-17:

  • Updated full Gene Ontology dataset including hierarchy and descriptions of GO terms.
  • The Gene Ontology removed 660 GO terms in this update, which in turn removed the previously existing edges to those terms.
    • Removed 28,571 ProGo edges in D. melanogaster.
    • Removed 5,553 ProGo edges in B. subtilis.
    • Removed 5,619 ProGo edges in D. rerio.
    • Removed 2,140 GoGo edges.
    • Removed 41,883 edges total.

2024-07-30:

  • Added properties to ProPro edges that provide more information about the protein-protein interaction.
    • D. melanogaster (TXID7227): added PubMed IDs.
    • B. subtilis (TXID224308): added links to STRING-DB entries.
    • D. rerio (TXID7955): added PubMed IDs where available and STRING-DB links otherwise.

2024-07-31:

  • Added genetic regulatory interactions and sources for D. melanogaster, B. subtilis, and D. rerio.
    • D. melanogaster (TXID7227):
      • Nodes: 1,322
      • Reg Edges: 17,530
      • ProGo Edges: 10,460
    • B. subtilis (TXID224308):
      • Nodes: 1,230
      • Reg Edges: 5,634
      • ProGo Edges: 18,547
    • D. rerio (TXID7955):
      • Nodes: 10,168
      • Reg Edges: 25,960
      • ProGo Edges: 30,540

2024-08-19:

  • Added data source as a column to all regulatory and physical datasets.
  • Updated naming convention for datasets: [type-txid-TODAY_DATE.txt].
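
The naming convention above can be illustrated with a small helper. The underscore-separated date format and the `txid` prefix are inferred from the file name interactome-txid559292-2024_08_19.txt mentioned elsewhere in this log; the function name `dataset_filename` is hypothetical.

```python
from datetime import date


def dataset_filename(kind, txid, day=None):
    """Build a dataset file name of the form type-txid-DATE.txt."""
    day = day or date.today()
    # The date is written as YYYY_MM_DD (assumed from existing file names).
    return f"{kind}-txid{txid}-{day:%Y_%m_%d}.txt"


print(dataset_filename("interactome", 559292, date(2024, 8, 19)))
```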

2024-08-28:

  • Added two new species: C. elegans and S. cerevisiae.
    • C. elegans (TXID6239):
      • Nodes: 4,106
      • PPI Edges: 13,915
      • Reg Edges: 78,223
      • ProGo Edges: 202,845
    • S. cerevisiae (TXID559292):
      • Nodes: 7,644
      • PPI Edges: 164,432
      • Reg Edges: 237,315
      • ProGo Edges: 328,186