ProteinWeaver Data Log & Version
This section of the documentation outlines the data sources, processing steps and versions of the ProteinWeaver web interface.
Drosophila melanogaster Data Sources (TXID7227)
2023-09-29 (BETA):
Interaction data:
GO association data:
2024-03-18:
GO association data:
`dmel_GO_data_Mar15_24.tsv` (Source)
- Downloaded and merged data together in `scripts/SubColNames.R`.
FlyBase IDs from UniProt IDs for mapping:
`idmapping_2024_03_18.tsv` (Source)
- Downloaded from UniProt and merged with GO data from QuickGO to match the FlyBase naming convention. Renamed columns to "GENE_PRODUCT_ID" and "FB_ID" and merged in `scripts/SubColNames.R`.
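The mapping step above can be sketched in Python. This is only an illustration of joining GO rows to FlyBase IDs on the "GENE_PRODUCT_ID" column; the actual merge lives in `scripts/SubColNames.R`, and the `GO_TERM` column, the inner-join behavior, and all sample values below are assumptions made for the example.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def attach_flybase_ids(go_rows, mapping_rows):
    """Add an FB_ID to each GO row by looking up its GENE_PRODUCT_ID."""
    # Build a UniProt -> FlyBase lookup from the mapping table.
    uniprot_to_fb = {row["GENE_PRODUCT_ID"]: row["FB_ID"] for row in mapping_rows}
    merged = []
    for row in go_rows:
        fb_id = uniprot_to_fb.get(row["GENE_PRODUCT_ID"])
        if fb_id is None:
            continue  # assumption: rows without a FlyBase ID are dropped
        merged.append({**row, "FB_ID": fb_id})
    return merged


# Made-up placeholder rows, not real identifiers.
go_rows = read_tsv("GENE_PRODUCT_ID\tGO_TERM\nUNIPROT_A\tTERM_1\nUNIPROT_B\tTERM_2\n")
mapping_rows = read_tsv("GENE_PRODUCT_ID\tFB_ID\nUNIPROT_A\tFLYBASE_A\n")
merged = attach_flybase_ids(go_rows, mapping_rows)
```

Here `merged` holds only the `UNIPROT_A` row, now carrying `FB_ID`; whether unmapped rows should be dropped or kept is a design choice the R script may make differently.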
2024-04-01:
- Added 415,493 inferred ProGo edges using a Cypher command.
2024-04-03:
GO association data:
gene_association_fb_2024-04-03.tsv
dmel_GO_data_2024-04-03.tsv
- Removed qualifiers with "NOT" preceding them using `scripts/RemoveNotQualifier.R`.
- Reduced inferred ProGo edges to 413,704.
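As a rough illustration of the "NOT" filtering step, the sketch below drops any annotation whose qualifier begins with "NOT". The real logic is in `scripts/RemoveNotQualifier.R`; the `QUALIFIER` column name and the sample rows are assumptions for this example.

```python
def drop_not_annotations(rows, qualifier_key="QUALIFIER"):
    """Keep only annotations whose qualifier does not start with NOT."""
    kept = []
    for row in rows:
        qualifier = row.get(qualifier_key, "").strip().upper()
        if qualifier.startswith("NOT"):
            continue  # negated annotation: the protein is NOT associated with the term
        kept.append(row)
    return kept


# Hypothetical rows for illustration only.
annotations = [
    {"GENE_PRODUCT_ID": "protA", "GO_TERM": "term1", "QUALIFIER": "enables"},
    {"GENE_PRODUCT_ID": "protB", "GO_TERM": "term1", "QUALIFIER": "NOT|enables"},
]
filtered = drop_not_annotations(annotations)
```

Filtering before propagation matters because an inferred edge derived from a negated annotation would assert the opposite of the evidence.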
2024-07-30:
- Added PubMed identifiers as a property to ProPro edges.
2024-07-31:
Regulatory data:
- Added genetic regulatory data (Source) and processed into `regulatory_txid7227_2024-07-31.txt` according to `SplitRegulatoryColumns7227.R`.
  - Nodes: 1322
  - Reg Edges: 17530
Current D. melanogaster Network [Updated 2024-08-28]
| Nodes (All) | Nodes (PPI-Only) | Nodes (GRN-Only) | Nodes (Shared) | Interactions (ProPro) | Interactions (Reg) |
| ----------- | ---------------- | ---------------- | -------------- | --------------------- | ------------------ |
| 12823 | 7905 | 1322 | 3596 | 233054 | 17530 |
| GO Terms | Annotations (All) | Annotations (Direct) | Annotations (Inferred) |
| -------- | ----------------- | -------------------- | ---------------------- |
| 11774 | 492331 | 98799 | 393532 |
Bacillus subtilis Data Sources (TXID224308)
2023-10-18 (BETA):
Interaction data:
`bsub_interactome.csv`
- Source
- Exported the “Interaction” set and renamed to `bsub_interactome.csv`.
GO association data:
`subtiwiki.gene.export.2023-10-18.tsv` processed and merged into `bsub_GO_data.csv` (Source)
- Exported the “Gene” set with all options selected. Processed and merged the file according to `scripts/JoinBSUtoUniProt.R`.
- Selected all annotations for B. subtilis and used the following bash command to download:
wget 'https://golr-aux.geneontology.io/solr/select?defType=edismax&qt=standard&indent=on&wt=csv&rows=100000&start=0&fl=source,bioentity_internal_id,bioentity_label,qualifier,annotation_class,reference,evidence_type,evidence_with,aspect,bioentity_name,synonym,type,taxon,date,assigned_by,annotation_extension_class,bioentity_isoform&facet=true&facet.mincount=1&facet.sort=count&json.nl=arrarr&facet.limit=25&hl=true&hl.simple.pre=%3Cem%20class=%22hilite%22%3E&hl.snippets=1000&csv.encapsulator=&csv.separator=%09&csv.header=false&csv.mv.separator=%7C&fq=document_category:%22annotation%22&fq=taxon_subset_closure_label:%22Bacillus%20subtilis%20subsp.%20subtilis%20str.%20168%22&facet.field=aspect&facet.field=taxon_subset_closure_label&facet.field=type&facet.field=evidence_subset_closure_label&facet.field=regulates_closure_label&facet.field=isa_partof_closure_label&facet.field=annotation_class_label&facet.field=qualifier&facet.field=annotation_extension_class_closure_label&facet.field=assigned_by&facet.field=panther_family_label&q=*:*'
- File was renamed to `bsub_go_uniprot.tsv`, processed and merged into `bsub_GO_data.csv` according to the `scripts/JoinBSUtoUniProt.R` file.
2024-03-18:
GO association data:
`bsub_GO_data_Mar18_24.tsv` (Source)
- Downloaded and merged data together in `scripts/SubColNames.R` and imported with `data/README.md`.
BSU IDs from UniProt IDs for mapping:
`subtiwiki.gene.export.2024-03-18.tsv` (Source)
- Selected BSU and UniProt outlinks from menu and exported. Renamed columns to "GENE_PRODUCT_ID" and "BSU_ID" to remove special characters. Merged in `scripts/SubColNames.R`.
2024-04-01:
- Added 39,215 inferred ProGo edges using a Cypher command.
2024-04-03:
- No "NOT" qualifiers were found in the dataset so there were no changes to the B. subtilis data structure during this update.
2024-06-11:
- Added new interaction data from STRING-DB.
  - Downloaded the full physical interactions file `224308.protein.physical.links.full.v12.0.txt` and `224308.protein.info.v12.0.txt`, merged both into `interactome_txid224308_2024-06-06.txt`, and cleaned according to `BsubDataMerging.Rmd`.
- Added updated GO term edges for B. subtilis after new data import.
  - Downloaded all reviewed annotations from QuickGO (Source) and downloaded the UniProt and BSU ID mapper `subtiwiki.gene.export.2024-06-03.tsv` from SubtiWiki.
  - Merged the two into `annotations_txid224308_2024-06-03.txt` according to `BsubDataMerging.Rmd`.
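The merge of the STRING-DB links and info files might look roughly like the Python sketch below. The real procedure is in `BsubDataMerging.Rmd`; here it is assumed that the links file is whitespace-delimited with the two protein IDs in the first two columns, that the info file is tab-delimited with the STRING ID and preferred name in the first two columns, and that each file has one header line. All identifiers in the sample are placeholders.

```python
def name_interactions(links_text, info_text):
    """Replace STRING protein IDs in an edge list with preferred names."""
    # Build STRING ID -> preferred name from the info file (skip header line).
    names = {}
    for line in info_text.splitlines()[1:]:
        fields = line.split("\t")
        if len(fields) >= 2:
            names[fields[0]] = fields[1]

    edges = []
    for line in links_text.splitlines()[1:]:  # skip header line
        fields = line.split()
        if len(fields) < 2:
            continue
        a, b = fields[0], fields[1]
        # Assumption: edges with an unnamed endpoint are discarded.
        if a in names and b in names:
            edges.append((names[a], names[b]))
    return edges


# Placeholder content, not real STRING-DB records.
links_text = (
    "protein1 protein2 combined_score\n"
    "224308.protA 224308.protB 900\n"
    "224308.protA 224308.protZ 500\n"
)
info_text = (
    "#string_protein_id\tpreferred_name\tprotein_size\tannotation\n"
    "224308.protA\tgeneA\t100\texample\n"
    "224308.protB\tgeneB\t200\texample\n"
)
named_edges = name_interactions(links_text, info_text)
```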
2024-06-24:
- Removed "self-edges" from PPI data.
2024-07-30:
- `interactome_txid224308_2024-07-30.txt`: Added "link" and "source" properties to edges to link to STRING-DB entries for ProPro interactions.
2024-07-31:
Regulatory data:
- Downloaded genetic regulatory data (Regulations dataset) and renamed it `regulatory_txid224308_2024-07-31.csv`. Imported without further modifications.
  - Nodes: 1230
  - Reg Edges: 5634
Current B. subtilis Network [Updated 2024-08-28]
| Nodes (All) | Nodes (PPI-Only) | Nodes (GRN-Only) | Nodes (Shared) | Interactions (ProPro) | Interactions (Reg) |
| ----------- | ---------------- | ---------------- | -------------- | --------------------- | ------------------ |
| 3163 | 484 | 1230 | 1449 | 6441 | 5634 |
| GO Terms | Annotations (All) | Annotations (Direct) | Annotations (Inferred) |
| -------- | ----------------- | -------------------- | ---------------------- |
| 3681 | 78015 | 14384 | 63631 |
Danio rerio Data Sources (TXID7955)
2024-03-18:
Interaction data:
- `zfish_string_db_results.csv` merged into `zfish_interactome_Mar12_2024.txt`. (Source)
  - Downloaded file `7955.protein.physical.links.full.v12.0.txt.gz` from STRING-DB and filtered to experimentally validated, database-curated, and textmined interactions according to `scripts/ZebrafishDataMerging.Rmd`. Mar. 12, 2024.
- `7955.protein.aliases.v12.0.txt` merged into `zfish_interactome_Mar12_2024.txt` (Source)
  - Downloaded file from STRING-DB to provide UniProt IDs for STRING-DB aliases.
- `zfish_psicquic_results.csv` merged into `zfish_interactome_Mar12_2024.txt` (Source)
  - Used a Python script `scripts/GetXML.ipynb` to scrape all entries for “Danio rerio” from the REST API. Removed all `<entrySet>` tags that were in between the first and last instance. All `<xml>` tags but the first were removed from the file. Got data for interactions and interactors and converted XML format to JSON using `scripts/get-interactions.js` and `scripts/get-interactors.js`. Converted the resulting JSON files to CSV using a free online converter. Merged `interactions.csv` and `interactors.csv` into `zfish_psicquic_results.csv` using `scripts/ZebrafishDataMerging.Rmd`. Some UniProt IDs were found from the IntAct entry using the IntAct ID as documented in the Rmd.
- `zfish_id_mapper.tsv` merged into `zfish_interactome_Mar12_2024.txt` (Source)
  - Retrieved updated UniProt entries and common names for 11,765 entries. 2,781 protein entries were found to be obsolete and thus did not have a name available on UniProt. These were removed and separated into their own dataset.
  - The resulting dataset had 6,438 unique proteins.
- `zfish_gene_names.tsv` merged into `zfish_interactome_Mar12_2024.txt` (Source)
  - Retrieved gene names for 6,438 D. rerio proteins (`zfish_unique_protein_ids_Mar12_24.txt`) from UniProt's name mapping service. For entries with a "gene name", the gene name was used as the name; for those without a gene name, the first portion of the "protein name" was used. This was decided to ensure uniqueness for the node names in the user interface.
- Merged all D. rerio data together into one master file using the instructions in `scripts/ZebrafishDataMerging.Rmd`.
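The node-naming rule described above (gene name when present, otherwise the first portion of the protein name) can be sketched as follows. The source does not define "first portion", so this example assumes it means the text before the first parenthesised alternative name; `display_name` and the sample inputs are hypothetical.

```python
def display_name(gene_name, protein_name):
    """Pick a node label: the gene name if present, else a shortened protein name."""
    gene_name = (gene_name or "").strip()
    if gene_name:
        return gene_name
    # Assumption: "first portion" = everything before the first " (".
    return (protein_name or "").split(" (")[0].strip()


# Hypothetical inputs for illustration only.
with_gene = display_name("abc1", "Example protein (Alternative name)")
without_gene = display_name("", "Example protein (Alternative name)")
```

In this sketch `with_gene` is the gene name and `without_gene` falls back to `"Example protein"`.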
GO association data:
`zfish_GO_data_Mar12_24.tsv` (Source)
- Used QuickGO to get all 65,876 "Reviewed" GO annotations for D. rerio. Replaced the " " in headers with "_" to ease data import.
2024-04-01:
- Added 86,304 inferred ProGo edges using a Cypher command.
2024-04-03:
GO association data:
zfish_GO_data_2024-04-03.tsv
- Removed qualifiers with "NOT" preceding them using `scripts/RemoveNotQualifier.R`.
- Reduced inferred ProGo edges to 86,216.
2024-06-11:
- Added alt_name parameter to Neo4j import statement.
2024-06-24:
- Removed trailing whitespace from some names according to `ZebrafishDataMerging.Rmd`.
- Removed "self-edges" from PPI data.
2024-07-30:
- `interactome_txid7955_2024-07-30.txt`: Added "link" and "source" properties to edges to link to STRING-DB or PubMed entries (if available) for ProPro interactions.
2024-07-31:
Regulatory data:
- Added genetic regulatory data (Source) and processed into `regulatory_txid7955_2024-07-31.txt` according to `SubColNames.R`.
  - Nodes: 10168
  - Reg Edges: 25960
Current D. rerio Network [Updated 2024-08-28]
| Nodes (All) | Nodes (PPI-Only) | Nodes (GRN-Only) | Nodes (Shared) | Interactions (ProPro) | Interactions (Reg) |
| ----------- | ---------------- | ---------------- | -------------- | --------------------- | ------------------ |
| 16606 | 2833 | 10168 | 3605 | 45003 | 25960 |
| GO Terms | Annotations (All) | Annotations (Direct) | Annotations (Inferred) |
| -------- | ----------------- | -------------------- | ---------------------- |
| 8321 | 133619 | 29065 | 104554 |
Caenorhabditis elegans Data Sources (TXID6239)
2024-08-28:
Interaction data:
- Added physical interaction data from WormBase (Source) and filtered to only UniProt verified proteins with physical interactions, producing `interactome-txid6239-2024_08_19.txt` according to `elegans.R`.
Regulatory data:
- Added genetic regulatory data (Source) and processed into `regulatory-txid6239-2024_08_19.txt` according to `elegans.R`.
GO association data:
- Added GO association data `elegans_go_annotation_2024-08-08.tsv` from WormBase (Source) and filtered to only UniProt verified proteins according to `elegans.R`.
Current C. elegans Network [Updated 2024-08-28]
| Nodes (All) | Nodes (PPI-Only) | Nodes (GRN-Only) | Nodes (Shared) | Interactions (ProPro) | Interactions (Reg) |
| ----------- | ---------------- | ---------------- | -------------- | --------------------- | ------------------ |
| 4106 | 411 | 1083 | 2612 | 13915 | 78223 |
| GO Terms | Annotations (All) | Annotations (Direct) | Annotations (Inferred) |
| -------- | ----------------- | -------------------- | ---------------------- |
| 7868 | 202845 | 42898 | 159947 |
Saccharomyces cerevisiae Data Sources (TXID559292)
2024-08-28:
Interaction data:
- Added physical interaction data (Source) and processed into `interactome-txid559292-2024_08_19.txt` according to `yeast.R`.
Regulatory data:
- Added genetic regulatory data (Source) and processed into `regulatory-txid559292-2024_08_19.txt` according to `yeast.R`.
GO association data:
- Added GO association data from QuickGO (Source) and processed into `yeast_go_annotation-2024-08-08.tsv` according to `yeast.R` using the UniProt namespace mapper.
Current S. cerevisiae Network [Updated 2024-08-28]
| Nodes (All) | Nodes (PPI-Only) | Nodes (GRN-Only) | Nodes (Shared) | Interactions (ProPro) | Interactions (Reg) |
| ----------- | ---------------- | ---------------- | -------------- | --------------------- | ------------------ |
| 7644 | 1092 | 858 | 5694 | 164432 | 237315 |
| GO Terms | Annotations (All) | Annotations (Direct) | Annotations (Inferred) |
| -------- | ----------------- | -------------------- | ---------------------- |
| 8300 | 328186 | 69760 | 258426 |
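The column names in the per-species summary tables suggest two identities: all nodes are the sum of PPI-only, GRN-only, and shared nodes, and all annotations are the sum of direct and inferred annotations. The short Python check below, using counts copied from the tables in this document, is one way to confirm the published numbers are consistent with that reading; the dictionary keys are names chosen for this example.

```python
# Counts copied from the "Current ... Network" tables in this document.
NETWORKS = {
    "D. melanogaster": {"all": 12823, "ppi_only": 7905, "grn_only": 1322, "shared": 3596,
                        "annotations": 492331, "direct": 98799, "inferred": 393532},
    "B. subtilis": {"all": 3163, "ppi_only": 484, "grn_only": 1230, "shared": 1449,
                    "annotations": 78015, "direct": 14384, "inferred": 63631},
    "D. rerio": {"all": 16606, "ppi_only": 2833, "grn_only": 10168, "shared": 3605,
                 "annotations": 133619, "direct": 29065, "inferred": 104554},
    "C. elegans": {"all": 4106, "ppi_only": 411, "grn_only": 1083, "shared": 2612,
                   "annotations": 202845, "direct": 42898, "inferred": 159947},
    "S. cerevisiae": {"all": 7644, "ppi_only": 1092, "grn_only": 858, "shared": 5694,
                      "annotations": 328186, "direct": 69760, "inferred": 258426},
}


def check_network(counts):
    """Return True when node and annotation subtotals add up to the totals."""
    nodes_ok = counts["all"] == counts["ppi_only"] + counts["grn_only"] + counts["shared"]
    annotations_ok = counts["annotations"] == counts["direct"] + counts["inferred"]
    return nodes_ok and annotations_ok


results = {species: check_network(counts) for species, counts in NETWORKS.items()}
```

A check like this could be rerun after each data update to catch a table that was edited in only one column.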
Gene Ontology Hierarchy Data Sources
2023-09-29:
Common name:
- `go.obo` processed into `go.txt` (Source)
  - Used `wget` to download the file. Processed the file using `scripts/ParseOBOtoTXT.ipynb`.
Relationships:
- `go.obo` processed into `is_a_import.tsv`
  - Processed the file using `scripts/ParseOntologyRelationship.ipynb`.
- `go.obo` processed into `relationship_import.tsv`
  - Processed the file using `scripts/ParseOntologyRelationship.ipynb`.
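For readers unfamiliar with the OBO format, the sketch below shows one way "is_a" and "relationship" lines could be pulled out of `[Term]` stanzas. It is not the code in `scripts/ParseOntologyRelationship.ipynb`; the function name, the output tuples, and the `EX:` identifiers in the sample are placeholders, and trailing modifiers in braces are not handled.

```python
def parse_obo_relationships(obo_text):
    """Collect (child, parent) is_a pairs and (source, type, target) relationships."""
    is_a = []
    relationships = []
    current_id = None
    in_term = False
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            current_id = None
            continue
        if not in_term or not line:
            continue
        tag, _, value = line.partition(": ")
        value = value.split(" ! ")[0].strip()  # drop the trailing "! name" comment
        if tag == "id":
            current_id = value
        elif tag == "is_a" and current_id:
            is_a.append((current_id, value))
        elif tag == "relationship" and current_id:
            rel_type, _, target = value.partition(" ")
            relationships.append((current_id, rel_type, target))
    return is_a, relationships


# Placeholder stanzas; the EX: identifiers are not real GO terms.
sample = """[Term]
id: EX:0000002
name: child term
is_a: EX:0000001 ! parent term
relationship: part_of EX:0000003 ! whole term

[Typedef]
id: part_of
name: part of
"""
is_a_pairs, relationship_triples = parse_obo_relationships(sample)
```

The two returned lists correspond loosely to the rows one would write to `is_a_import.tsv` and `relationship_import.tsv`.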
2024-03-28:
- `goNeverAnnotate.txt` joined with `go.txt` into `go_2024-03-28.txt`
  - Joined the data together with `scripts/GeneOntologyNeverAnnotate.R`.
- `gocheck_do_not_annotate.txt` parsed from `gocheck_do_not_annotate.obo` (Source) using `scripts/ParseOBOtoTXT.ipynb` and merged into `go_2024-03-28.txt`.
2024-07-17:
- [`go_2024-07-17.obo`] processed into `go_2024-07-17.txt`, `is_a_import_2024-07-17.tsv`, and `relationship_import_2024-07-17.tsv`.
- [`is_a_import_2024-07-17.tsv`] created with `scripts/ParseOntologyRelationship.ipynb`.
- [`relationship_import_2024-07-17.tsv`] created with `scripts/ParseOntologyRelationship.ipynb`.
- [`go_2024-07-17.txt`] created with `scripts/ParseOBOtoTXT.ipynb` and `scripts/GeneOntologyNeverAnnotate.R`.
Gene Ontology Data Structure [Updated 2024-07-17]
| GO Terms | "is_a" Relationships (GoGo) |
| -------- | :-------------------------- |
| 42092 | 66168 |
Taxon ID source:
NCBI taxonomy browser: looked up each species name and got its taxon ID.
Versioning & Dates
2023-09-29 -- 2024-03-17 (BETA):
- Imported weighted D. melanogaster interactome and FlyBase annotations.
- Imported raw GO data and "is_a" relationships.
2024-03-18:
- Added D. rerio protein interactome and GO association data.
- Updated B. subtilis and D. melanogaster GO association networks with QuickGO data.
2024-03-28:
- Added blacklist indicator to GO term nodes that should never have an annotation.
2024-04-01:
- Added inferred ProGo edges from descendant ProGo edges. This means that proteins annotated to a specific GO term, such as Mbs to enzyme inhibitor activity, will also be annotated to that GO term's ancestors, such as molecular function inhibitor activity and molecular_function.
| Species | Inferred Edges |
| --------------- | :------------- |
| D. melanogaster | 415,493 |
| B. subtilis | 39,215 |
| D. rerio | 86,304 |
| Total | 541,012 |
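The propagation rule described above can be illustrated in Python, although the actual edges were created with a Cypher command. This sketch walks "is_a" links upward from each directly annotated term and emits a protein-to-ancestor pair for every ancestor not already annotated. The toy hierarchy uses only the three terms named in the example and may omit intermediate terms present in the real ontology.

```python
def ancestors(term, parents):
    """Return every ancestor of `term` reachable through is_a links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def infer_edges(direct_edges, parents):
    """Return (protein, term) pairs implied by descendants but not directly annotated."""
    direct = set(direct_edges)
    inferred = set()
    for protein, term in direct:
        for ancestor in ancestors(term, parents):
            if (protein, ancestor) not in direct:
                inferred.add((protein, ancestor))
    return inferred


# Simplified child -> parents map built from the terms named in the text.
parents = {
    "enzyme inhibitor activity": ["molecular function inhibitor activity"],
    "molecular function inhibitor activity": ["molecular_function"],
}
direct_edges = [("Mbs", "enzyme inhibitor activity")]
inferred = infer_edges(direct_edges, parents)
```

In this toy case Mbs gains two inferred edges, one to each ancestor, while its direct annotation is left untouched.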
2024-04-03:
- Removed "NOT" qualifiers (those that should not be explicitly annotated to the GO term due to experimental or other evidence) from all GO association datasets.
- Repropagated the "inferred_from_descendant" edges to ensure no false propagation of "NOT" qualifiers.
| Species | Inferred Edges |
| --------------- | :------------- |
| D. melanogaster | 413,704 |
| B. subtilis | 39,215 |
| D. rerio | 86,216 |
| Total | 539,135 |
2024-06-11:
- Added B. subtilis interaction data from STRING-DB and updated QuickGO annotations.
- Added alt_name parameters to B. subtilis and D. rerio nodes.
| Species | Inferred Edges |
| --------------- | :------------- |
| D. melanogaster | 413,704 |
| B. subtilis | 54,270 |
| D. rerio | 86,216 |
| Total | 554,190 |
2024-06-24:
- Removed trailing whitespace from D. rerio data.
- Removed "self-edges" (i.e., interactions between two copies of the same protein) to improve path algorithm performance.
- 309 "self-edges" were removed from the data from B. subtilis and D. rerio.
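Removing self-edges amounts to dropping any interaction whose two endpoints are the same protein. A minimal Python sketch follows, assuming edges are held as pairs of protein identifiers; the names are placeholders, and the actual cleanup was done in the project's own scripts.

```python
def remove_self_edges(edges):
    """Drop interactions whose source and target are the same protein."""
    return [(a, b) for a, b in edges if a != b]


# Placeholder protein names for illustration.
edges = [("protA", "protB"), ("protA", "protA"), ("protB", "protC")]
cleaned = remove_self_edges(edges)
removed_count = len(edges) - len(cleaned)
```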
2024-07-17:
- Updated full Gene Ontology dataset including hierarchy and descriptions of GO terms.
- The Gene Ontology removed 660 GO terms in the update. This resulted in the removal of previously existing edges to GO terms.
- Removed 28,571 ProGo edges in D. melanogaster.
- Removed 5,553 ProGo edges in B. subtilis.
- Removed 5,619 ProGo edges in D. rerio.
- Removed 2,140 GoGo edges.
- Removed 41,883 edges total.
2024-07-30:
- Added properties to ProPro edges that provide more information about the protein-protein interaction.
- D. melanogaster (TXID7227): added PubMed IDs.
- B. subtilis (TXID224308): added links to STRING-DB entries.
- D. rerio (TXID7955): added PubMed IDs to those available and STRING-DB links to those not.
2024-07-31:
- Added genetic regulatory interactions and sources for D. melanogaster, B. subtilis, and D. rerio.
  - D. melanogaster (TXID7227):
    - Nodes: 1,322
    - Reg Edges: 17,530
    - ProGo Edges: 10,460
  - B. subtilis (TXID224308):
    - Nodes: 1,230
    - Reg Edges: 5,634
    - ProGo Edges: 18,547
  - D. rerio (TXID7955):
    - Nodes: 10,168
    - Reg Edges: 25,960
    - ProGo Edges: 30,540
2024-08-19:
- Added data source as a column to all regulatory and physical datasets.
- Updated naming convention for datasets: [`type-txid-TODAY_DATE.txt`].
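A small helper can make the naming convention concrete. The sketch below assumes the date is written as `YYYY_MM_DD` and the taxon ID is prefixed with `txid`, matching file names such as `interactome-txid6239-2024_08_19.txt` that appear elsewhere in this log; `dataset_filename` is a hypothetical name.

```python
from datetime import date


def dataset_filename(data_type, taxon_id, on=None):
    """Build a dataset file name following the type-txid-date pattern."""
    on = on or date.today()  # default to today's date, as the convention implies
    return f"{data_type}-txid{taxon_id}-{on:%Y_%m_%d}.txt"


example = dataset_filename("interactome", 6239, date(2024, 8, 19))
```

Here `example` reproduces the C. elegans interactome file name listed in this document.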
2024-08-28:
- Added two new species: C. elegans and S. cerevisiae.
  - C. elegans (TXID6239):
    - Nodes: 4,106
    - PPI Edges: 13,915
    - Reg Edges: 78,223
    - ProGo Edges: 202,845
  - S. cerevisiae (TXID559292):
    - Nodes: 7,644
    - PPI Edges: 164,432
    - Reg Edges: 237,315
    - ProGo Edges: 328,186