Trouble with matching GoSlim categories for analysis #2408

@hannahnowers

I am working on creating bar and pie charts comparing the number of sequences in each biological process category for the annotated sea star protein sequences I created using the annotation pipeline. Below I wrote out the workflow I thought would work for the bar charts. I am having trouble getting the biological process categories from my annotation_with_goslim.tsv file for the Pisaster proteins to match the biological process categories I can see in the summary.md file.

To create the bar chart, this is the process I had in mind:

  1. In my annotation_with_goslim.tsv, create a new column that lists the species. Do this for all three annotation_with_goslim.tsv files.
  2. Separate the GO-Slim terms into separate columns, because they are currently listed in one column separated by ";".
  3. Filter the data so I only have species and goslim_names.
  4. Merge those three data files so I have a data frame with two columns: species and goslim_names.
  5. Using that data frame, create a bar chart that counts the number of sequences in each biological process category for each species. The bar chart would have the biological processes on the y-axis and the number of sequences on the x-axis. Each species would be represented by a different color.
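The five steps above could be sketched roughly as follows. This is only a minimal sketch on toy data, not a tested version of the real analysis: the two small tibbles stand in for the real annotation_with_goslim.tsv files, the species labels and term values are made up, and `tidy_goslim()` is a hypothetical helper name. It splits the terms into one row per term (as the code further down does) rather than into separate columns, and it keeps `query` alongside `species` and `goslim_names` so that each sequence is counted once per category.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-ins for the annotation_with_goslim.tsv files (hypothetical values).
# With real data, each would come from readr::read_tsv() on the actual file.
annot_a <- tibble(query = c("a1", "a2", "a3"),
                  goslim_names = c("transport; signaling", "transport", NA))
annot_b <- tibble(query = c("b1", "b2"),
                  goslim_names = c("signaling", "transport; transport"))

# Hypothetical helper: add the species label, split terms, drop repeats
tidy_goslim <- function(annot, species_name) {
  annot %>%
    mutate(species = species_name) %>%               # step 1: species column
    filter(!is.na(goslim_names)) %>%                 # drop unannotated rows
    separate_rows(goslim_names, sep = ";") %>%       # step 2: one term per row
    mutate(goslim_names = trimws(goslim_names)) %>%  # remove stray spaces
    filter(goslim_names != "") %>%                   # drop empty terms
    distinct(species, query, goslim_names)           # step 3: keep needed columns only
}

# Step 4: stack the per-species tables into one data frame
combined <- bind_rows(
  tidy_goslim(annot_a, "species_A"),
  tidy_goslim(annot_b, "species_B")
)

# Step 5: count sequences per category and species, then plot
counts <- combined %>%
  count(species, goslim_names, name = "n_sequences")

p <- ggplot(counts, aes(x = n_sequences, y = goslim_names, fill = species)) +
  geom_col(position = "dodge") +
  labs(x = "Number of sequences", y = "Biological process (GO-Slim)")
```

Stacking with `bind_rows()` rather than joining keeps one row per species, sequence, and term, which is the shape `count()` and `geom_col()` expect.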

Here is the input data file I used for the pipeline: https://raw.githubusercontent.com/hannahnowers/Seastar-capstone/refs/heads/main/data/pisaster-protein.fa

Here is the code I am running for the Pisaster annotation file. I also have it uploaded to GitHub (https://github.com/hannahnowers/Seastar-capstone/blob/main/code/06-GoSlimTerms-Pipeline.Rmd):


library(ggplot2) #For plotting
library(tidyr)
library(readr)
library(dplyr)

# Creating data file
pisaster_anot <- read_tsv("output/pisaster_protein_annotation/annotation_with_goslim.tsv")

#Checking first rows of file
head(pisaster_anot)


#Cleaning Dataset
pisaster_slim <- pisaster_anot %>% select(query, goslim_names) #Creating data file with only query and goslim_names (look at query v accession number)

#Removing empty rows
pisaster_slim <- pisaster_slim %>% filter(!is.na(goslim_names)) #Removing the empty values. The summary.md file shows that only 31.1% of terms mapped to GO-Slim terms, so there are lots of empty values in the data set

#Separating GO-Slim terms into separate rows
pisaster_slim <- pisaster_slim %>% separate_rows(goslim_names, sep = ";") #Since there are multiple goslim terms per query, need to separate them

#Trim extra spaces
pisaster_slim$goslim_names <- trimws(pisaster_slim$goslim_names) #Some of the GO-Slim terms may have extra spaces in front of them

#Remove duplicates
pisaster_slim <- pisaster_slim %>% distinct(query, goslim_names) #Since there can be multiple hits per query, need to remove duplicates

#Count sequences per GO category
pisaster_counts <- pisaster_slim %>% count(goslim_names, sort=TRUE) #Check: does this match the summary.md file for pisaster? There should be multiple hits for each query, so need to make sure the numbers match

top15 <- pisaster_counts %>%
  slice_max(n, n = 15) #Not showing the same categories as summary.md
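
One possible check, sketched here on toy data only, is to count each category two ways: by number of rows (every hit counted) and by number of distinct queries (each sequence counted once). How summary.md computes its numbers is not known from this issue, so this is an assumption-laden diagnostic rather than an explanation of the mismatch. The toy values and the names `toy`, `long`, and `count_compare` are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical): query q1 appears twice, as if it had two hits
toy <- tibble(query = c("q1", "q1", "q2"),
              goslim_names = c("transport; signaling", "transport", "signaling"))

# One GO-Slim term per row, with surrounding spaces trimmed
long <- toy %>%
  separate_rows(goslim_names, sep = ";") %>%
  mutate(goslim_names = trimws(goslim_names))

# Count each category by rows (all hits) and by distinct queries (sequences)
count_compare <- long %>%
  group_by(goslim_names) %>%
  summarise(n_rows = n(),
            n_queries = n_distinct(query),
            .groups = "drop") %>%
  arrange(desc(n_rows))
```

If the two columns differ for the real data, comparing each of them against the counts in summary.md would show which counting rule, if either, the summary follows.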

These are the top 15 categories my code produces from the annotation_with_goslim.tsv file for the Pisaster proteins:

Image

This is what the summary.md file for the Pisaster proteins shows:

Image

Do you have any suggestions for what I am doing wrong?
