Trouble with matching GoSlim categories for analysis

I am creating bar and pie charts that compare the number of sequences in each biological process category for the annotated sea star protein sequences I generated with the annotation pipeline. Below is the workflow I thought would work for the bar charts. The problem is that the biological process categories I get from my annotation_with_goslim.tsv file for the Pisaster proteins do not match the biological process categories shown in the summary.md file.

To create the bar chart, this is the process I had in mind:

1. In my annotation_with_goslim.tsv, create a new column that lists the species. Do this for all three annotation_with_goslim.tsv files.
2. Split the GO-slim terms apart, because they are currently listed in a single column separated by ";".
3. Filter the data so that only the species and goslim_names columns remain.
4. Merge the three data files into one data frame with two columns: species and goslim_names.
5. Use that data frame to create a bar chart that counts the number of sequences in each biological process category for each species. The bar chart would have the biological processes on the y-axis and the number of sequences on the x-axis, with each species shown in a different color.
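The steps above can be sketched roughly as follows. This is only a sketch on made-up toy tables, not the real pipeline output: the species labels `species_b` and `species_c`, the helper `prep_goslim()`, and the toy values are all hypothetical. It splits the terms into rows rather than columns (which is what counting needs), and it keeps `query` alongside `species` and `goslim_names` so that each sequence is counted once per term.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-ins for the three annotation_with_goslim.tsv files (made-up values).
pisaster <- tibble(query = c("p1", "p2", "p3"),
                   goslim_names = c("signaling; transport", "transport", NA))
species_b <- tibble(query = c("b1", "b2"),
                    goslim_names = c("signaling", "transport; signaling"))
species_c <- tibble(query = c("c1"),
                    goslim_names = c("transport"))

# Steps 1-3: tag the species, split the ";"-separated terms into rows,
# and keep one row per (species, query, term) combination.
prep_goslim <- function(df, species_name) {
  df %>%
    filter(!is.na(goslim_names)) %>%
    separate_rows(goslim_names, sep = ";") %>%
    mutate(goslim_names = trimws(goslim_names),
           species = species_name) %>%
    filter(goslim_names != "") %>%
    distinct(species, query, goslim_names)
}

# Step 4: stack the three tables into one data frame.
all_slim <- bind_rows(
  prep_goslim(pisaster, "pisaster"),
  prep_goslim(species_b, "species_b"),
  prep_goslim(species_c, "species_c")
)

# Step 5: count sequences per term and species, then draw a dodged bar chart.
slim_counts <- all_slim %>% count(species, goslim_names, name = "n_seqs")

p <- ggplot(slim_counts, aes(x = n_seqs, y = goslim_names, fill = species)) +
  geom_col(position = "dodge") +
  labs(x = "Number of sequences", y = "GO-slim term", fill = "Species")
```

With the real files, each toy tibble would be replaced by the result of `read_tsv()` on the corresponding annotation_with_goslim.tsv.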

Here is the input data file I used for the pipeline: https://raw.githubusercontent.com/hannahnowers/Seastar-capstone/refs/heads/main/data/pisaster-protein.fa

Here is the code I am running for the Pisaster annotation file. It is also uploaded to GitHub: https://github.com/hannahnowers/Seastar-capstone/blob/main/code/06-GoSlimTerms-Pipeline.Rmd

```{r, Loading packages}

library(ggplot2) # For plotting
library(tidyr)   # For separate_rows()
library(readr)   # For read_tsv()
library(dplyr)   # For select(), filter(), distinct(), count(), slice_max()
```

```{r, Loading in Dataset}

# Read the Pisaster annotation table produced by the pipeline
pisaster_anot <- read_tsv("output/pisaster_protein_annotation/annotation_with_goslim.tsv")

# Check the first rows of the file
head(pisaster_anot)

```

```{r, Cleaning Data}

# Keep only the query ID and its GO-slim names (note: look at query vs. accession number)
pisaster_slim <- pisaster_anot %>% select(query, goslim_names)

# Drop rows with no GO-slim annotation. summary.md reports that only 31.1% of
# terms mapped to GO-slim terms, so the data set has many empty values.
pisaster_slim <- pisaster_slim %>% filter(!is.na(goslim_names))

# Split the ";"-separated GO-slim terms into one row per term
pisaster_slim <- pisaster_slim %>% separate_rows(goslim_names, sep = ";")

# Trim stray spaces around each term
pisaster_slim$goslim_names <- trimws(pisaster_slim$goslim_names)

# Keep one row per (query, term) pair, since a query can have multiple hits
pisaster_slim <- pisaster_slim %>% distinct(query, goslim_names)

# Count sequences per GO-slim category; this should match summary.md for Pisaster
pisaster_counts <- pisaster_slim %>% count(goslim_names, sort = TRUE)

# Top 15 categories (these are not the same categories that summary.md shows)
top15 <- pisaster_counts %>%
  slice_max(n, n = 15)

```
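One thing that can be checked on a small example is whether the totals change depending on whether rows or distinct sequences are counted. The sketch below uses a made-up toy table (not the real annotation file) in which one query appears twice, as if it had two hits. How summary.md tallies its categories is not known here, so this only shows the two counting conventions side by side.

```r
library(dplyr)
library(tidyr)

# Toy table (made-up values): query "q1" appears twice, as if it had two hits.
toy <- tibble(
  query = c("q1", "q1", "q2"),
  goslim_names = c("transport; signaling", "transport", "transport")
)

# One row per term, with surrounding spaces removed.
long <- toy %>%
  separate_rows(goslim_names, sep = ";") %>%
  mutate(goslim_names = trimws(goslim_names))

# Counting rows: q1 contributes to "transport" twice.
per_row <- long %>% count(goslim_names, name = "n_rows")

# Counting distinct sequences: each query contributes to a term at most once.
per_query <- long %>%
  distinct(query, goslim_names) %>%
  count(goslim_names, name = "n_queries")

# Side-by-side comparison of the two conventions.
comparison <- full_join(per_row, per_query, by = "goslim_names")
print(comparison)
```

If the numbers in summary.md line up with one of these two columns when run on the real data, that would suggest which convention the summary uses.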

These are the top 15 categories my code produces from the annotation_with_goslim.tsv file for the Pisaster proteins:

<img width="349" height="371" alt="Image" src="https://github.com/user-attachments/assets/a25df4d3-d348-40f3-b6aa-f871041eb970" />

This is what the summary.md file for the Pisaster proteins shows:

<img width="350" height="411" alt="Image" src="https://github.com/user-attachments/assets/838d8453-319b-464f-98b4-ca11343e12ae" />

Do you have any suggestions for what I am doing wrong?

