Title: | A Grammar of Graphics for Comparative Genomics |
---|---|
Description: | An extension of 'ggplot2' for creating complex genomic maps. It builds on the power of 'ggplot2' and 'tidyverse' adding new 'ggplot2'-style geoms & positions and 'dplyr'-style verbs to manipulate the underlying data. It implements a layout concept inspired by 'ggraph' and introduces tracks to bring tidiness to the mess that is genomics data. |
Authors: | Thomas Hackl [aut, cre], Markus J. Ankenbrand [aut], Bart van Adrichem [aut], Kristina Haslinger [ctb, sad] |
Maintainer: | Thomas Hackl <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.1 |
Built: | 2025-03-01 06:54:35 UTC |
Source: | https://github.com/thackl/gggenomes |
Add different types of tracks
add_feats(x, ...)
add_links(x, ..., .adjacent_only = TRUE)
add_subfeats(x, ..., .track_id = "genes", .transform = "aa2nuc")
add_sublinks(x, ..., .track_id = "genes", .transform = "aa2nuc")
add_clusters(x, ..., .track_id = "genes")
x |
object to add the tracks to (e.g. gggenomes, gggenomes_layout) |
... |
named data.frames, i.e. genes=gene_df, snps=snp_df |
.adjacent_only |
indicate whether links should be drawn only between vertically adjacent tracks |
.track_id |
track_id of the feats that subfeats, sublinks or clusters map to. |
.transform |
one of "aa2nuc", "none", "nuc2aa" |
gggenomes object with added features
add_feats(): Add feature annotations to sequences.
add_links(): Add links connecting sequences, such as whole-genome alignment data.
add_subfeats(): Add features of features, such as gene/protein domains, blast hits to genes/proteins, etc.
add_sublinks(): Add links that connect features, such as protein-protein alignments connecting genes.
add_clusters(): Add gene clusters or other feature groups. Takes a data.frame with at least two required columns, cluster_id and feat_id. The data.frame is converted to a link track connecting features belonging to the same cluster over their entire length. Additionally, the data.frame is joined to the parent feature track, adding cluster_id and all additional columns to the parent table.
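The two steps described for add_clusters() can be pictured with a small base R sketch. This is not the package's implementation: the toy tables, the all-pairs linking rule, and the names clusters, genes, links and genes_annot are assumptions made purely for illustration.

```r
# Toy cluster table with the two required columns (made-up data)
clusters <- data.frame(
  cluster_id = c("c1", "c1", "c1", "c2", "c2"),
  feat_id    = c("g1", "g2", "g3", "g4", "g5"),
  stringsAsFactors = FALSE
)

# Toy parent feature track (made-up data); g6 belongs to no cluster
genes <- data.frame(
  feat_id = paste0("g", 1:6),
  seq_id  = c("A", "A", "B", "B", "C", "C"),
  stringsAsFactors = FALSE
)

# Step 1: turn cluster membership into feature-to-feature links.
# Assumption: every pair of features in the same cluster gets one link.
pairs <- merge(clusters, clusters, by = "cluster_id", suffixes = c("", "2"))
links <- pairs[pairs$feat_id < pairs$feat_id2, ]

# Step 2: join cluster information onto the parent feature table,
# keeping features without a cluster (their cluster_id becomes NA).
genes_annot <- merge(genes, clusters, by = "feat_id", all.x = TRUE)

print(links)
print(genes_annot)
```

Whether the package links all pairs or only a subset of them is not stated here, so treat the pairing rule as a placeholder.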
# Add some repeat annotations
gggenomes(seqs = emale_seqs) %>%
  add_feats(repeats = emale_tirs) +
  geom_seq() + geom_feat()

# Add all-vs-all whole-genome alignments
gggenomes(seqs = emale_seqs) %>%
  add_links(links = emale_ava) +
  geom_seq() + geom_link()

# Add domains to genes
genes <- tibble::tibble(seq_id = "A", start = 100, end = 200, feat_id = "gene1")
domains <- tibble::tibble(feat_id = "gene1", start = 40, end = 80)
gggenomes(genes = genes) %>% add_subfeats(domains, .transform = "none") +
  geom_gene() + geom_feat()

# Add protein-protein alignments
gggenomes(emale_genes) %>%
  add_sublinks(emale_prot_ava) +
  geom_gene() + geom_link()

# add clusters
gggenomes(emale_genes, emale_seqs) %>%
  add_clusters(emale_cogs) %>%
  sync() + # works because clusters
  geom_link() + # become links
  geom_seq() +
  # works because cluster info is joined to gene track
  geom_gene(aes(fill = ifelse(is.na(cluster_id), NA,
    stringr::str_glue("{cluster_id} [{cluster_size}]")
  ))) +
  scale_fill_discrete("COGs")
Add seqs
add_seqs(x, seqs, ...)
x |
a gggenomes or gggenomes_layout object |
seqs |
the sequences to add |
... |
pass through to |
a gggenomes or gggenomes_layout object with added seqs
Check strand
check_strand(strand, na)
strand |
some representation for strandedness |
na |
what to use for |
strand vector with unknown values replaced by na
Combine strands
combine_strands(strand, strand2, ...)
strand |
first strand |
strand2 |
second strand |
... |
more strands |
the combined strand
For seamless reading of different file formats, gggenomes uses a mapping of
known formats to associated file extensions and contexts in which the
different formats can be read. The notion of context allows one to read
different information from the same format/extension. For example, a gbk file
holds both feature and sequence information. If read in "feats" context,
read_feats("*.gbk"), it will return a feature table; if read in "seqs"
context, read_seqs("*.gbk"), a sequence index.
def_formats(
  file = NULL,
  ext = NULL,
  context = NULL,
  parser = NULL,
  allow_na = FALSE
)
file |
a vector of file names |
ext |
a vector of file extensions |
context |
a vector of file contexts defined in
|
parser |
a vector of file parsers defined in
|
allow_na |
boolean |
dictionarish vector of file formats with recognized extensions as names
   format    ext                            context             parser
1  ambigious txt, tsv, csv                  NA                  read_ambigious
2  fasta     fa, fas, fasta, ffn, fna, faa  seqs                read_seq_len
3  fai       fai                            seqs                read_fai
4  gff3      gff, gff3, gff2, gtf           feats, seqs         read_gff3, read_seq_len
5  gbk       gbk, gb, gbff, gpff            feats, seqs         read_gbk, read_seq_len
6  bed       bed                            feats               read_bed
7  blast     m8, o6, o7                     feats, links        read_blast, read_blast
8  paf       paf                            feats, links        read_paf, read_paf
9  alitv     json                           feats, seqs, links  read_alitv_genes, read_alitv_seqs, read_alitv_links
10 vcf       vcf                            feats               read_vcf
# vector of defined zip formats and recognized extensions as names
# format of file
def_formats("foo.fa")

# formats associated with each extension
def_formats(ext = c("fa", "gff"))

# all formats/extensions that can be read in seqs context; includes formats
# that are defined for context=NA, i.e. that can be read in any context.
def_formats(context = "seqs")
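To make the idea of an extension-and-context lookup concrete, here is a minimal base R sketch built from a few rows of the format table above. The helper lookup_format() and the formats data frame are hypothetical and do not reflect how def_formats() is implemented; compressed files and context-free formats are deliberately left out.

```r
# A subset of the format table, with extensions and contexts as
# comma-separated strings (layout chosen for this sketch only)
formats <- data.frame(
  format  = c("fasta", "fai", "gff3", "gbk", "bed"),
  ext     = c("fa,fas,fasta,ffn,fna,faa", "fai", "gff,gff3,gff2,gtf",
              "gbk,gb,gbff,gpff", "bed"),
  context = c("seqs", "seqs", "feats,seqs", "feats,seqs", "feats"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: find the format of a file by its extension,
# optionally restricted to formats readable in the given context.
lookup_format <- function(file, context = NULL) {
  ext <- tolower(sub(".*\\.", "", basename(file)))
  has_ext <- vapply(strsplit(formats$ext, ","),
                    function(e) ext %in% e, logical(1))
  if (is.null(context)) {
    has_ctx <- TRUE
  } else {
    has_ctx <- vapply(strsplit(formats$context, ","),
                      function(cx) context %in% cx, logical(1))
  }
  hit <- formats$format[has_ext & has_ctx]
  if (length(hit) == 0) NA_character_ else hit
}

lookup_format("foo.fa")             # a fasta file
lookup_format("virus.gbk", "seqs")  # gbk can be read in seqs context
lookup_format("peaks.bed", "seqs")  # bed is feats-only in this table, so NA
```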
Intended to be used in readr::read_tsv()-like functions that accept a
col_names and a col_types argument.
def_names(format)
def_types(format)
format |
specify a format known to gggenomes, such as |
a vector with default column names for the given format
a vector with default column types for the given format
def_names(): default column names for defined formats
def_types(): default column types for defined formats
gff3     ccciicccc     seq_id,source,type,start,end,score,strand,phase,attributes
paf      ciiicciiiiid  seq_id,length,start,end,strand,seq_id2,length2,start2,end2,map_match,map_length,map_quality
blast    ccdiiiiiiidd  seq_id,seq_id2,pident,length,mismatch,gapopen,start,end,start2,end2,evalue,bitscore
bed      ciicdc        seq_id,start,end,name,score,strand
fai      ci---         seq_id,seq_desc,length
seq_len  cci           seq_id,seq_desc,length
vcf      cicccdccc     seq_id,start,feat_id,ref,alt,qual,filter,info,format
# read a blast-tabular file with read_tsv
readr::read_tsv(ex("emales/emales-prot-ava.o6"), col_names = def_names("blast"))
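The compact type strings in the table above follow the readr convention of one letter per column (c = character, i = integer, d = double, - = skip). The following base R sketch pairs the blast type string with its column names, both copied from the table; the variable names are arbitrary and nothing here calls gggenomes itself.

```r
# Values copied from the "blast" row of the table above
blast_types <- "ccdiiiiiiidd"
blast_names <- paste0(
  "seq_id,seq_id2,pident,length,mismatch,gapopen,",
  "start,end,start2,end2,evalue,bitscore"
)

# Split into one type code and one name per column
codes <- strsplit(blast_types, "")[[1]]
cols  <- strsplit(blast_names, ",")[[1]]

# readr's compact codes spelled out
type_key <- c(c = "character", i = "integer", d = "double", "-" = "skip")

spec <- data.frame(
  column = cols,
  type   = unname(type_key[codes]),
  stringsAsFactors = FALSE
)
print(spec)
```

A table like spec is a quick way to check that a custom col_names vector and a col_types string line up before passing them to a read_tsv()-like function.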
Drop feature layout
drop_feat_layout(x, keep = "strand")
x |
feat_layout |
keep |
features to keep |
feat_layout without unwanted features
Drop a genome layout
drop_layout(data, ...)
data |
layout |
... |
additional data |
gggenomes object without layout
Drop a link layout
drop_link_layout(x, keep = "strand")
x |
link_layout |
keep |
features to keep |
link_layout without unwanted features
Drop a seq layout
drop_seq_layout(x, keep = "strand")
x |
seq_layout |
keep |
features to keep |
seq_layout without unwanted features
One row per alignment block. Alignments were computed with minimap2.
emale_ava
A data frame with 125 rows and 23 columns
name of the file the data was read from
identifier of the sequence the feature appears on
length of the sequence
start of the feature on the sequence
end of the feature on the sequence
orientation of the feature relative to the sequence (+ or -)
identifier of the sequence the feature appears on
length of the sequence
start of the feature on the sequence
end of the feature on the sequence
see https://github.com/lh3/miniasm/blob/master/PAF.md for additional columns
Derived & bundled data: ex("emales/emales.paf")
One row per feature. Clusters are based on manual curation.
emale_cogs
A data frame with 48 rows and 3 columns
identifier of the cluster
identifier of the gene
number of features in the cluster
Derived & bundled data: ex("emales/emales-cogs.tsv")
One row per 50 bp window.
emale_gc
A data frame with 2856 rows and 6 columns
name of the file the data was read from
identifier of the sequence the feature appears on
start of the feature on the sequence
end of the feature on the sequence
name of the feature
relative GC-content of the window
Derived & bundled data: ex("emales/emales-gc.bed")
A data set containing gene feature annotations for 6 endogenous virophages found in the genomes of the marine protist Cafeteria burkhardae.
emale_genes
A data frame with 143 rows and 17 columns
name of the file the data was read from
identifier of the sequence the feature appears on
start of the feature on the sequence
end of the feature on the sequence
reading orientation relative to sequence (+ or -)
feature type (CDS, mRNA, gene, ...)
unique identifier of the feature
a list column with internal intron start/end positions
a list column with parent IDs - feat_id's of parent features
source of the annotation
score of the annotation
For "CDS" features indicates where the next codon begins relative to the 5' start
width of the feature
relative GC-content of the feature
name of the feature
an identifier telling which features should be plotted as one item (usually CDS and mRNA of same gene)
Publication: doi:10.1101/2020.11.30.404863
Raw data: https://github.com/thackl/cb-emales
Derived & bundled data: ex("emales/emales.gff")
Integrated Ngaro retrotransposons of 6 EMALE genomes
emale_ngaros
A data frame with 3 rows and 14 columns
name of the file the data was read from
identifier of the sequence the feature appears on
start of the feature on the sequence
end of the feature on the sequence
orientation of the feature relative to the sequence (+ or -)
feature type (CDS, mRNA, gene, ...)
unique identifier of the feature
a list column with internal intron start/end positions
a list column with parent IDs - feat_id's of parent features
source of the annotation
score of the annotation
For "CDS" features indicates where the next codon begins relative to the 5' start
name of the feature
an identifier telling which features should be plotted as one item (usually CDS and mRNA of same gene)
Publication: doi:10.1101/2020.11.30.404863
Raw data: https://github.com/thackl/cb-emales
Derived & bundled data: ex("emales/emales-ngaros.gff")
One row per alignment. Alignments were computed with mmseqs2 (blast-like).
emale_prot_ava
A data frame with 827 rows and 13 columns
name of the file the data was read from
identifier of the first feature in the alignment
identifier of the second feature in the alignment
see https://github.com/seqan/lambda/wiki/BLAST-Output-Formats for BLAST-tabular format columns
Derived & bundled data: ex("emales/emales-prot-ava.o6")
A data set containing the sequence information on 6 endogenous virophages found in the genomes of the marine protist Cafeteria burkhardae.
emale_seqs
A data frame with 6 rows and 4 columns
name of the file the data was read from
sequence identifier
sequence description
length of the sequence
Publication: doi:10.1101/2020.11.30.404863
Raw data: https://github.com/thackl/cb-emales
Derived & bundled data: ex("emales/emales.fna")
Terminal inverted repeats of 6 EMALE genomes
emale_tirs
A data frame with 3 rows and 14 columns
name of the file the data was read from
identifier of the sequence the feature appears on
start of the feature on the sequence
end of the feature on the sequence
reading orientation relative to sequence (+ or -)
feature type (CDS, mRNA, gene, ...)
unique identifier of the feature
a list column with internal intron start/end positions
a list column with parent IDs - feat_id's of parent features
source of the annotation
score of the annotation
For "CDS" features indicates where the next codon begins relative to the 5' start
name of the feature
end-start+1
an identifier telling which features should be plotted as one item (usually CDS and mRNA of same gene)
Publication: doi:10.1101/2020.11.30.404863
Raw data: https://github.com/thackl/cb-emales
Derived & bundled data: ex("emales/emales-tirs.gff")
Get path to gggenomes example files
ex(file = NULL)
file |
name of example file |
path to example file
geom_* calls
Track selection works like dplyr::pull() and supports unquoted ids and
positional arguments. ... can be used to subset the data in dplyr::filter()
fashion. pull-prefixed variants return the specified track from a gggenomes
object. Unprefixed variants work inside geom_* calls.
feats(.track_id = 1, ..., .ignore = "genes", .geneify = FALSE)
feats0(.track_id = 1, ..., .ignore = NA, .geneify = FALSE)
genes(..., .gene_types = c("CDS", "mRNA", "tRNA", "tmRNA", "ncRNA", "rRNA"))
links(.track_id = 1, ..., .ignore = NULL, .adjacent_only = NULL)
seqs(...)
bins(..., .group = vars())
track(.track_id = 1, ..., .track_type = NULL, .ignore = NULL)
pull_feats(.x, .track_id = 1, ..., .ignore = "genes", .geneify = FALSE)
pull_genes(
  .x,
  ...,
  .gene_types = c("CDS", "mRNA", "tRNA", "tmRNA", "ncRNA", "rRNA")
)
pull_links(.x, .track_id = 1, ..., .ignore = NULL, .adjacent_only = NULL)
pull_seqs(.x, ...)
pull_bins(.x, ..., .group = vars())
## S3 method for class 'gggenomes_layout'
pull_bins(.x, ..., .group = vars())
pull_track(.x, .track_id = 1, ..., .track_type = NULL, .ignore = NULL)
.track_id |
The track to pull out, either as a literal variable name or as a positive/negative integer giving the position from the left/right. |
... |
Logical predicates passed on to dplyr::filter. "seqs", "feats", "links". Affects position-based selection. |
.ignore |
track names to ignore when selecting by position.
Default is "genes", if using |
.geneify |
add dummy type, introns and geom_id column to play nicely with geoms supporting multi-level and spliced gene models. |
.gene_types |
return only feats of this type ( |
.adjacent_only |
filter for links connecting direct neighbors
( |
.group |
what variables to use in grouping of bins from seqs in addition
to |
.track_type |
restrict to these types of tracks - any combination of "seqs", "feats", "links". |
.x |
A gggenomes or gggenomes_layout object. |
A function that pulls the specified track from a gggenomes object.
feats(): by default pulls out the first feat track not named "genes".
feats0(): by default pulls out the first feat track.
genes(): pulls out the first feat track (genes), filtering for records with type=="CDS", and adding a dummy gene_id column if missing to play nice with multi-exon geoms.
links(): by default pulls out the first link track.
seqs(): pulls out the seqs track (there is only one).
bins(): pulls out a binwise summary table of the seqs data powering geom_bin_*() calls. The bin table is not a real track, but recomputed on-the-fly.
track(): pulls from all tracks in order seqs, feats, links.
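The interplay of positional selection and .ignore can be illustrated with a short base R sketch. The helper pick_track() is hypothetical and only mimics the documented behaviour (ids by name, positive positions from the left, negative positions from the right, ignored names skipped when counting positions); it is not how gggenomes resolves tracks internally.

```r
# Hypothetical helper: choose one id from a vector of track ids
pick_track <- function(track_ids, id = 1, ignore = "genes") {
  # selection by name bypasses the ignore list
  if (is.character(id)) {
    stopifnot(id %in% track_ids)
    return(id)
  }
  # positions are counted among the tracks that are not ignored
  candidates <- setdiff(track_ids, ignore)
  # negative positions count from the right
  if (id < 0) id <- length(candidates) + id + 1
  stopifnot(id >= 1, id <= length(candidates))
  candidates[id]
}

feat_tracks <- c("genes", "feats")        # toy feat track ids

pick_track(feat_tracks)                   # first track not named "genes"
pick_track(feat_tracks, "feats")          # by id
pick_track(feat_tracks, 2, ignore = NULL) # by position, nothing ignored
pick_track(feat_tracks, -1)               # last candidate from the right
```

All four calls select the track named "feats", which mirrors the "all equivalent" comment in the package's own example for pull_feats().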
gg <- gggenomes(emale_genes, emale_seqs, emale_tirs, emale_ava)
gg %>% track_info() # info about track ids, positions and types

# get first feat track that isn't "genes" (all equivalent)
gg %>% pull_feats() # easiest
gg %>% pull_feats(feats) # by id
gg %>% pull_feats(1) # by position
gg %>% pull_feats(2, .ignore = NULL) # default .ignore="genes"

# get "seqs" track (always track #1)
gg %>% pull_seqs()

# plot integrated transposons and GC content for some viral genomes
gg <- gggenomes(seqs = emale_seqs, feats = list(emale_ngaros, GC = emale_gc))
gg + geom_seq() +
  geom_feat(color = "skyblue") + # defaults to data=feats()
  geom_line(aes(x, y + score - .6, group = y), data = feats(GC), color = "gray60")
flip and flip_seqs reverse-complement specified bins or individual
sequences and their features. sync automatically flips bins using a
heuristic that maximizes the amount of forward strand links between
neighboring bins.
flip(x, ..., .bin_track = seqs)
flip_seqs(x, ..., .bins = everything(), .seq_track = seqs, .bin_track = seqs)
sync(x, link_track = 1, min_support = 0)
x |
a gggenomes object |
... |
bins or sequences to flip in dplyr::select like syntax (numeric position or unquoted expressions) |
.bin_track , .seq_track
|
when using a function as selector such as
|
.bins |
preselection of bins with sequences to flip. Useful if selecting
by numeric position. It sets the context for selection; for example, the
11th sequence of the total set might be more easily described as the 2nd
sequence of the 3rd bin: |
link_track |
the link track to use for flipping bins nicely |
min_support |
only flip a bin if at least this many more nucleotides support an inversion over the given orientation |
For more details see the help vignette:
vignette("flip", package = "gggenomes")
a gggenomes object with flipped bins or sequences
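One way to picture the kind of heuristic sync() is described as using is a greedy top-to-bottom pass over neighboring bins. The sketch below is an assumption-laden illustration, not the package's algorithm: the function sync_sketch(), the toy links table (columns bin, length, strand), and the strict "greater than min_support" comparison are all made up for this example.

```r
# Hypothetical greedy pass: decide for each bin whether to flip it,
# based on links to the bin directly above it.
# links$bin == i means "links between bin i-1 and bin i".
sync_sketch <- function(links, n_bins, min_support = 0) {
  flipped <- logical(n_bins)
  for (i in seq_len(n_bins)[-1]) {
    l <- links[links$bin == i, ]
    fwd <- sum(l$length[l$strand == "+"])
    rev <- sum(l$length[l$strand == "-"])
    # if the bin above was flipped, link orientations are inverted
    if (flipped[i - 1]) {
      tmp <- fwd
      fwd <- rev
      rev <- tmp
    }
    # flip only if reverse support exceeds forward support by min_support
    flipped[i] <- (rev - fwd) > min_support
  }
  flipped
}

# Toy data: bin 2 is mostly inverted relative to bin 1,
# bin 3 is mostly inverted relative to bin 2 (so it matches bin 1)
links <- data.frame(
  bin    = c(2, 2, 3, 3),
  length = c(100, 900, 200, 800),
  strand = c("+", "-", "+", "-"),
  stringsAsFactors = FALSE
)

sync_sketch(links, n_bins = 3)                      # only bin 2 is flipped
sync_sketch(links, n_bins = 3, min_support = 1000)  # threshold too high, no flips
```

Raising min_support makes the pass more conservative, which matches the documented intent of only flipping when enough nucleotides support an inversion.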
library(patchwork)
p <- gggenomes(genes = emale_genes) +
  geom_seq(aes(color = strand), arrow = TRUE) +
  geom_link(aes(fill = strand)) +
  expand_limits(color = c("-")) +
  labs(caption = "not flipped")

# nothing flipped
p0 <- p %>% add_links(emale_ava)

# flip manually
p1 <- p %>%
  add_links(emale_ava) %>%
  flip(4:6) + labs(caption = "manually")

# flip automatically based on genome-genome links
p2 <- p %>%
  add_links(emale_ava) %>%
  sync() + labs(caption = "genome alignments")

# flip automatically based on protein-protein links
p3 <- p %>%
  add_sublinks(emale_prot_ava) %>%
  sync() + labs(caption = "protein alignments")

# flip automatically based on genes linked implicitly by belonging
# to the same clusters of orthologs (or any grouping of your choice)
p4 <- p %>%
  add_clusters(emale_cogs) %>%
  sync() + labs(caption = "shared orthologs")

p0 + p1 + p2 + p3 + p4 + plot_layout(nrow = 1, guides = "collect")
Flip strand
flip_strand(strand, na = NA)
strand |
some representation for strandedness |
na |
what to use for |
the strand flipped
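The strand helpers (checking, combining and flipping strands) can be thought of as simple sign arithmetic. The base R sketch below assumes a +1/-1 integer encoding and uses hypothetical helper names (as_strand_int, combine_int, flip_int); the accepted input values and the internal representation used by gggenomes may differ.

```r
# Hypothetical normalizer: map common strand spellings to +1 / -1,
# replacing anything unrecognized with `na`
as_strand_int <- function(strand, na = NA_integer_) {
  s <- as.character(strand)
  out <- rep(NA_integer_, length(s))
  out[s %in% c("+", "1", "TRUE")] <- 1L
  out[s %in% c("-", "-1", "FALSE")] <- -1L
  out[is.na(out)] <- na
  out
}

# Combining strands: two reversals cancel out, so multiply the signs
combine_int <- function(...) Reduce(`*`, list(...))

# Flipping a strand: negate the sign
flip_int <- function(strand) -strand

s1 <- as_strand_int(c("+", "-", "."))           # unknown stays NA
s2 <- as_strand_int(c("+", "-", "."), na = 1L)  # unknown treated as forward

combine_int(c(1L, -1L), c(-1L, -1L))  # forward x reverse, reverse x reverse
flip_int(c(1L, -1L))
```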
Show loci containing features of interest. Loci can either be provided
as predefined regions directly (loci=), or are constructed automatically
based on pre-selected features (via ...). Features within max_dist are
greedily combined into the same locus. locate() adds these loci as a new
track so that they can be easily visualized. focus() extracts those loci
from their parent sequences, making them the new sequence set. These sequences
will have their locus_id as their new seq_id.
focus(
  x,
  ...,
  .track_id = 2,
  .max_dist = 10000,
  .expand = 5000,
  .overhang = c("drop", "trim", "keep"),
  .locus_id = str_glue("{seq_id}_lc{row_number()}"),
  .locus_id_group = seq_id,
  .locus_bin = c("bin", "seq", "locus"),
  .locus_score = n(),
  .locus_filter = TRUE,
  .loci = NULL
)
locate(
  x,
  ...,
  .track_id = 2,
  .max_dist = 10000,
  .expand = 5000,
  .locus_id = str_glue("{seq_id}_lc{row_number()}"),
  .locus_id_group = .data$seq_id,
  .locus_bin = c("bin", "seq", "locus"),
  .locus_score = n(),
  .locus_filter = TRUE,
  .locus_track = "loci"
)
x |
A gggenomes object |
... |
Logical predicates defined in terms of the variables in the track
given by The arguments in ‘...’ are automatically quoted and evaluated in the context of the data frame. They support unquoting and splicing. See ‘vignette("programming")’ for an introduction to these concepts. |
.track_id |
the track to filter from - defaults to first feature track, usually "genes". Can be a quoted or unquoted string or a positional argument giving the index of a track among all tracks (seqs, feats & links). |
.max_dist |
Maximum distance between adjacent features to be included into the same locus, default 10kb. |
.expand |
The number of nucleotides by which to expand the focus around the target features. Default 5kb. Give two values for different up- and downstream expansions. |
.overhang |
How to handle features overlapping the locus boundaries (including expand). Options are to "keep" them, "trim" them exactly at the boundaries, or "drop" all features not fully included within the boundaries. |
.locus_id , .locus_id_group
|
How to generate the ids for the new loci
which will eventually become their new |
.locus_bin |
What bin to assign new locus to. Defaults to keeping the original binning, but can be set to the "seq" to bin all loci originating from the same parent sequence, or to "locus" to separate all loci into individual bins. |
.locus_score |
An expression evaluated in the context of all features
that are combined into a new locus. Results are stored in the column
|
.locus_filter |
A predicate expression used to post-filter identified
loci. Set |
.loci |
A data.frame specifying loci directly. Required columns are
|
.locus_track |
The name of the new track containing the identified loci. |
A gggenomes object focused on the desired loci
A gggenomes object with the new loci track added
focus(): Identify regions of interest and zoom in on them
locate(): Identify regions of interest and add them as new feature track
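The greedy grouping of nearby features into loci can be sketched in base R as follows. This is an illustration under stated assumptions, not the code behind locate() or focus(): the function locate_sketch() and the toy feature table are made up, the expansion is clipped at position 1 but not at the sequence end (sequence lengths are not part of the toy data), and only the defaults for the id pattern ("{seq_id}_lc{row_number()}") and the score (feature count) are imitated.

```r
# Hypothetical helper: merge features closer than max_dist into loci
locate_sketch <- function(feats, max_dist = 10000, expand = 5000) {
  feats <- feats[order(feats$seq_id, feats$start), ]
  locus <- integer(nrow(feats))
  cur <- 0L
  cur_end <- -Inf
  cur_seq <- NA_character_
  for (i in seq_len(nrow(feats))) {
    # start a new locus on a new sequence or after a gap > max_dist
    new_locus <- is.na(cur_seq) || feats$seq_id[i] != cur_seq ||
      feats$start[i] - cur_end > max_dist
    if (new_locus) {
      cur <- cur + 1L
      cur_seq <- feats$seq_id[i]
      cur_end <- feats$end[i]
    } else {
      cur_end <- max(cur_end, feats$end[i])
    }
    locus[i] <- cur
  }
  # summarize each locus and expand its boundaries
  loci <- do.call(rbind, lapply(split(feats, locus), function(d) {
    data.frame(
      seq_id = d$seq_id[1],
      start  = max(1, min(d$start) - expand),
      end    = max(d$end) + expand,
      n      = nrow(d),
      stringsAsFactors = FALSE
    )
  }))
  # number loci within each sequence, like "{seq_id}_lc{row_number()}"
  idx <- ave(seq_along(loci$seq_id), loci$seq_id, FUN = seq_along)
  loci$locus_id <- paste0(loci$seq_id, "_lc", idx)
  loci
}

# Toy features (made-up coordinates)
feats <- data.frame(
  seq_id = c("A", "A", "A", "B"),
  start  = c(1000, 8000, 30000, 500),
  end    = c(2000, 9000, 31000, 1500),
  stringsAsFactors = FALSE
)

loci <- locate_sketch(feats)
print(loci)
```

With the default 10 kb distance, the first two features on sequence A end up in one locus, while the third is far enough away to form its own.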
# Let's hunt some defense systems in marine SAGs
# read the genomes
s0 <- read_seqs(ex("gorg/gorg.fna.fai"))
s1 <- s0 %>%
  # strip trailing number from contigs to get bins
  dplyr::mutate(bin_id = stringr::str_remove(seq_id, "_\\d+$"))
# gene annotations from prokka
g0 <- read_feats(ex("gorg/gorg.gff.xz"))

# best hits to the PADS Arsenal database of prokaryotic defense-system genes
# $ mmseqs easy-search gorg.fna pads-arsenal-v1-prf gorg-pads-defense.o6 /tmp \
#     --greedy-best-hits
f0 <- read_feats(ex("gorg/gorg-pads-defense.o6"))
f1 <- f0 %>%
  # parse system/gene info
  tidyr::separate(seq_id2, into = c("seq_id2", "system", "gene"), sep = ",") %>%
  dplyr::filter(
    evalue < 1e-10, # get rid of some spurious hits
    # and let's focus just on a few systems for this example
    system %in% c("CRISPR-CAS", "DISARM", "GABIJA", "LAMASSU", "THOERIS")
  )

# plot the distribution of hits across full genomes
gggenomes(g0, s1, f1, wrap = 2e5) +
  geom_seq() + geom_bin_label() +
  scale_color_brewer(palette = "Dark2") +
  geom_point(aes(x = x, y = y, color = system), data = feats())

# highlight the regions containing hits
gggenomes(g0, s1, f1, wrap = 2e5) %>%
  locate(.track_id = feats) %>%
  identity() +
  geom_seq() + geom_bin_label() +
  scale_color_brewer(palette = "Dark2") +
  geom_feat(data = feats(loci), color = "plum3") +
  geom_point(aes(x = x, y = y, color = system), data = feats())

# zoom in on loci
gggenomes(g0, s1, f1, wrap = 5e4) %>%
  focus(.track_id = feats) +
  geom_seq() + geom_bin_label() +
  geom_gene() +
  geom_feat(aes(color = system)) +
  geom_feat_tag(aes(label = gene)) +
  scale_color_brewer(palette = "Dark2")
Put bin labels left of the sequences. nudge_left adds space relative to the
total bin width between the label and the seqs, by default 5%. expand_left
expands the plot to the left by 20% to make labels visible.
geom_bin_label(
  mapping = NULL,
  data = bins(),
  hjust = 1,
  size = 3,
  nudge_left = 0.05,
  expand_left = 0.2,
  expand_x = NULL,
  expand_aes = NULL,
  yjust = 0,
  ...
)
mapping |
Set of aesthetic mappings created by |
data |
The data to be displayed in this layer. There are three options: If A A |
hjust |
Moves the text horizontally |
size |
of the label |
nudge_left |
by this much relative to the widest bin |
expand_left |
by this much relative to the widest bin |
expand_x |
expand the plot to include this absolute x value |
expand_aes |
provide custom aes mappings for the expansion (advanced) |
yjust |
for multiline bins set to 0.5 to center labels on bins, and 1 to align labels to the bottom. |
... |
Other arguments passed on to
|
Set x and expand_x to an absolute position to align all labels at a
specific location.
Bin labels are added as a text layer/component to the plot.
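The relative meaning of nudge_left and expand_left can be worked through with plain arithmetic. The sketch below assumes that sequences start at x = 0 and that both values are fractions of the widest bin, as the argument descriptions suggest; the bin widths are made up and the exact positioning inside geom_bin_label() may differ.

```r
# Toy bin widths in bp (made-up values)
bin_widths <- c(binA = 20000, binB = 26000, binC = 23000)
widest <- max(bin_widths)

nudge_left  <- 0.05  # default: 5% of the widest bin
expand_left <- 0.2   # default: 20% of the widest bin

# Assumed positions, with sequences starting at x = 0
label_x   <- 0 - nudge_left * widest   # where the right-aligned labels end
plot_xmin <- 0 - expand_left * widest  # how far the plot reaches to the left

c(label_x = label_x, plot_xmin = plot_xmin)
```

Under these assumptions the room available for label text is the distance between plot_xmin and label_x, which is why long labels may need a larger expand_left.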
s0 <- read_seqs(list.files(ex("cafeteria"), "Cr.*\\.fa.fai$", full.names = TRUE))
s1 <- s0 %>% dplyr::filter(length > 5e5)

gggenomes(emale_genes) + geom_seq() + geom_gene() +
  geom_bin_label()

# make larger labels and extra room on the canvas
gggenomes(emale_genes) + geom_seq() + geom_gene() +
  geom_bin_label(size = 7, expand_left = .4)

# align labels for wrapped bins:
# top
gggenomes(seqs = s1, infer_bin_id = file_id, wrap = 5e6) +
  geom_seq() + geom_bin_label() + geom_seq_label()

# center
gggenomes(seqs = s1, infer_bin_id = file_id, wrap = 5e6) +
  geom_seq() + geom_bin_label(yjust = .5) + geom_seq_label()

# bottom
gggenomes(seqs = s1, infer_bin_id = file_id, wrap = 5e6) +
  geom_seq() + geom_bin_label(yjust = 1) + geom_seq_label()
Visualize data that varies along sequences as ribbons, lines, lineranges, etc.
geom_coverage(mapping = NULL, data = feats(), stat = "coverage",
  geom = "ribbon", position = "identity", na.rm = FALSE, show.legend = NA,
  inherit.aes = TRUE, offset = 0, height = 0.2, max = base::max, ...)

geom_wiggle(mapping = NULL, data = feats(), stat = "wiggle",
  geom = "ribbon", position = "identity", na.rm = FALSE, show.legend = NA,
  inherit.aes = TRUE, offset = 0, height = 0.8,
  bounds = Hmisc::smedian.hilow, ...)
mapping |
Set of aesthetic mappings created by |
data |
The data to be displayed in this layer. There are three options: If A A |
stat |
The statistical transformation to use on the data for this layer.
When using a
|
geom |
The geometric object to use to display the data for this layer.
When using a
|
position |
A position adjustment to use on the data for this layer. This
can be used in various ways, including to prevent overplotting and
improving the display. The
|
na.rm |
If |
show.legend |
logical. Should this layer be included in the legends?
|
inherit.aes |
If |
offset |
distance between seq center and wiggle mid/start. |
height |
distance in plot between lowest and highest point of the wiggle data. |
max |
geom_coverage uses the function base::max by default, which plots data in the positive direction. (base::min can also be used here when the input data are negative.) |
... |
Other arguments passed on to
|
bounds |
geom_wiggle uses mid, low and high boundary values for plotting wiggle data. Can be either a function returning those three values or a vector containing them. Defaults to Hmisc::smedian.hilow. |
geom_wiggle() plots the wiggle data in both directions around the median. geom_coverage() plots the data only in the positive direction. Both functions use data from the feats track.
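The difference between the two geoms can be sketched side by side. This is only an illustration: it assumes the emale_seqs and emale_gc example data shipped with gggenomes (with a numeric score column, as in the examples on this page), and it passes bounds as a plain vector of mid, low and high values computed up front instead of relying on the default Hmisc::smedian.hilow.

```r
library(gggenomes)
library(ggplot2)

# mid, low and high values for the wiggle, computed from the data
# (assumption: `emale_gc` has a numeric `score` column)
b <- c(
  median(emale_gc$score, na.rm = TRUE),
  min(emale_gc$score, na.rm = TRUE),
  max(emale_gc$score, na.rm = TRUE)
)

p0 <- gggenomes(seqs = emale_seqs, feats = emale_gc) + geom_seq()

# coverage: data drawn in one direction only
p_cov <- p0 + geom_coverage(aes(z = score), height = 0.4)

# wiggle: data drawn in both directions around the mid value
p_wig <- p0 + geom_wiggle(aes(z = score), bounds = b, height = 0.4)

p_cov
p_wig
```

Passing a fixed vector keeps the scaling identical across plots, which may be preferable when several figures need to be compared.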
A ggplot2 layer with coverage information.
geom_wiggle() and geom_coverage() understand aesthetics depending on the chosen underlying ggplot geom, by default ggplot2::geom_ribbon(). Other options that play well are, for example, ggplot2::geom_line(), ggplot2::geom_linerange() and ggplot2::geom_point(). The only required aesthetic is:
z
# Plotting data with geom_coverage with increased height.
gggenomes(seqs = emale_seqs, feats = emale_gc) +
  geom_coverage(aes(z = score), height = 0.5) +
  geom_seq()

# In opposite direction by calling base::min and taking the negative values of "score"
gggenomes(seqs = emale_seqs, feats = emale_gc) +
  geom_coverage(aes(z = -score), max = base::min, height = 0.5) +
  geom_seq()

# GC-content plotted as points with variable color in geom_coverage
gggenomes(seqs = emale_seqs, feats = emale_gc) +
  geom_coverage(aes(z = score, color = score), height = 0.5, geom = "point") +
  geom_seq()

# Plot varying GC-content along sequences as ribbon
gggenomes(seqs = emale_seqs, feats = emale_gc) +
  geom_wiggle(aes(z = score)) +
  geom_seq()

# customize color and position
gggenomes(genes = emale_genes, seqs = emale_seqs, feats = emale_gc) +
  geom_wiggle(aes(z = score), fill = "lavenderblush3", offset = -.3, height = .5) +
  geom_seq() + geom_gene()

# GC-content as line and with variable color
gggenomes(seqs = emale_seqs, feats = emale_gc) +
  geom_wiggle(aes(z = score, color = score), geom = "line", bounds = c(.5, 0, 1)) +
  geom_seq() +
  scale_colour_viridis_b(option = "A")

# or as lineranges
gggenomes(seqs = emale_seqs, feats = emale_gc) +
  geom_wiggle(aes(z = score, color = score), geom = "linerange") +
  geom_seq() +
  scale_colour_viridis_b(option = "A")
geom_feat() allows the user to draw (additional) features on the plot.
For example, specific regions within a sequence (e.g. transposons, introns, mutation hotspots)
can be highlighted by color, size, etc.
geom_feat(mapping = NULL, data = feats(), stat = "identity",
  position = "pile", na.rm = FALSE, show.legend = NA,
  inherit.aes = TRUE, ...)
mapping |
Set of aesthetic mappings created by |
data |
feat_layout: Uses first data frame stored in the |
stat |
The statistical transformation to use on the data for this layer.
When using a
|
position |
describes how the positions of different plotted features are adjusted. By default it uses |
na.rm |
If |
show.legend |
logical. Should this layer be included in the legends?
|
inherit.aes |
If |
... |
Other arguments passed on to
|
geom_feat uses ggplot2::geom_segment under the hood. As a result, different aesthetics such as alpha, linewidth, color, etc. can be called upon to modify the visualization of the data.
By default, the function uses the first feature track.
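The effect of the position argument is easiest to see on overlapping features. The following is a minimal sketch using a made-up data set (toy_feats is hypothetical and not part of the package); it contrasts the default "pile" position from the usage above with "identity", assuming that "pile" offsets overlapping features vertically.

```r
library(gggenomes)
library(ggplot2)

# hypothetical toy features: several of them overlap on the same sequence
toy_feats <- tibble::tibble(
  seq_id = c("A", "A", "A", "B", "B"),
  start  = c(0, 30, 15, 40, 20),
  end    = c(30, 90, 60, 60, 80)
)

p0 <- gggenomes(feats = toy_feats) + geom_seq() + geom_bin_label()

# default position: overlapping features are piled up
p_pile <- p0 + geom_feat(position = "pile", linewidth = 3)

# no adjustment: overlapping features are drawn on top of each other
p_identity <- p0 + geom_feat(position = "identity", linewidth = 3, alpha = 0.5)

p_pile
p_identity
```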
A ggplot2 layer with features.
# Plotting data from the feats' track with adjusted linewidth and color
gggenomes(seqs = emale_seqs, feats = emale_ngaros) +
  geom_seq() +
  geom_feat(linewidth = 5, color = "darkred")

# geom_feat can be called several times, as long as the data to use is specified
gggenomes(seqs = emale_seqs, feats = list(emale_ngaros, emale_tirs)) +
  geom_seq() +
  geom_feat(linewidth = 5, color = "darkred") + # uses first feature track
  geom_feat(data = feats(emale_tirs))

# Additional notes to feats can be added with functions such as: geom_feat_note / geom_feat_text
gggenomes(seqs = emale_seqs, feats = list(emale_ngaros, emale_tirs)) +
  geom_seq() +
  geom_feat(color = "darkred") +
  geom_feat(data = feats(emale_tirs), color = "darkblue") +
  geom_feat_note(data = feats(emale_ngaros), label = "repeat region", size = 4)

# Different position adjustments with a simple dataset
exampledata <- tibble::tibble(
  seq_id = c(rep("A", 3), rep("B", 3), rep("C", 3)),
  start = c(0, 30, 15, 40, 80, 20, 30, 50, 70),
  end = c(30, 90, 60, 60, 100, 80, 60, 90, 120)
)
gggenomes(feats = exampledata) +
  geom_feat(position = "identity", alpha = 0.5, linewidth = 0.5) +
  geom_bin_label()
The functions below are useful for labeling features/genes in plots.
Users have to specify aes(label = ...) or label = ... to define the label's text.
Depending on the function, the label is placed at a specific location:
geom_..._text() plots text in the middle of the feature.
geom_..._tag() plots text on top of the feature, at a 45 degree angle.
geom_..._note() plots text under the feature at the left side.
The ... can be replaced with either feat or gene, depending on which track the user wants to label.
With arguments such as hjust, vjust, angle, and nudge_y, the user can also manually change the position of the text.
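The three placements can be compared on a feature track as well. The sketch below is illustrative only: toy_feats and its name column are made up for this purpose, and the layers rely on the defaults listed in the usage of geom_feat_text(), geom_feat_tag() and geom_feat_note().

```r
library(gggenomes)
library(ggplot2)

# hypothetical toy features with a `name` column used as label text
toy_feats <- tibble::tibble(
  seq_id = c("A", "A", "B"),
  start  = c(10, 60, 30),
  end    = c(40, 90, 80),
  name   = c("repeat 1", "repeat 2", "repeat 3")
)

p0 <- gggenomes(feats = toy_feats) + geom_seq() + geom_feat(linewidth = 4)

p_text <- p0 + geom_feat_text(aes(label = name)) # middle of the feature
p_tag  <- p0 + geom_feat_tag(aes(label = name))  # on top, angled
p_note <- p0 + geom_feat_note(aes(label = name)) # below, at the left side

p_text
p_tag
p_note
```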
geom_feat_text(mapping = NULL, data = feats(), stat = "identity",
  position = "identity", ..., parse = FALSE, check_overlap = FALSE,
  na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)

geom_feat_tag(mapping = NULL, data = feats(), stat = "identity",
  position = "identity", hjust = 0, vjust = 0, angle = 45, nudge_y = 0.03,
  xjust = 0.5, strandwise = TRUE, ..., parse = FALSE, check_overlap = FALSE,
  na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)

geom_feat_note(mapping = NULL, data = feats(), stat = "identity",
  position = "identity", hjust = 0, vjust = 1, nudge_y = -0.03, xjust = 0,
  strandwise = FALSE, ..., parse = FALSE, check_overlap = FALSE,
  na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)

geom_gene_text(mapping = NULL, data = genes(), stat = "identity",
  position = "identity", ..., parse = FALSE, check_overlap = FALSE,
  na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)

geom_gene_tag(mapping = NULL, data = genes(), stat = "identity",
  position = "identity", hjust = 0, vjust = 0, angle = 45, nudge_y = 0.03,
  xjust = 0.5, strandwise = TRUE, ..., parse = FALSE, check_overlap = FALSE,
  na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)

geom_gene_note(mapping = NULL, data = genes(), stat = "identity",
  position = "identity", hjust = 0, vjust = 1, nudge_y = -0.03, xjust = 0,
  strandwise = FALSE, ..., parse = FALSE, check_overlap = FALSE,
  na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
mapping |
Set of aesthetic mappings created by |
data |
The data to be displayed in this layer. There are three options: If A A |
stat |
The statistical transformation to use on the data for this layer.
When using a
|
position |
A position adjustment to use on the data for this layer.
Cannot be jointly specified with
|
... |
Other arguments passed on to
|
parse |
If |
check_overlap |
If |
na.rm |
If |
show.legend |
logical. Should this layer be included in the legends?
|
inherit.aes |
If |
hjust |
Moves the text horizontally |
vjust |
Moves the text vertically |
angle |
Defines the angle in which the text will be placed. *Note |
nudge_y |
Moves the text vertically an entire contig/sequence.
(e.g. |
xjust |
Move text in x direction |
strandwise |
plotting of feature tags |
These labeling functions use ggplot2::geom_text() under the hood.
Any changes to the aesthetics of the text can be performed in a ggplot2 manner.
A ggplot2 layer with feature text.
A ggplot2 layer with feature tags.
A ggplot2 layer with feature notes.
A ggplot2 layer with gene text.
A ggplot2 layer with gene tags.
A ggplot2 layer with gene notes.
# example data
genes <- tibble::tibble(
  seq_id = c("A", "A", "A", "B", "B", "C"),
  start = c(20, 40, 80, 30, 10, 60),
  end = c(30, 70, 85, 40, 15, 90),
  feat_id = c("A1", "A2", "A3", "B1", "B2", "C1"),
  type = c("CDS", "CDS", "CDS", "CDS", "CDS", "CDS"),
  name = c("geneA", "geneB", "geneC", "geneA", "geneC", "geneB")
)
seqs <- tibble::tibble(
  seq_id = c("A", "B", "C"),
  start = c(0, 0, 0),
  end = c(100, 100, 100),
  length = c(100, 100, 100)
)

# basic plot creation
plot <- gggenomes(seqs = seqs, genes = genes) +
  geom_bin_label() +
  geom_gene()

# geom_..._text
plot + geom_gene_text(aes(label = name))

# geom_..._tag
plot + geom_gene_tag(aes(label = name))

# geom_..._note
plot + geom_gene_note(aes(label = name))

# with horizontal adjustment (`hjust`), vertical adjustment (`vjust`)
plot + geom_gene_text(aes(label = name), vjust = -2, hjust = 1)

# using `nudge_y` and `angle` adjustment
plot + geom_gene_text(aes(label = name), nudge_y = 1, angle = 10)

# labeling with manual input
plot + geom_gene_text(label = c("This", "is", "an", "example", "test", "test"))
Draw coding sequences, mRNAs and other non-coding features. Supports
multi-exon features. CDS and mRNAs in the same group are plotted together.
They can therefore also be positioned as a single unit using the position
argument.
geom_gene(mapping = NULL, data = genes(), stat = "identity",
  position = "identity", na.rm = FALSE, show.legend = NA, inherit.aes = TRUE,
  size = 2, rna_size = size, shape = size, rna_shape = shape,
  intron_shape = size,
  intron_types = c("CDS", "mRNA", "tRNA", "tmRNA", "ncRNA", "rRNA"),
  cds_aes = NULL, rna_aes = NULL, intron_aes = NULL, ...)
mapping |
Set of aesthetic mappings created by |
data |
The data to be displayed in this layer. There are three options: If A A |
stat |
The statistical transformation to use on the data for this layer.
When using a
|
position |
A position adjustment to use on the data for this layer. This
can be used in various ways, including to prevent overplotting and
improving the display. The
|
na.rm |
remove na values |
show.legend |
logical. Should this layer be included in the legends?
|
inherit.aes |
If |
size , rna_size
|
the size of the gene model, aka the height of the
polygons. |
shape , rna_shape
|
vector of height and width of the arrow tip, defaults
to size. If only one value is provided it is recycled. Set to 0 to
deactivate arrow-shaped tips. |
intron_shape |
single value controlling the kink of the intron line. Defaults to size. Set 0 for straight lines between exons. |
intron_types |
introns will only be computed/drawn for features with types listed here. Set to "CDS" to plot mRNAs as continuous features, and set to NA to completely ignore introns. |
cds_aes , rna_aes , intron_aes
|
overwrite aesthetics for different model
parts. Need to be wrapped in |
... |
passed to layer params |
A ggplot2 layer with genes.
geom_gene() understands the following aesthetics (required aesthetics are in bold):
Learn more about setting these aesthetics in vignette("ggplot2-specs").
'type' and 'group' (mapped to 'type' and 'geom_id' by default) power the proper recognition of CDS and their corresponding mRNAs so that they can be drawn as one composite object. Overwrite 'group' to plot CDS and mRNAs independently.
'introns' (mapped to 'introns') is used to compute intron/exon boundaries.
Use the parameter intron_types if you want to disable introns.
gggenomes(genes = emale_genes) +
  geom_gene()

gggenomes(genes = emale_genes) +
  geom_gene(aes(fill = as.numeric(gc_content)), position = "strand") +
  scale_fill_viridis_b()

g0 <- read_gff3(ex("eden-utr.gff"))
gggenomes(genes = g0) +
  # all features in the "genes" regardless of type
  geom_feat(data = feats(genes)) +
  annotate("text", label = "geom_feat", x = -15, y = .9) + xlim(-20, NA) +
  # only features in the "genes" of geneish type (implicit `data=genes()`)
  geom_gene() +
  geom_gene_tag(aes(label = ifelse(is.na(type), "<NA>", type)), data = genes(.gene_types = NULL)) +
  annotate("text", label = "geom_gene", x = -15, y = 1) +
  # control which types are returned from the track
  geom_gene(aes(y = 1.1), data = genes(.gene_types = c("CDS", "misc_RNA"))) +
  annotate("text", label = "gene_types", x = -15, y = 1.1) +
  # control which types can have introns
  geom_gene(
    aes(y = 1.2, yend = 1.2),
    data = genes(.gene_types = c("CDS", "misc_RNA")),
    intron_types = "misc_RNA"
  ) +
  annotate("text", label = "intron_types", x = -15, y = 1.2)

# spliced genes
library(patchwork)
gg <- gggenomes(genes = g0)
gg + geom_gene(position = "pile") +
  gg + geom_gene(aes(fill = type),
    position = "pile",
    shape = 0, intron_shape = 0, color = "white"
  ) +
  # some fine-control on cds/rna/intron after_scale aesthetics
  gg + geom_gene(aes(fill = geom_id),
    position = "pile",
    size = 2, shape = c(4, 3), rna_size = 2, intron_shape = 4, stroke = 0,
    cds_aes = aes(fill = "black"), rna_aes = aes(fill = fill),
    intron_aes = aes(colour = fill, stroke = 2)
  ) +
  scale_fill_viridis_d() +
  # fun with introns
  gg + geom_gene(aes(fill = geom_id), position = "pile", size = 3, shape = c(4, 4)) +
  gg + geom_gene(aes(fill = geom_id),
    position = "pile", size = 3, shape = c(4, 4),
    intron_types = c()
  ) +
  gg + geom_gene(aes(fill = geom_id),
    position = "pile", size = 3, shape = c(4, 4),
    intron_types = "CDS"
  )
These geom_..._label() functions enable the user to plot labels/text at individual features and/or links.
Users have to indicate how to label the features/links by specifying label = ... or aes(label = ...).
The position of labels can be adjusted with arguments such as vjust, hjust, angle, nudge_y, etc.
Also check out geom_bin_label(), geom_seq_label() or geom_feat_text(), given their resemblance.
geom_gene_label(mapping = NULL, data = genes(), angle = 45, hjust = 0,
  nudge_y = 0.1, size = 6, ...)

geom_feat_label(mapping = NULL, data = feats(), angle = 45, hjust = 0,
  nudge_y = 0.1, size = 6, ...)

geom_link_label(mapping = NULL, data = links(), angle = 0, hjust = 0.5,
  vjust = 0.5, size = 4, repel = FALSE, ...)
mapping |
Set of aesthetic mappings created by |
data |
The data to be displayed in this layer. There are three options: If A A |
angle |
Defines the angle in which the text will be placed. *Note |
hjust |
Moves the text horizontally |
nudge_y |
Moves the text vertically an entire contig/sequence.
(e.g. |
size |
of the label |
... |
Other arguments passed on to
|
vjust |
Moves the text vertically |
repel |
use ggrepel to avoid overlaps |
These labeling functions use ggplot2::geom_text() under the hood.
Any changes to the aesthetics of the text can be performed in a ggplot2 manner.
Gene labels are added as a text layer/component to the plot.
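As a rough illustration of the label functions, the sketch below labels genes by name with geom_gene_label(). The toy_genes and toy_seqs tables are made up for this example and mirror the structure of the example data used for geom_gene_text(); the smaller size value is an arbitrary choice.

```r
library(gggenomes)
library(ggplot2)

# hypothetical toy data: genes with a `name` column to use as label text
toy_genes <- tibble::tibble(
  seq_id  = c("A", "A", "B", "C"),
  start   = c(20, 40, 30, 60),
  end     = c(30, 70, 40, 90),
  feat_id = c("A1", "A2", "B1", "C1"),
  type    = c("CDS", "CDS", "CDS", "CDS"),
  name    = c("geneA", "geneB", "geneA", "geneB")
)
toy_seqs <- tibble::tibble(
  seq_id = c("A", "B", "C"),
  start  = c(0, 0, 0),
  end    = c(100, 100, 100),
  length = c(100, 100, 100)
)

p <- gggenomes(seqs = toy_seqs, genes = toy_genes) +
  geom_seq() +
  geom_gene() +
  # angle, hjust and nudge_y keep their defaults (45, 0, 0.1)
  geom_gene_label(aes(label = name), size = 3)

p
```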
Draws connections between genomes, such as genome/gene/protein
alignments and gene/protein clusters. geom_link() draws links as filled
polygons, geom_link_line() draws a single connecting line.
Note that by default only links between adjacent genomes are computed and
shown. To compute and show all links between all genomes, set
gggenomes(..., adjacent_only=FALSE).
geom_link(mapping = NULL, data = links(), stat = "identity",
  position = "identity", na.rm = FALSE, show.legend = NA,
  inherit.aes = TRUE, offset = 0.15, ...)

geom_link_line(mapping = NULL, data = links(), stat = "identity",
  position = "identity", na.rm = FALSE, show.legend = NA,
  inherit.aes = TRUE, ...)
mapping |
Set of aesthetic mappings created by |
data |
The data to be displayed in this layer. There are three options: If A A |
stat |
The statistical transformation to use on the data for this layer.
When using a
|
position |
A position adjustment to use on the data for this layer. This
can be used in various ways, including to prevent overplotting and
improving the display. The
|
na.rm |
If |
show.legend |
logical. Should this layer be included in the legends?
|
inherit.aes |
If |
offset |
distance between seq center and link start. Use two values
|
... |
Other arguments passed on to
|
The function calls upon the data stored within the link track.
Data frames added to this track have seq_id and seq_id2 as required
variables. Optional and recommended variables include start, start2,
end, end2, bin_id, bin_id2 and strand.
Note, when start/end is not specified, links will be created between the
entire contigs of seq_id and seq_id2.
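The expected shape of a links table can be sketched with a small hand-made example. The toy_seqs and toy_links tables below are hypothetical; they contain the required seq_id and seq_id2 columns plus the recommended start/end pairs, and connect only adjacent sequences so that the default adjacent_only behaviour shows all of them.

```r
library(gggenomes)
library(ggplot2)

# hypothetical sequences: only seq_id and length are given
toy_seqs <- tibble::tibble(
  seq_id = c("A", "B", "C"),
  length = c(100, 120, 80)
)

# hypothetical links: A -> B and B -> C
toy_links <- tibble::tibble(
  seq_id  = c("A", "B"),
  start   = c(10, 20),
  end     = c(60, 70),
  seq_id2 = c("B", "C"),
  start2  = c(30, 5),
  end2    = c(90, 60)
)

# the two required variables must be present
has_required <- all(c("seq_id", "seq_id2") %in% names(toy_links))

p <- gggenomes(seqs = toy_seqs, links = toy_links) +
  geom_seq() +
  geom_bin_label() +
  geom_link()

p
```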
A ggplot2 layer with links.
p0 <- gggenomes(seqs = emale_seqs, links = emale_ava) + geom_seq()

# default links
p1 <- p0 + geom_link()

# change offset from seqs and color
p2 <- p0 + geom_link(aes(fill = de, color = de), offset = 0.05) +
  scale_fill_viridis_b() + scale_colour_viridis_b()

# combine with flip
p3 <- p0 |> flip(3, 4, 5) +
  geom_link()

# compute & show all links among all genomes
# usually not useful and not recommended for large dataset
p4 <- gggenomes(links = emale_ava, adjacent_only = FALSE) + geom_link()

library(patchwork) # combine plots in one figure
p1 + p2 + p3 + p4 + plot_layout(nrow = 1)

q0 <- gggenomes(emale_genes, emale_seqs) |>
  add_clusters(emale_cogs) +
  geom_seq() + geom_gene()

# link gene clusters with polygon
q1 <- q0 + geom_link(aes(fill = cluster_id))

# link gene clusters with lines
q2 <- q0 + geom_link_line(aes(color = cluster_id))

q1 + q2 + plot_layout(nrow = 1, guides = "collect")
geom_seq() draws contigs for each sequence/chromosome supplied in the seqs track.
Several sequences belonging to the same bin will be plotted next to one another.
If the seqs track is empty, sequences are inferred from the feats or links track, respectively.
(The length of sequences can be deduced from the axis and is typically indicated in base pairs.)
geom_seq(mapping = NULL, data = seqs(), arrow = NULL, ...)
mapping |
Set of aesthetic mappings created by |
data |
seq_layout: Uses the first data frame stored in the |
arrow |
set to non-NULL to generate default arrows |
... |
Other arguments passed on to
|
geom_seq() uses ggplot2::geom_segment() under the hood. As a result, different aesthetics such as alpha, linewidth, color, etc. can be called upon to modify the visualization of the data.
Note: The seqs track indicates the length/region of the sequences/contigs that will be plotted.
Feats or links data that fall outside of this region are ignored!
Sequence data drawn as contigs is added as a layer/component to the plot.
# Simple example of geom_seq
gggenomes(seqs = emale_seqs) +
  geom_seq() + # creates contigs
  geom_bin_label() # labels bins/sequences

# No sequence information supplied, will inform/warn that seqs are inferred from feats.
gggenomes(genes = emale_genes) +
  geom_seq() + # creates contigs
  geom_gene() + # draws genes on top of contigs
  geom_bin_label() # labels bins/sequences

# Sequence data controls what sequences and/or regions will be plotted.
# Here one sequence is filtered out. Notice that the genes of the removed
# sequence are silently ignored and thus not plotted.
missing_seqs <- emale_seqs |>
  dplyr::filter(seq_id != "Cflag_017B") |>
  dplyr::arrange(seq_id) # `arrange` to restore alphabetical order.

gggenomes(seqs = missing_seqs, genes = emale_genes) +
  geom_seq() + # creates contigs
  geom_gene() + # draws genes on top of contigs
  geom_bin_label() # labels bins/sequences

# Several sequences belonging to the same *bin* are plotted next to one another
seqs <- tibble::tibble(
  bin_id = c("A", "A", "A", "B", "B", "B", "B", "C", "C"),
  seq_id = c("A1", "A2", "A3", "B1", "B2", "B3", "B4", "C1", "C2"),
  start = c(0, 100, 200, 0, 50, 150, 250, 0, 400),
  end = c(100, 200, 400, 50, 100, 250, 300, 300, 500),
  length = c(100, 100, 200, 50, 50, 100, 50, 300, 100)
)

gggenomes(seqs = seqs) +
  geom_seq() +
  geom_bin_label() + # label bins
  geom_seq_label() # label individual sequences

# Wrap bins up to a certain length.
gggenomes(seqs = seqs, wrap = 300) +
  geom_seq() +
  geom_bin_label() + # label bins
  geom_seq_label() # label individual sequences

# Change the space between sequences belonging to one bin
gggenomes(seqs = seqs, spacing = 100) +
  geom_seq() +
  geom_bin_label() + # label bins
  geom_seq_label() # label individual sequences
geom_seq_break() adds decorations to the ends of truncated sequences. These
could arise from zooming onto sequence loci with focus(), or manually
annotating sequences with start > 1 and/or end < length.
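Which ends get decorated follows from the two conditions above. The sketch below applies them by hand to a made-up table (toy_seqs is hypothetical) before plotting, using the same start > 1 and end < length tests that appear as defaults for data_start and data_end in the usage.

```r
library(gggenomes)
library(ggplot2)

# hypothetical sequences: one complete, three truncated in different ways
toy_seqs <- tibble::tibble(
  seq_id = c("full", "cut_start", "cut_end", "cut_both"),
  length = c(1000, 1000, 1000, 1000),
  start  = c(1, 200, 1, 200),
  end    = c(1000, 1000, 800, 800)
)

# same conditions as the data_start / data_end defaults
cut_at_start <- toy_seqs$seq_id[toy_seqs$start > 1]
cut_at_end   <- toy_seqs$seq_id[toy_seqs$end < toy_seqs$length]

p <- gggenomes(seqs = toy_seqs) +
  geom_seq() +
  geom_seq_label() +
  geom_seq_break()

p
```

Under these assumptions, only the sequences listed in cut_at_start and cut_at_end should receive a break decoration at the respective end.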
geom_seq_break(
  mapping_start = NULL,
  mapping_end = NULL,
  data_start = seqs(start > 1),
  data_end = seqs(end < length),
  label = "/",
  size = 4,
  hjust = 0.75,
  family = "sans",
  stat = "identity",
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE,
  ...
)
mapping_start |
optional start mapping |
mapping_end |
optional end mapping |
data_start |
seq_layout of sequences for which to decorate the start.
default: seqs(start > 1) |
data_end |
seq_layout of sequences for which to decorate the end.
default: seqs(end < length) |
label |
the character to decorate ends with. Provide two values for
different start and end decorations, e.g. |
size |
of the text |
hjust |
Moves the text horizontally |
family |
font family of the text |
stat |
The statistical transformation to use on the data for this layer.
When using a
|
na.rm |
If |
show.legend |
logical. Should this layer be included in the legends?
|
inherit.aes |
If |
... |
Other arguments passed on to
|
A ggplot2 layer with sequence breaks.
# decorate breaks created with focus()
gggenomes(emale_genes, emale_seqs) |>
  focus(.expand = 1e3, .max_dist = 1e3) +
  geom_seq() +
  geom_gene() +
  geom_seq_break()

# customize decorations
gggenomes(emale_genes, emale_seqs) |>
  focus(.expand = 1e3, .max_dist = 1e3) +
  geom_seq() +
  geom_gene() +
  geom_seq_break(label = c("[", "]"), size = 3, color = "#1b9e77")

# decorate manually truncated sequences
s0 <- tibble::tribble(
  # start/end define regions, i.e. truncated contigs
  ~bin_id, ~seq_id, ~length, ~start, ~end,
  "complete_genome", "chromosome_1_long_trunc_2side", 1e5, 1e4, 2.1e4,
  "fragmented_assembly", "contig_1_trunc_1side", 1.3e4, .9e4, 1.3e4,
  "fragmented_assembly", "contig_2_short_complete", 0.3e4, 1, 0.3e4,
  "fragmented_assembly", "contig_3_trunc_2sides", 2e4, 1e4, 1.4e4
)

l0 <- tibble::tribble(
  ~seq_id, ~start, ~end, ~seq_id2, ~start2, ~end2,
  "chromosome_1_long_trunc_2side", 1.1e4, 1.4e4, "contig_1_trunc_1side", 1e4, 1.3e4,
  "chromosome_1_long_trunc_2side", 1.4e4, 1.7e4, "contig_2_short_complete", 1, 0.3e4,
  "chromosome_1_long_trunc_2side", 1.7e4, 2e4, "contig_3_trunc_2sides", 1e4, 1.3e4
)

gggenomes(seqs = s0, links = l0) +
  geom_seq() +
  geom_link() +
  geom_seq_label(nudge_y = -.05) +
  geom_seq_break()
This function puts a label at each individual sequence. By default it uses the seq_id as label, but a different column can be mapped manually. The position of the label/text can be adjusted with the different arguments (e.g. vjust, hjust, angle, etc.)
geom_seq_label(
  mapping = NULL,
  data = seqs(),
  hjust = 0,
  vjust = 1,
  nudge_y = -0.15,
  size = 2.5,
  ...
)
mapping |
Set of aesthetic mappings created by |
data |
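The data to be displayed in this layer; defaults to seqs(). There are three options: If NULL, the data is inherited from the plot data. A data.frame, or other object, will override the plot data. A function will be called with a single argument, the plot data, and its return value is used as the layer data. |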
hjust |
Moves the text horizontally |
vjust |
Moves the text vertically |
nudge_y |
Moves the text vertically an entire contig/sequence.
(e.g. |
size |
of the label |
... |
Other arguments passed on to
|
This labeling function uses ggplot2::geom_text() under the hood. Any changes to the aesthetics of the text can be performed in a ggplot2 manner.
Sequence labels are added as a text layer/component to the plot.
# example data
seqs <- tibble::tibble(
  bin_id = c("A", "A", "A", "B", "B", "B", "B", "C", "C"),
  seq_id = c("A1", "A2", "A3", "B1", "B2", "B3", "B4", "C1", "C2"),
  start = c(0, 100, 200, 0, 50, 150, 250, 0, 400),
  end = c(100, 200, 400, 50, 100, 250, 300, 300, 500),
  length = c(100, 100, 200, 50, 50, 100, 50, 300, 100)
)

# example plot using geom_seq_label
gggenomes(seqs = seqs) +
  geom_seq() +
  geom_seq_label()

# changing default label to `length` column
gggenomes(seqs = seqs) +
  geom_seq() +
  geom_seq_label(aes(label = length))

# with horizontal adjustment
gggenomes(seqs = seqs) +
  geom_seq() +
  geom_seq_label(hjust = -5)

# with wrapping at 300
gggenomes(seqs = seqs, wrap = 300) +
  geom_seq() +
  geom_seq_label()
geom_variant allows the user to draw points at locations where a mutation has occurred. Data on SNPs, insertions, deletions and more (often stored in the variant call format, VCF) can easily be visualized this way.
geom_variant(
  mapping = NULL,
  data = feats(),
  stat = "identity",
  position = "identity",
  geom = "variant",
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE,
  offset = 0,
  ...
)
mapping |
Set of aesthetic mappings created by |
data |
Data from the first feats track is used for this function by default. When several feats tracks are present within the gggenomes track system,
make sure that the wanted data is used by calling |
stat |
Describes what statistical transformation is used for this layer. By default it uses "identity". |
position |
Describes how the position of different plotted features are adjusted. By default it uses "identity". |
geom |
Describes what geom is called upon by the function for plotting. By default the function uses "variant". |
na.rm |
If |
show.legend |
logical. Should this layer be included in the legends?
|
inherit.aes |
If |
offset |
Numeric value describing how far the points will be drawn from the base/sequence. By default it is set to 0. |
... |
Other arguments passed on to
|
geom_variant uses ggplot2::geom_point under the hood. As a result, different aesthetics such as alpha, size, color, etc. can be called upon to modify the data visualization.
The function gggenomes::read_feats is able to read VCF files and converts them into a format that is applicable within gggenomes' track system.
Keep in mind: The function uses data from the feats track.
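To give an idea of what a feats-style variant table looks like, here is a minimal base R sketch that turns a few VCF-like records into the columns used in the examples (seq_id, start, end, type). The input table `vcf` and the length-based classification rule are assumptions made for illustration; this is not the conversion that read_feats performs.

```r
# Hypothetical VCF-like records (CHROM, POS, REF, ALT as in the VCF column names).
vcf <- data.frame(
  CHROM = c("A", "A", "B"),
  POS   = c(12, 40, 7),
  REF   = c("C", "G", "AT"),
  ALT   = c("T", "GCA", "A"),
  stringsAsFactors = FALSE
)

ref_len <- nchar(vcf$REF)
alt_len <- nchar(vcf$ALT)

# Assumed rule: a longer ALT is an insertion, a shorter ALT a deletion,
# equal length is treated as a SNP.
feats <- data.frame(
  seq_id = vcf$CHROM,
  start  = vcf$POS,
  end    = vcf$POS + ref_len - 1,
  type   = ifelse(alt_len > ref_len, "Insertion",
             ifelse(alt_len < ref_len, "Deletion", "SNP")),
  REF    = vcf$REF,
  ALT    = vcf$ALT,
  stringsAsFactors = FALSE
)

print(feats)
```

A table of this shape could then be passed as `feats = feats` to gggenomes() and mapped with `aes(shape = type)`, as in the examples of this section.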
A ggplot2 layer with variant information.
# Creation of example data.
# (Note: These are mere examples and do not fully resemble data from VCF-files)
## Small example data set
f1 <- tibble::tibble(
  seq_id = c(rep(c("A", "B"), 4)),
  start = c(1, 10, 15, 15, 30, 40, 40, 50),
  end = c(2, 11, 20, 16, 31, 41, 50, 51),
  length = end - start,
  type = c("SNP", "SNP", "Insertion", "Deletion", "Deletion", "SNP", "Insertion", "SNP"),
  ALT = c("A", "T", "CAT", ".", ".", "G", "GG", "G"),
  REF = c("C", "G", "C", "A", "A", "C", "G", "T")
)
s1 <- tibble::tibble(seq_id = c("A", "B"), start = c(0, 0), end = c(55, 55), length = end - start)

## larger example data set
f2 <- tibble::tibble(
  seq_id = c(rep("A", 667)),
  start = c(
    seq(from = 1, to = 500, by = 2),
    seq(from = 500, to = 2500, by = 50),
    seq(from = 2500, to = 4000, by = 4)
  ),
  end = start + 1,
  length = end - start,
  type = c(
    rep("SNP", 100),
    rep("Deletion", 20),
    rep("SNP", 180),
    rep("Deletion", 67),
    rep("SNP", 100),
    rep("Insertion", 50),
    rep("SNP", 150)
  ),
  ALT = c(
    sample(x = c("A", "C", "G", "T"), size = 100, replace = TRUE),
    rep(".", 20),
    sample(x = c("A", "C", "G", "T"), size = 180, replace = TRUE),
    rep(".", 67),
    sample(x = c("A", "C", "G", "T"), size = 100, replace = TRUE),
    sample(x = c(
      "AA", "AC", "AG", "AT", "CA", "CC", "CG", "CT",
      "GA", "GC", "GG", "GT", "TA", "TC", "TG", "TT"
    ), size = 50, replace = TRUE),
    sample(x = c("A", "C", "G", "T"), size = 150, replace = TRUE)
  )
)

# Basic example plot with geom_variant
gggenomes(seqs = s1, feats = f1) +
  geom_seq() +
  geom_variant()

# Improving plot elements, by changing shape and adding bin_label
gggenomes(seqs = s1, feats = f1) +
  geom_seq() +
  geom_variant(aes(shape = type), offset = -0.1) +
  scale_shape_variant() +
  geom_bin_label()

# Positional adjustment based on type of mutation: position_variant
gggenomes(seqs = s1, feats = f1) +
  geom_seq() +
  geom_variant(
    aes(shape = type),
    position = position_variant(offset = c(Insertion = -0.2, Deletion = -0.2, SNP = 0))
  ) +
  scale_shape_variant() +
  geom_bin_label()

# Plotting the larger example data set, changing the default geom to
# `geom = "ticks"` and using positional adjustment based on type (`position_variant`)
gggenomes(feats = f2) +
  geom_variant(aes(color = type), geom = "ticks", alpha = 0.4, position = position_variant()) +
  geom_bin_label()

# Changing geom to `"text"`, to plot ALT nucleotides
gggenomes(seqs = s1, feats = f1) +
  geom_seq() +
  geom_variant(aes(shape = type), offset = -0.1) +
  scale_shape_variant() +
  geom_variant(aes(label = ALT), geom = "text", offset = -0.25) +
  geom_bin_label()
Geom for feature text
GeomFeatText
An object of class GeomFeatText (inherits from Geom, ggproto, gg) of length 6.
Get/set the seqs track
get_seqs(x)
set_seqs(x, value)
x |
a gggenomes or gggenomes_layout object |
value |
to set for seqs |
a gggenomes_layout track tibble
gggenomes() initializes a gggenomes-flavored ggplot object. It is used to declare the input data for gggenomes' track system.
(For more details on the track system, see the gggenomes vignette or the Details/Arguments section.)
gggenomes(
  genes = NULL,
  seqs = NULL,
  feats = NULL,
  links = NULL,
  .id = "file_id",
  spacing = 0.05,
  wrap = NULL,
  adjacent_only = TRUE,
  infer_bin_id = seq_id,
  infer_start = min(start, end),
  infer_end = max(start, end),
  infer_length = max(start, end),
  theme = c("clean", NULL),
  .layout = NULL,
  ...
)
genes, feats |
A data.frame, a list of data.frames, or a character vector with paths to files containing gene data. Each item is added as feature track. For a single data.frame the track_id will be "genes" and "feats", respectively. For a list, track_ids are parsed from the list names, or if names are missing from the name of the variable containing each data.frame. Data columns:
|
seqs |
A data.frame or a character vector with paths to files containing sequence data. Data columns:
|
links |
A data.frame or a character vector with paths to files containing link data. Each item is added as links track. Data columns:
|
.id |
The name of the column for file labels that are created when reading directly from files. Defaults to "file_id". Set to "bin_id" if every file represents a different bin. |
spacing |
between sequences in bases (>1) or relative to longest bin (<1) |
wrap |
wrap bins into multiple lines with at most this many nucleotides per line. |
adjacent_only |
Indicates whether links should be created between adjacent sequences/chromosomes only.
By default it is set to TRUE. Setting it to FALSE, i.e. links between all sequences, is not recommended for large data sets. |
infer_length, infer_start, infer_end, infer_bin_id |
used to infer pseudo seqs if only feats or links are provided, or if no bin_id column was provided. The expressions are evaluated in the context of the first feat or link track. By default subregions of sequences from the first to the last feat/link
are generated. Set |
theme |
choose a gggenomes default theme, NULL to omit. |
.layout |
a pre-computed layout from |
... |
additional parameters, passed to layout |
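The default infer_* expressions from the usage above (infer_bin_id = seq_id, infer_start = min(start, end), infer_end = max(start, end), infer_length = max(start, end)) can be illustrated with a small base R sketch. The table `feats` is made up, and the grouping by seq_id is an assumption about how the expressions are applied; the sketch shows the idea of pseudo seqs, not the package's internal code.

```r
# Hypothetical feature table without any sequence information.
feats <- data.frame(
  seq_id = c("s1", "s1", "s2"),
  start  = c(100, 900, 50),
  end    = c(400, 600, 250),
  stringsAsFactors = FALSE
)

# Apply the default expressions once per seq_id (assumed grouping).
pseudo_seqs <- do.call(rbind, lapply(split(feats, feats$seq_id), function(f) {
  data.frame(
    bin_id = f$seq_id[1],            # infer_bin_id = seq_id
    seq_id = f$seq_id[1],
    start  = min(f$start, f$end),    # infer_start
    end    = max(f$start, f$end),    # infer_end
    length = max(f$start, f$end),    # infer_length
    stringsAsFactors = FALSE
  )
}))
rownames(pseudo_seqs) <- NULL

print(pseudo_seqs)
```

With these defaults each pseudo sequence covers the subregion from its first to its last feature, which matches the behaviour described for the infer_* arguments.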
gggenomes::gggenomes() resembles the functionality of ggplot2::ggplot(). It is used to construct the initial plot object, and is often followed by "+" to add components to the plot (e.g. "+ geom_gene()").
A big difference between the two is that gggenomes has a multi-track setup ('seqs', 'feats', 'genes' and 'links'). gggenomes() pre-computes a layout and adds coordinates (y, x, xend) to each data frame prior to the actual plot construction. This has some implications for the usage of gggenomes:
Data frames for tracks have required variables. These predefined variables are used during import to compute x/y coordinates (see arguments).
gggenomes' geoms can often be used without explicit aes() mappings. This works because we always know the names of the plot variables ahead of time: they originate from the pre-computed layout, and we can use that information to set sensible default aesthetic mappings for most cases.
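What "pre-computing a layout" means can be pictured with a strongly simplified base R sketch: sequences of one bin are placed next to each other on the x-axis, and each bin gets its own y position. The table `seqs`, the zero-based x origin and the gap rule are assumptions for illustration only; the real layout handles many more cases (strand, wrapping, subregions).

```r
# Hypothetical seqs track with two bins.
seqs <- data.frame(
  bin_id = c("A", "A", "B"),
  seq_id = c("a1", "a2", "b1"),
  length = c(100, 50, 120),
  stringsAsFactors = FALSE
)
spacing <- 0.05

# spacing < 1 is read as relative to the longest bin, otherwise as bases.
bin_length <- tapply(seqs$length, seqs$bin_id, sum)
gap <- if (spacing < 1) spacing * max(bin_length) else spacing

# Cumulative offset of each sequence within its bin.
offset <- ave(seqs$length + gap, seqs$bin_id, FUN = function(v) cumsum(v) - v)

seqs$y    <- match(seqs$bin_id, unique(seqs$bin_id))
seqs$x    <- offset
seqs$xend <- offset + seqs$length

print(seqs)
```

Because columns such as y, x and xend always exist after this step, a geom can fall back on them as default aesthetics, which is why explicit aes() mappings are often unnecessary.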
gggenomes-flavored ggplot object
# Compare the genomic organization of three viral elements
# EMALEs: endogenous mavirus-like elements (example data shipped with gggenomes)
gggenomes(emale_genes, emale_seqs, emale_tirs, emale_ava) +
  geom_seq() + geom_bin_label() + # chromosomes and labels
  geom_feat(size = 8) + # terminal inverted repeats
  geom_gene(aes(fill = strand), position = "strand") + # genes
  geom_link(offset = 0.15) # synteny-blocks

# with some more information
gggenomes(emale_genes, emale_seqs, emale_tirs, emale_ava) %>%
  add_feats(emale_ngaros, emale_gc) %>%
  add_clusters(emale_cogs) %>%
  sync() +
  geom_link(offset = 0.15, color = "white") + # synteny-blocks
  geom_seq() + geom_bin_label() + # chromosomes and labels
  # thistle4, salmon4, burlywood4
  geom_feat(size = 6, position = "identity") + # terminal inverted repeats
  geom_feat(
    data = feats(emale_ngaros), color = "turquoise4", alpha = .3,
    position = "strand", size = 16
  ) +
  geom_feat_note(aes(label = type),
    data = feats(emale_ngaros),
    position = "strand", nudge_y = .3
  ) +
  geom_gene(aes(fill = cluster_id), position = "strand") + # genes
  geom_wiggle(aes(z = score, linetype = "GC-content"), feats(emale_gc),
    fill = "lavenderblush4", position = position_nudge(y = -.2), height = .2
  ) +
  scale_fill_brewer("Conserved genes", palette = "Dark2", na.value = "cornsilk3")

# initialize plot directly from files
gggenomes(
  ex("emales/emales.gff"),
  ex("emales/emales.gff"),
  ex("emales/emales-tirs.gff"),
  ex("emales/emales.paf")
) + geom_seq() + geom_gene() + geom_feat() + geom_link()

# multi-contig genomes wrap to fixed width
s0 <- read_seqs(list.files(ex("cafeteria"), "Cr.*\\.fa.fai$", full.names = TRUE))
s1 <- s0 %>% dplyr::filter(length > 5e5)
gggenomes(seqs = s1, infer_bin_id = file_id, wrap = 5e6) +
  geom_seq() + geom_bin_label() + geom_seq_label()
Vectorised if_else based on strandedness
if_reverse(strand, reverse, forward)
strand |
vector with strandedness information |
reverse |
value to use for reverse elements |
forward |
value to use for forward elements |
vector with values based on strandedness
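The behaviour can be pictured as an ifelse() keyed on strand. The helper `if_reverse_sketch` below is a hypothetical stand-in that assumes strand is encoded as "+" and "-"; the actual function may accept other encodings.

```r
# Hypothetical stand-in for illustration; assumes "+" / "-" strand encoding.
if_reverse_sketch <- function(strand, reverse, forward) {
  ifelse(strand == "-", reverse, forward)
}

strand <- c("+", "-", "-", "+")

# Example use: right-align labels of reverse features, left-align forward ones.
hjust <- if_reverse_sketch(strand, reverse = 1, forward = 0)
print(hjust)
```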
Do numeric values fall into specified ranges?
in_range(x, left, right, closed = TRUE)
x |
a numeric vector of values |
left, right |
boundary values or vectors of same length as x |
closed |
whether to include ( |
a logical vector of the same length as the input
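The role of `closed` is easiest to see in a small re-implementation. The function `in_range_sketch` below is a hypothetical base R sketch that assumes left <= right and that a length-two `closed` applies to the left and right boundary in that order; it is not the package's code.

```r
# Hypothetical sketch of the range test; assumes left <= right.
in_range_sketch <- function(x, left, right, closed = TRUE) {
  closed <- rep_len(closed, 2) # one flag per boundary: left, right
  lower_ok <- if (closed[1]) x >= left else x > left
  upper_ok <- if (closed[2]) x <= right else x < right
  lower_ok & upper_ok
}

both_closed <- in_range_sketch(1:5, 2, 4)
left_open   <- in_range_sketch(1:5, 2, 4, closed = c(FALSE, TRUE))
both_open   <- in_range_sketch(1:5, 2, 4, closed = FALSE)

print(rbind(both_closed, left_open, both_open))
```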
in_range(1:5, 2, 4)
in_range(1:5, 2, 4, closed = c(FALSE, TRUE)) # left-open
in_range(1:5, 6:2, 3) # vector of boundaries, single values recycle

# plays nicely with dplyr
df <- tibble::tibble(x = rep(4, 5), left = 1:5, right = 3:7)
dplyr::mutate(df,
  closed = in_range(x, left, right, TRUE),
  open = in_range(x, left, right, FALSE)
)
Works like dplyr::mutate(), but only adds new columns and never changes existing ones. Useful to add possibly missing columns with default values.
introduce(.data, ...)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
|
a tibble with new columns
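The "add only if missing" rule can be sketched in a few lines of base R. The helper `introduce_sketch` is hypothetical and deliberately simpler than the real verb: it only accepts constant default values, not expressions that refer to other columns.

```r
# Hypothetical sketch: add columns that are missing, leave existing ones alone.
introduce_sketch <- function(df, ...) {
  defaults <- list(...)
  for (name in setdiff(names(defaults), names(df))) {
    df[[name]] <- defaults[[name]]
  }
  df
}

df <- data.frame(x = 1:3, y = c("c", "d", "e"), stringsAsFactors = FALSE)
out <- introduce_sketch(df, y = "a", z = 0)
print(out)
```

Column y keeps its original values because it already exists, while z is created with the default.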
# ensure columns "y" and "z" exist
tibble::tibble(x = 1:3) %>%
  introduce(y = "a", z = paste0(y, dplyr::row_number()))

# ensure columns "y" and "z" exist, but do not overwrite "y"
tibble::tibble(x = 1:3, y = c("c", "d", "e")) %>%
  introduce(y = "a", z = paste0(y, dplyr::row_number()))
Check whether strand is reverse
is_reverse(strand, na = FALSE)
strand |
some representation for strandedness |
na |
what to use for NA |
logical vector indicating whether the strand is reverse
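The effect of the `na` argument can be shown with a hypothetical stand-in. `is_reverse_sketch` assumes that strand is either a character vector using "-" for reverse or a numeric vector using negative values for reverse; other encodings accepted by the real function are not covered.

```r
# Hypothetical sketch; assumes "-" (character) or negative numbers mean reverse.
is_reverse_sketch <- function(strand, na = FALSE) {
  if (is.character(strand)) {
    out <- strand == "-"
  } else {
    out <- as.numeric(strand) < 0
  }
  out[is.na(out)] <- na # missing strand information falls back to `na`
  out
}

from_chr <- is_reverse_sketch(c("+", "-", NA))
from_num <- is_reverse_sketch(c(1, -1, NA), na = TRUE)
print(from_chr)
print(from_num)
```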
Re-layout the tracks and update the scales after seqs have been modified
layout(x, ...)
x |
layout |
... |
additional data |
layout with updated scales
Layout sequences
layout_seqs(
  x,
  spacing = 0.05,
  wrap = NULL,
  spacing_style = c("regular", "center", "spread"),
  keep = "strand"
)
x |
seq_layout |
spacing |
between sequences in bases (>1) or relative to longest bin (<1) |
wrap |
wrap bins into multiple lines with at most this many nucleotides per line. |
spacing_style |
one of "regular", "center", "spread" |
keep |
keys to keep (default: "strand") |
a tbl_df with plot coordinates
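The `wrap` argument can be pictured as a greedy line-filling rule. The function `wrap_lines` below is a hypothetical base R sketch that ignores spacing and simply starts a new line whenever the next sequence would exceed the limit; the package's own wrapping may differ in detail.

```r
# Hypothetical sketch: assign each sequence of one bin to a line so that
# no line holds more than `wrap` nucleotides (spacing is ignored here).
wrap_lines <- function(lengths, wrap) {
  line <- integer(length(lengths))
  current <- 1L
  used <- 0
  for (i in seq_along(lengths)) {
    if (used > 0 && used + lengths[i] > wrap) {
      current <- current + 1L # start a new line
      used <- 0
    }
    line[i] <- current
    used <- used + lengths[i]
  }
  line
}

bin_a <- wrap_lines(c(100, 100, 200), wrap = 300)
bin_b <- wrap_lines(c(50, 50, 100, 50), wrap = 300)
print(bin_a)
print(bin_b)
```

With the lengths of bins A and B from the geom_seq example data, bin A is split after its second sequence, while bin B fits on a single line.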
Pick which bins and seqs to show and in what order. Uses dplyr::select()-like syntax, which means unquoted genome names, positional arguments and selection helpers, such as tidyselect::starts_with(), are supported. Renaming is not supported.
pick(x, ...)
pick_seqs(x, ..., .bins = everything())
pick_seqs_within(x, ..., .bins = everything())
pick_by_tree(x, tree, infer_bin_id = .data$label)
x |
gggenomes object |
... |
bins/seqs to pick, select-like expression. |
.bins |
scope for positional arguments, select-like expression, enclose
multiple arguments with |
tree |
a phylogenetic tree in ggtree::ggtree or |
infer_bin_id |
an expression to extract bin_ids from the tree data. |
Use the dots to select bins or sequences (depending on function suffix), and the .bins argument to set the scope for positional arguments. For example, pick_seqs(1) will pick the first sequence from the first bin, while pick_seqs(1, .bins=3) will pick the first sequence from the third bin.
gggenomes object with selected bins and seqs.
gggenomes object with selected seqs.
gggenomes object with selected seqs.
gggenomes object with seqs selected by tree order.
pick(): pick bins by bin_id, positional argument (start at top) or select-helper.
pick_seqs(): pick individual seqs by seq_id, positional argument (start at top left) or select-helper.
pick_seqs_within(): pick individual seqs but only modify bins containing those seqs, keep rest as is.
pick_by_tree(): align bins with the leaves in a given phylogenetic tree.
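How the .bins scope changes the meaning of a positional argument can be sketched without the package. The helper `pick_seq_ids` and the table `s0` are hypothetical; the sketch only covers positional selection and returns seq_ids instead of a modified plot.

```r
# Hypothetical seqs table, same bins and seqs as in the examples of this section.
s0 <- data.frame(
  bin_id = c("A", "B", "B", "B", "C", "C", "C"),
  seq_id = c("a1", "b1", "b2", "b3", "c1", "c2", "c3"),
  stringsAsFactors = FALSE
)

# Positions are counted only among the sequences of the bins in scope.
pick_seq_ids <- function(seqs, pos, bins = unique(seqs$bin_id)) {
  if (is.numeric(bins)) bins <- unique(seqs$bin_id)[bins] # positional bins
  scoped <- seqs[seqs$bin_id %in% bins, ]
  scoped$seq_id[pos]
}

first_overall  <- pick_seq_ids(s0, 1)            # first seq of the first bin
first_in_third <- pick_seq_ids(s0, 1, bins = 3)  # first seq of the third bin
reversed_c     <- pick_seq_ids(s0, 3:1, bins = "C")
print(first_overall)
print(first_in_third)
print(reversed_c)
```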
s0 <- tibble::tibble(
  bin_id = c("A", "B", "B", "B", "C", "C", "C"),
  seq_id = c("a1", "b1", "b2", "b3", "c1", "c2", "c3"),
  length = c(1e4, 6e3, 2e3, 1e3, 3e3, 3e3, 3e3)
)

p <- gggenomes(seqs = s0) +
  geom_seq(aes(color = bin_id), size = 3) +
  geom_bin_label() +
  geom_seq_label() +
  expand_limits(color = c("A", "B", "C"))
p

# remove
p %>% pick(-B)

# select and reorder, by ID and position
p %>% pick(C, 1)

# use helper function
p %>% pick(starts_with("B"))

# pick just some seqs
p %>% pick_seqs(1, c3)

# pick with .bin scope
p %>% pick_seqs(3:1, .bins = C)

# change seqs in some bins, but keep rest as is
p %>% pick_seqs_within(3:1, .bins = B)

# same w/o scope, unaffected bins remain as is
p %>% pick_seqs_within(b3, b2, b1)

# Align sequences with a phylogenetic tree and plot them next to it
library(patchwork) # arrange multiple plots
library(ggtree) # plot phylogenetic trees

# load and plot a phylogenetic tree
emale_mcp_tree <- read.tree(ex("emales/emales-MCP.nwk"))
t <- ggtree(emale_mcp_tree) + geom_tiplab(align = TRUE, size = 3) +
  xlim(0, 0.05) # make room for labels

p <- gggenomes(seqs = emale_seqs, genes = emale_genes) +
  geom_seq() + geom_bin_label()

# plot next to each other, but with
# different order in tree and genomes
t + p + plot_layout(widths = c(1, 5))

# reorder genomes to match tree order
# with a warning caused by mismatch in y-scale expansions
t + p %>% pick_by_tree(t) + plot_layout(widths = c(1, 5))

# extra genomes are dropped with a notification
emale_seqs_more <- emale_seqs
emale_seqs_more[7, ] <- emale_seqs_more[6, ]
emale_seqs_more$seq_id[7] <- "One more genome"
p <- gggenomes(seqs = emale_seqs_more, genes = emale_genes) +
  geom_seq() + geom_bin_label()
t + p %>% pick_by_tree(t) + plot_layout(widths = c(1, 5))

try({
  # no shared ids will cause an error
  p <- gggenomes(seqs = tibble::tibble(seq_id = "foo", length = 1)) +
    geom_seq() + geom_bin_label()
  t + p %>% pick_by_tree(t) + plot_layout(widths = c(1, 5))

  # extra leaves in tree will cause an error
  emale_seqs_fewer <- dplyr::slice_head(emale_seqs, n = 4)
  p <- gggenomes(seqs = emale_seqs_fewer, genes = emale_genes) +
    geom_seq() + geom_bin_label()
  t + p %>% pick_by_tree(t) + plot_layout(widths = c(1, 5))
})
position_strand() offsets forward feats upward and reverse feats downward. position_pile() stacks overlapping feats upward. position_strandpile() stacks overlapping feats up-/downward based on their strand. position_sixframe() offsets the feats based on their strand and reading frame.
position_strand(offset = 0.1, flip = FALSE, grouped = NULL, base = offset/2)
position_pile(offset = 0.1, gap = 1, flip = FALSE, grouped = NULL, base = 0)
position_strandpile(
  offset = 0.1,
  gap = 1,
  flip = FALSE,
  grouped = NULL,
  base = offset * 1.5
)
position_sixframe(offset = 0.1, flip = FALSE, grouped = NULL, base = offset/2)
offset |
Shift overlapping feats up/down this much on the y-axis. The y-axis distance between two sequences is 1, so this is usually a small fraction, such as 0.1. |
flip |
stack downward, and for stranded versions reverse upward. |
grouped |
if TRUE feats in the same group are stacked as a single feature. Useful to move CDS and mRNA as one unit. If NULL (default) set to TRUE if data appears to contain gene-ish features. |
base |
How to align the stack relative to the sequence. 0 to center the lowest stack level on the sequence, 1 to put forward/reverse sequence one half offset above/below the sequence line. |
gap |
If two feats are closer together than this, they will be stacked. Can be negative to allow small overlaps. NA disables stacking. |
A ggproto object to be used in geom_gene().
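The stacking idea behind position_pile(), with its `gap` and `offset` arguments, can be sketched as a greedy level assignment. The function `pile_levels` is a hypothetical base R illustration, not the package's algorithm: a feature goes onto the lowest level whose last feature ends at least `gap` before it starts.

```r
# Hypothetical sketch of stacking overlapping features into levels.
pile_levels <- function(start, end, gap = 1) {
  level <- integer(length(start))
  level_end <- numeric(0) # end coordinate of the last feature on each level
  for (i in order(start)) {
    free <- which(start[i] - level_end >= gap)
    if (length(free) == 0) {
      level_end <- c(level_end, end[i]) # open a new level
      level[i] <- length(level_end)
    } else {
      level[i] <- free[1]
      level_end[free[1]] <- end[i]
    }
  }
  level
}

start <- c(1000, 2000, 3000, 4000)
end <- start + 2500
levels <- pile_levels(start, end, gap = 1)

# Each level is shifted by one `offset` on the y-axis.
offset <- 0.1
y_shift <- (levels - 1) * offset
print(levels)
print(y_shift)
```

The fourth feature starts after the first one has ended, so it drops back to the lowest level instead of opening a fourth one.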
library(patchwork)
p <- gggenomes(emale_genes) %>%
  pick(3:4) + geom_seq()

f0 <- tibble::tibble(
  seq_id = pull_seqs(p)$seq_id[1],
  start = 1:20 * 1000,
  end = start + 2500,
  strand = rep(c("+", "-"), length(start) / 2)
)

sixframe <- function(x, strand) as.character((x %% 3 + 1) * strand_int(strand))

p1 <- p + geom_gene()
p2 <- p + geom_gene(aes(fill = strand), position = "strand")
p3 <- p + geom_gene(aes(fill = strand), position = position_strand(flip = TRUE, base = 0.2))
p4 <- p + geom_gene(aes(fill = sixframe(x, strand)), position = "sixframe")
p5 <- p %>% add_feats(f0) + geom_gene() + geom_feat(aes(color = strand))
p6 <- p %>% add_feats(f0) + geom_gene() + geom_feat(aes(color = strand), position = "strandpile")

p1 + p2 + p3 + p4 + p5 + p6 + plot_layout(ncol = 3, guides = "collect") & ylim(2.5, 0.5)
library(patchwork) p <- gggenomes(emale_genes) %>% pick(3:4) + geom_seq() f0 <- tibble::tibble( seq_id = pull_seqs(p)$seq_id[1], start = 1:20 * 1000, end = start + 2500, strand = rep(c("+", "-"), length(start) / 2) ) sixframe <- function(x, strand) as.character((x %% 3 + 1) * strand_int(strand)) p1 <- p + geom_gene() p2 <- p + geom_gene(aes(fill = strand), position = "strand") p3 <- p + geom_gene(aes(fill = strand), position = position_strand(flip = TRUE, base = 0.2)) p4 <- p + geom_gene(aes(fill = sixframe(x, strand)), position = "sixframe") p5 <- p %>% add_feats(f0) + geom_gene() + geom_feat(aes(color = strand)) p6 <- p %>% add_feats(f0) + geom_gene() + geom_feat(aes(color = strand), position = "strandpile") p1 + p2 + p3 + p4 + p5 + p6 + plot_layout(ncol = 3, guides = "collect") & ylim(2.5, 0.5)
position_variant() allows the user to plot the different mutation types (e.g. del, ins, snps) at different offsets from the base.
This can be especially useful to highlight regions in which certain types of mutations are more prevalent.
This position adjustment is most relevant for the analysis/visualization of VCF files with geom_variant().
position_variant(offset = c(del = 0.1, snp = 0, ins = -0.1), base = 0)
offset |
Shifts the data up/down based on the type of mutation.
By default |
base |
How to align the offsets relative to the sequence. At base = 0, plotting of the offsets starts
from the sequence. |
A ggproto object to be used in geom_variant().
# Creation of example data.
testposition <- tibble::tibble(
  type = c("ins", "snp", "snp", "del", "del", "snp", "snp", "ins", "snp", "ins", "snp"),
  start = c(10, 20, 30, 35, 40, 60, 65, 90, 90, 100, 120),
  end = start + 1,
  seq_id = c(rep("A", 11))
)
testseq <- tibble::tibble(
  seq_id = "A",
  start = 0,
  end = 150,
  length = end - start
)

p <- gggenomes(seqs = testseq, feats = testposition)

# This first plot shows what is plotted when only geom_variant is called
p + geom_variant()

# Next, let's use position_variant and map the shape aesthetic to column `type`
p + geom_variant(aes(shape = type), position = position_variant())

# Now let's create a plot with different offsets by passing a custom named vector.
p + geom_variant(
  aes(shape = type),
  position = position_variant(c(del = 0.4, ins = -0.4))
) + scale_shape_variant()

# Changing the base shifts all points up/down relative to the sequence.
p + geom_variant(
  aes(shape = type),
  position = position_variant(base = 0.5)
) + geom_seq()
This file contains sequences, links and (optionally) genes.
read_alitv(file)
file |
path to json |
list with seqs, genes, and links
ali <- read_alitv("https://alitvteam.github.io/AliTV/d3/data/chloroplasts.json")
gggenomes(ali$genes, ali$seqs, links = ali$links) +
  geom_seq() +
  geom_bin_label() +
  geom_gene(aes(fill = class)) +
  geom_link()

p <- gggenomes(ali$genes, ali$seqs, links = ali$links) +
  geom_seq() +
  geom_bin_label() +
  geom_gene(aes(color = class)) +
  geom_link(aes(fill = identity)) +
  scale_fill_distiller(palette = "RdYlGn", direction = 1)

p %>%
  flip_seqs(5) %>%
  pick_seqs(1, 3, 2, 4, 5, 6, 7, 8)
BED files use 0-based coordinate starts, while gggenomes uses 1-based start coordinates. BED file coordinates are therefore transformed into 1-based coordinates during import.
read_bed(file, col_names = def_names("bed"), col_types = def_types("bed"), ...)
file |
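Either a path to a file, a connection, or literal data (either a single string or a raw vector). Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed. Files starting with http://, https://, ftp://, or ftps:// will be automatically downloaded. Remote gz files can also be automatically downloaded and decompressed. Literal data is most useful for examples and tests. To be recognised as
literal data, the input must be either wrapped with I(), be a string containing at least one new line, or be a vector containing at least one string with a new line. Using a value of clipboard() will read from the system clipboard. |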
col_names |
column names to use. Defaults to def_names("bed"). |
col_types |
One of If Column specifications created by Alternatively, you can use a compact string representation where each character represents one column:
By default, reading a file without a column specification will print a
message showing what |
... |
additional parameters, passed to |
tibble
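The coordinate transformation described above can be sketched in base R. This is only an illustration, not the code read_bed() runs: bed_to_one_based() is a hypothetical helper and the intervals are made up. It relies on the BED convention that intervals are 0-based and half-open, so only the start needs to move.

```r
# Hypothetical helper (not part of gggenomes): convert BED-style 0-based,
# half-open intervals into 1-based, closed intervals.
bed_to_one_based <- function(bed) {
  bed$start <- bed$start + 1L  # 0-based start -> 1-based start
  # `end` stays as is: the exclusive 0-based end equals the inclusive 1-based end
  bed
}

# Made-up BED records
bed <- data.frame(
  seq_id = c("chr1", "chr1"),
  start = c(0L, 99L),
  end = c(100L, 200L)
)
one_based <- bed_to_one_based(bed)

# Feature lengths are unchanged by the conversion
len_bed <- bed$end - bed$start
len_one_based <- one_based$end - one_based$start + 1L
```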
Read BLAST tab-separated output
read_blast(
  file,
  col_names = def_names("blast"),
  col_types = def_types("blast"),
  comment = "#",
  swap_query = FALSE,
  ...
)
file |
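Either a path to a file, a connection, or literal data (either a single string or a raw vector). Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed. Files starting with http://, https://, ftp://, or ftps:// will be automatically downloaded. Remote gz files can also be automatically downloaded and decompressed. Literal data is most useful for examples and tests. To be recognised as
literal data, the input must be either wrapped with I(), be a string containing at least one new line, or be a vector containing at least one string with a new line. Using a value of clipboard() will read from the system clipboard. |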
col_names |
column names to use. Defaults to def_names("blast"). |
col_types |
column types to use. Defaults to def_types("blast"). |
comment |
character |
swap_query |
if TRUE, swap query and subject columns using swap_query() |
... |
additional parameters, passed to |
a tibble with the BLAST output
Powers read_seqs(), read_feats(), read_links().
read_context(
  files,
  context,
  .id = "file_id",
  format = NULL,
  parser = NULL,
  ...
)
files |
files to read. Should all be of the same format. In many cases,
compressed files ( |
context |
the context ("seqs", "feats", "links") in which a given format should be read. |
.id |
the column with the name of the file a record was read from. Defaults to "file_id". Set to "bin_id" if every file represents a different bin. |
format |
specify a format known to gggenomes, such as |
parser |
specify the name of an R function to overwrite automatic
determination based on format, e.g. |
... |
additional arguments passed on to the format-specific read function called down the line. |
a tibble with the combined data from all files
Genbank flat files (.gb/.gbk/.gbff) and their ENA and DDBJ equivalents have a
particularly gruesome format. That's why read_gbk() is just a wrapper
around a Perl-based gb2gff converter and read_gff3().
read_gbk(file, sources = NULL, types = NULL, infer_cds_parents = TRUE)
file |
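Either a path to a file, a connection, or literal data (either a single string or a raw vector). Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed. Files starting with http://, https://, ftp://, or ftps:// will be automatically downloaded. Remote gz files can also be automatically downloaded and decompressed. Literal data is most useful for examples and tests. To be recognised as
literal data, the input must be either wrapped with I(), be a string containing at least one new line, or be a vector containing at least one string with a new line. Using a value of clipboard() will read from the system clipboard. |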
sources |
only return features from these sources |
types |
only return features of these types, e.g. gene, CDS, ... |
infer_cds_parents |
infer the mRNA parent for CDS features based on overlapping coordinates. Default TRUE for gff2/gtf, FALSE for gff3. In most GFFs this is properly set, but sometimes this information is missing. Generally, this is not a problem, however, geom_gene calls parse the parent information to determine which CDS and mRNAs are part of the same gene model. Without the parent info, mRNA and CDS are plotted as individual features. |
tibble
Files with a ##FASTA section work but result in parsing problems for all
lines of the fasta section. Just ignore those warnings, or strip the fasta
section ahead of time from the file.
read_gff3(
  file,
  sources = NULL,
  types = NULL,
  infer_cds_parents = is_gff2,
  sort_exons = TRUE,
  col_names = def_names("gff3"),
  col_types = def_types("gff3"),
  keep_attr = FALSE,
  fix_augustus_cds = TRUE,
  is_gff2 = NULL
)
file |
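Either a path to a file, a connection, or literal data (either a single string or a raw vector). Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed. Files starting with http://, https://, ftp://, or ftps:// will be automatically downloaded. Remote gz files can also be automatically downloaded and decompressed. Literal data is most useful for examples and tests. To be recognised as
literal data, the input must be either wrapped with I(), be a string containing at least one new line, or be a vector containing at least one string with a new line. Using a value of clipboard() will read from the system clipboard. |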
sources |
only return features from these sources |
types |
only return features of these types, e.g. gene, CDS, ... |
infer_cds_parents |
infer the mRNA parent for CDS features based on overlapping coordinates. Default TRUE for gff2/gtf, FALSE for gff3. In most GFFs this is properly set, but sometimes this information is missing. Generally, this is not a problem, however, geom_gene calls parse the parent information to determine which CDS and mRNAs are part of the same gene model. Without the parent info, mRNA and CDS are plotted as individual features. |
sort_exons |
make sure that exons/introns appear sorted. Default TRUE. Set to FALSE to read CDS/exon order exactly as present in the file, which is less robust, but faster and allows non-canonical splicing (exon1-exon3-exon2). |
col_names |
column names to use. Defaults to def_names("gff3"). |
col_types |
column types to use. Defaults to def_types("gff3"). |
keep_attr |
keep the original attributes column also after parsing tag=value pairs into tidy columns. |
fix_augustus_cds |
If true, assume Augustus gff with bad CDS IDs that need fixing |
is_gff2 |
set if file is in gff2 format |
tibble
Read a minimap/minimap2 .paf file including optional tagged extra fields. The optional fields will be parsed into a tidy format, one column per tag.
read_paf(
  file,
  max_tags = 20,
  col_names = def_names("paf"),
  col_types = def_types("paf"),
  ...
)
file |
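Either a path to a file, a connection, or literal data (either a single string or a raw vector). Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed. Files starting with http://, https://, ftp://, or ftps:// will be automatically downloaded. Remote gz files can also be automatically downloaded and decompressed. Literal data is most useful for examples and tests. To be recognised as
literal data, the input must be either wrapped with I(), be a string containing at least one new line, or be a vector containing at least one string with a new line. Using a value of clipboard() will read from the system clipboard. |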
max_tags |
maximum number of optional fields to include |
col_names |
column names to use. Defaults to def_names("paf"). |
col_types |
column types to use. Defaults to def_types("paf"). |
... |
additional parameters, passed to |
Because readr::read_tsv expects a fixed number of columns, but in .paf the
number of optional fields can differ among records, read_paf tries to read
at least as many columns as the longest record has (max_tags). The
resulting warnings for each record with fewer fields of the form "32 columns
expected, only 22 seen" should thus be ignored.
From the minimap2 manual
+-----+--------+---------------------------------------------------------+
| Col | Type   | Description                                             |
+-----+--------+---------------------------------------------------------+
| 1   | string | Query sequence name                                     |
| 2   | int    | Query sequence length                                   |
| 3   | int    | Query start coordinate (0-based)                        |
| 4   | int    | Query end coordinate (0-based)                          |
| 5   | char   | '+' if query/target on the same strand; '-' if opposite |
| 6   | string | Target sequence name                                    |
| 7   | int    | Target sequence length                                  |
| 8   | int    | Target start coordinate on the original strand          |
| 9   | int    | Target end coordinate on the original strand            |
| 10  | int    | Number of matching bases in the mapping                 |
| 11  | int    | Number bases, including gaps, in the mapping            |
| 12  | int    | Mapping quality (0-255 with 255 for missing)            |
+-----+--------+---------------------------------------------------------+

+-----+------+-------------------------------------------------------+
| Tag | Type | Description                                           |
+-----+------+-------------------------------------------------------+
| tp  | A    | Type of aln: P/primary, S/secondary and I,i/inversion |
| cm  | i    | Number of minimizers on the chain                     |
| s1  | i    | Chaining score                                        |
| s2  | i    | Chaining score of the best secondary chain            |
| NM  | i    | Total number of mismatches and gaps in the alignment  |
| MD  | Z    | To generate the ref sequence in the alignment         |
| AS  | i    | DP alignment score                                    |
| ms  | i    | DP score of the max scoring segment in the alignment  |
| nn  | i    | Number of ambiguous bases in the alignment            |
| ts  | A    | Transcript strand (splice mode only)                  |
| cg  | Z    | CIGAR string (only in PAF)                            |
| cs  | Z    | Difference string                                     |
| dv  | f    | Approximate per-base sequence divergence              |
+-----+------+-------------------------------------------------------+
From https://samtools.github.io/hts-specs/SAMtags.pdf: type may be one of A (character), B (general array), f (real number), H (hexadecimal array), i (integer), or Z (string).
tibble
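To make the "one column per tag" idea concrete, the following base R sketch splits optional PAF fields of the form TAG:TYPE:VALUE into a named list with typed values. It is an illustration under assumptions, not how read_paf() is implemented: parse_paf_tags() is a hypothetical helper, only the i and f types are converted, and the field values are made up.

```r
# Hypothetical helper (not part of gggenomes): parse the optional
# "TAG:TYPE:VALUE" fields of a single PAF record into a named list.
parse_paf_tags <- function(fields) {
  parts <- strsplit(fields, ":", fixed = TRUE)
  tags <- vapply(parts, function(p) p[1], character(1))
  types <- vapply(parts, function(p) p[2], character(1))
  # re-join the remainder in case the value itself contains ":"
  values <- vapply(parts, function(p) paste(p[-(1:2)], collapse = ":"), character(1))
  out <- lapply(seq_along(values), function(i) {
    switch(types[i],
      i = as.integer(values[i]),  # integer
      f = as.numeric(values[i]),  # real number
      values[i]                   # A, Z and others are kept as strings here
    )
  })
  names(out) <- tags
  out
}

# Made-up optional fields of one record
tags <- parse_paf_tags(c("tp:A:P", "cm:i:120", "dv:f:0.0123", "cg:Z:50M2D48M"))
```

Records carry different sets of tags, so combining several such lists into one table leaves missing values wherever a record lacks a tag.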
Read sequence index
read_seq_len(file)
read_fai(file, col_names = def_names("fai"), col_types = def_types("fai"), ...)
file |
with sequence length information |
col_names |
Either If If Missing ( |
col_types |
One of If Column specifications created by Alternatively, you can use a compact string representation where each character represents one column:
By default, reading a file without a column specification will print a
message showing what |
... |
additional parameters, passed to |
tibble with sequence information
tibble with sequence information
read_seq_len(): read seqs from a single file_name in fasta, gbk or gff3 format.
read_fai(): read seqs from a single file in seqkit/samtools fai format.
Convenience functions to read sequences, features or links from various
bioinformatics file formats, such as FASTA, GFF3, Genbank, BLAST tabular
output, etc. See def_formats() for the full list. File formats and the
corresponding read-functions are automatically determined based on file
extensions. All these functions can read multiple files in the same format at
once, and combine them into a single table - useful, for example, to read a
folder of gff-files with each file containing genes of a different genome.
read_feats(files, .id = "file_id", format = NULL, parser = NULL, ...)
read_subfeats(files, .id = "file_id", format = NULL, parser = NULL, ...)
read_links(files, .id = "file_id", format = NULL, parser = NULL, ...)
read_sublinks(files, .id = "file_id", format = NULL, parser = NULL, ...)
read_seqs(
  files,
  .id = "file_id",
  format = NULL,
  parser = NULL,
  parse_desc = TRUE,
  ...
)
files |
files to read. Should all be of the same format. In many cases,
compressed files ( |
.id |
the column with the name of the file a record was read from. Defaults to "file_id". Set to "bin_id" if every file represents a different bin. |
format |
specify a format known to gggenomes, such as |
parser |
specify the name of an R function to overwrite automatic
determination based on format, e.g. |
... |
additional arguments passed on to the format-specific read function called down the line. |
parse_desc |
turn |
A gggenomes-compatible sequence, feature or link tibble
tibble with features
tibble with features
tibble with links
tibble with links
tibble with sequence information
read_feats(): read files as features mapping onto sequences.
read_subfeats(): read files as subfeatures mapping onto other features.
read_links(): read files as links connecting sequences.
read_sublinks(): read files as sublinks connecting features.
read_seqs(): read sequence ID, description and length.
# read genes/features from a gff file
read_feats(ex("eden-utr.gff"))

# read all gff files from a directory
read_feats(list.files(ex("emales/"), "*.gff$", full.names = TRUE))

# read remote files
gbk_phages <- c(
  PSSP7 = paste0(
    "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/",
    "000/858/745/GCF_000858745.1_ViralProj15134/",
    "GCF_000858745.1_ViralProj15134_genomic.gff.gz"
  ),
  PSSP3 = paste0(
    "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/",
    "000/904/555/GCF_000904555.1_ViralProj195517/",
    "GCF_000904555.1_ViralProj195517_genomic.gff.gz"
  )
)
read_feats(gbk_phages)

# read sequences from a fasta file.
read_seqs(ex("emales/emales.fna"), parse_desc = FALSE)

# read sequence info from a fasta file with `parse_desc=TRUE` (default). `key=value`
# pairs are removed from `seq_desc` and parsed into columns with `key` as name
read_seqs(ex("emales/emales.fna"))

# read sequence info from samtools/seqkit style index
read_seqs(ex("emales/emales.fna.seqkit.fai"))

# read sequence info from multiple gff files
read_seqs(c(ex("emales/emales.gff"), ex("emales/emales-tirs.gff")))
The VCF (Variant Call Format) file format is used to store variation data and its metadata. Depending on the analysis program used (e.g. GATK, freebayes, etc.), details within the VCF file can differ slightly. For example, some variant analysis programs do not report the type of mutation. The read_vcf() function ignores the leading header/metadata lines and directly converts the data into a tidy data frame. The function extracts the type of mutation; if it is absent, the type is derived from the "ref" and "alt" columns.
read_vcf(
  file,
  parse_info = FALSE,
  col_names = def_names("vcf"),
  col_types = def_types("vcf")
)
file |
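Either a path to a file, a connection, or literal data (either a single string or a raw vector). Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed. Files starting with http://, https://, ftp://, or ftps:// will be automatically downloaded. Remote gz files can also be automatically downloaded and decompressed. Literal data is most useful for examples and tests. To be recognised as
literal data, the input must be either wrapped with I(), be a string containing at least one new line, or be a vector containing at least one string with a new line. Using a value of clipboard() will read from the system clipboard. |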
parse_info |
if set to 'TRUE', read_vcf() splits all the metadata stored in the "info" column and stores it in separate columns. Defaults to 'FALSE'. |
col_names |
column names to use. Defaults to def_names("vcf"). |
col_types |
column types to use. Defaults to def_types("vcf"). |
dataframe
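One plausible way to derive a mutation type from the "ref" and "alt" columns is to compare allele lengths. The base R sketch below is an assumption about the general approach, not the rule set read_vcf() actually applies: infer_variant_type() is a hypothetical helper, the labels merely mirror the defaults of scale_shape_variant(), and the alleles are made up.

```r
# Hypothetical helper (not part of gggenomes): classify variants by comparing
# the lengths of the reference and alternative alleles.
infer_variant_type <- function(ref, alt) {
  ref_len <- nchar(ref)
  alt_len <- nchar(alt)
  type <- rep(NA_character_, length(ref))
  type[ref_len == 1 & alt_len == 1] <- "SNP"  # single base exchanged
  type[alt_len > ref_len] <- "Insertion"      # alt allele is longer
  type[alt_len < ref_len] <- "Deletion"       # alt allele is shorter
  type                                        # anything else stays NA
}

# Made-up alleles
vcf <- data.frame(
  ref = c("A", "A", "ATG", "GC"),
  alt = c("G", "ATT", "A", "TA"),
  stringsAsFactors = FALSE
)
vcf$type <- infer_variant_type(vcf$ref, vcf$alt)
```

Multi-base substitutions of equal length (the last record) are deliberately left as NA in this sketch; a real parser would need its own category for them.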
Require variables in an object
require_vars(x, vars, warn_only = FALSE)
x |
object |
vars |
required variables |
warn_only |
don't die on missing vars |
the original tibble if all vars are present or warning only
scale_color_variant() is a convenience function that changes the color of (SNP) points based on their nucleotides (A, C, G, T).
By default the function uses a colorblind-friendly palette, but users can manually overwrite these colors.
(Within the plotting function, e.g. geom_variant(), the color aesthetic must still be mapped to a column with aes(color = ...).)
scale_shape_variant() changes the shape of plotted points based on the type of mutation.
The user can also manually decide which shape each specific type of mutation should have.
By default, SNPs are diamonds, deletions are downward-pointing triangles and insertions are upward-pointing triangles.
(These default settings make most sense when using geom_variant(offset = -0.2).)
(The user must still manually map a column to the shape aesthetic.)
scale_color_variant(
  values = c(A = "#e66101", C = "#b2abd2", G = "#5e3c99", T = "#fdb863"),
  na.value = "white",
  ...
)
scale_shape_variant(
  values = c(SNP = 23, Deletion = 25, Insertion = 24),
  na.value = 1,
  characters = FALSE,
  ...
)
values |
A vector indicating how to color/shape different variables.
The functions |
na.value |
The aesthetic value (color/shape/etc.) to use for non matching values. |
... |
Additional parameters, passed to scale_color_manual |
characters |
When |
A ggplot2 scale object for color or shape.
# Creation of example data.
testposition <- tibble::tibble(
  type = c(
    "Insertion", "SNP", "SNP", "Deletion", "Deletion", "SNP", "SNP",
    "Insertion", "SNP", "Insertion", "SNP"
  ),
  start = c(10, 20, 30, 35, 40, 60, 65, 90, 90, 100, 120),
  ALT = c("AT", "G", "C", ".", ".", "T", "C", "CAT", "G", "TC", "A"),
  REF = c("A", "T", "G", "A", "A", "G", "A", "C", "A", "T", "G"),
  end = start + 1,
  seq_id = c(rep("A", 11))
)
testseq <- tibble::tibble(
  seq_id = "A",
  start = 0,
  end = 150,
  length = end - start
)

p1 <- gggenomes(seqs = testseq, feats = testposition)
p2 <- p1 + geom_seq()

## scale_color_variant()
# Changing the color aesthetics in geom_variant: colors all mutations
# (In this example, all ALT (alternative) nucleotides are being colored)
p1 + geom_variant(aes(color = ALT))

# Color all SNPs with default colors using scale_color_variant().
# (SNPs are 1 nucleotide long, other mutations such as Insertions
# and Deletions have either more or fewer nucleotides within the
# ALT column and are thus not plotted)
p1 + geom_variant(aes(color = ALT)) +
  scale_color_variant()

# Manually changing colors with scale_color_variant()
p1 + geom_variant(aes(color = ALT)) +
  scale_color_variant(values = c(A = "purple", T = "darkred", TC = "black", AT = "pink"))

## scale_shape_variant()
# Changing the `shape` aesthetics in geom_variant
p2 + geom_variant(aes(shape = type), offset = -0.1)

# Calling upon scale_shape_variant() to change shapes
p2 + geom_variant(aes(shape = type), offset = -0.1) +
  scale_shape_variant()

# Manually changing shapes with scale_shape_variant()
p2 + geom_variant(aes(shape = type), offset = -0.1) +
  scale_shape_variant(values = c(SNP = 14, Deletion = 18, Insertion = 21))

# Plotting (nucleotide) characters instead of shapes
p2 + geom_variant(aes(shape = ALT), offset = -0.1, size = 3) +
  scale_shape_variant(characters = TRUE)

# Alternative way to plot nucleotides (of ALT) by using `geom=text` within `geom_variant()`
gggenomes(seqs = testseq, feats = testposition) +
  geom_seq() +
  geom_variant(aes(shape = type), offset = -0.1) +
  scale_shape_variant() +
  geom_variant(aes(label = ALT), geom = "text", offset = -0.25) +
  geom_bin_label()

# Combining scale_color_variant() and scale_shape_variant()
p2 + geom_variant(aes(shape = ALT, color = ALT), offset = -0.1, size = 3, show.legend = FALSE) +
  geom_variant(aes(color = ALT)) +
  scale_color_variant(na.value = "black") +
  scale_shape_variant(characters = TRUE)
scale_x_bp() is the default scale for genomic x-axis. It wraps
ggplot2::scale_x_continuous() using label_bp() as default labeller.
scale_x_bp(..., suffix = "", sep = "", accuracy = 1)
label_bp(suffix = "", sep = "", accuracy = 1)
... |
Arguments passed on to |
suffix |
unit suffix e.g. "bp" |
sep |
between number and unit prefix+suffix |
accuracy |
A number to round to. Use (e.g.) 0.01 to show 2 decimal places of precision. Applied to rescaled data. |
A ggplot2 scale object with bp labels
A labeller function for genomic data
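How suffix and sep combine with a unit prefix can be shown with a base R sketch. It is written independently of the package and does not claim to reproduce label_bp() exactly: label_bp_sketch() is a hypothetical function, it ignores accuracy, and it only knows the prefixes k, M and G.

```r
# Hypothetical labeller (not part of gggenomes): number + sep + prefix + suffix
label_bp_sketch <- function(x, suffix = "", sep = "") {
  prefixes <- c("", "k", "M", "G")
  # power-of-1000 bucket per value, clamped to the known prefixes (0 -> no prefix)
  idx <- pmax(0, pmin(3, floor(log10(abs(x)) / 3)))
  paste0(x / 1000^idx, sep, prefixes[idx + 1], suffix)
}

plain <- label_bp_sketch(c(0, 500, 30000, 2e6))
with_unit <- label_bp_sketch(c(1500, 30000), suffix = "bp", sep = " ")
```

Here sep is placed between the number and the combined prefix and suffix, matching the argument description above.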
# scale_x_bp invoked by default
gggenomes(emale_genes) + geom_gene()

# customize labels
gggenomes(emale_genes) + geom_gene() +
  scale_x_bp(suffix = "bp", sep = " ")

# Note: xlim will overwrite scale_x_bp() with ggplot2::scale_x_continuous()
gggenomes(emale_genes) + geom_gene() +
  xlim(0, 3e4)

# set limits explicitly with scale_x_bp() to avoid overwrite
gggenomes(emale_genes) + geom_gene() +
  scale_x_bp(limits = c(0, 3e4))
Set class of an object. Optionally append or prepend to existing class
attributes. add_class is short for set_class(x, class, "prepend").
strip_class removes matching class strings from the class attribute vector.
set_class(x, class, add = c("overwrite", "prepend", "append"))
add_class(x, class)
strip_class(x, class)
x |
Object to assign new class to. |
class |
Class value to add/strip. |
add |
Possible values: "overwrite", "prepend", "append" |
Object x as class value.
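The three modes can be mimicked in base R as follows. This is a sketch of the described behaviour, not the package source: set_class_sketch() and strip_class_sketch() are hypothetical stand-ins, and "my_tbl" is a made-up class name.

```r
# Hypothetical stand-ins (not part of gggenomes) for the behaviour described above
set_class_sketch <- function(x, cls, add = c("overwrite", "prepend", "append")) {
  add <- match.arg(add)  # defaults to "overwrite"
  class(x) <- switch(add,
    overwrite = cls,
    prepend = c(cls, class(x)),
    append = c(class(x), cls)
  )
  x
}

strip_class_sketch <- function(x, cls) {
  class(x) <- setdiff(class(x), cls)  # drop matching class strings
  x
}

df <- data.frame(a = 1)
prepended <- set_class_sketch(df, "my_tbl", "prepend")
appended <- set_class_sketch(df, "my_tbl", "append")
stripped <- strip_class_sketch(prepended, "my_tbl")
```

Prepending matters for S3 dispatch, because methods for the first class in the vector are tried first.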
Shift bins along the x-axis, i.e. left or right in the default plot layout. This is useful to align feats of interest in different bins.
shift(x, bins = everything(), by = 0, center = FALSE)
x |
gggenomes object |
bins |
to shift left/right, select-like expression |
by |
shift each bin by this many bases. Single value or vector of the same length as bins. |
center |
horizontal centering |
gggenomes object with shifted seqs
p0 <- gggenomes(emale_genes, emale_seqs) +
  geom_seq() + geom_gene()

# Slide one bin left and one bin right
p1 <- p0 |> shift(2:3, by = c(-8000, 10000))

# align all bins to a target gene
mcp <- emale_genes |>
  dplyr::filter(name == "MCP") |>
  dplyr::group_by(seq_id) |>
  dplyr::slice_head(n = 1) # some have fragmented MCP gene, keep only first

p2 <- p0 |> shift(all_of(mcp$seq_id), by = -mcp$start) +
  geom_gene(data = genes(name == "MCP"), fill = "#01b9af")

library(patchwork)
p0 + p1 + p2
Convert strand to character
strand_chr(strand, na = NA)
strand |
some representation for strandedness |
na |
what to use for |
strand vector as character
Convert strand to integer
strand_int(strand, na = NA)
strand |
some representation for strandedness |
na |
what to use for |
strand vector as integer
Convert strand to logical
strand_lgl(strand, na = NA)
strand |
some representation for strandedness |
na |
what to use for |
strand vector as logical
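The idea behind strand_chr(), strand_int() and strand_lgl() is to normalise different strand encodings into one type. The base R sketch below shows this for the integer case only. Which representations the package accepts is not stated here, so the set handled by the hypothetical strand_int_sketch() ("+"/"-", 1/-1, TRUE/FALSE) is an assumption.

```r
# Hypothetical stand-in (not part of gggenomes): normalise strand to 1 / -1.
# Assumed input encodings: "+"/"-", 1/-1 and TRUE/FALSE.
strand_int_sketch <- function(strand, na = NA) {
  s <- as.character(strand)
  out <- rep(NA_integer_, length(s))
  out[s %in% c("+", "1", "TRUE")] <- 1L
  out[s %in% c("-", "-1", "FALSE")] <- -1L
  out[is.na(out)] <- as.integer(na)  # replacement for missing/unknown strand
  out
}

from_chr <- strand_int_sketch(c("+", "-", NA))
from_lgl <- strand_int_sketch(c(TRUE, FALSE))
with_default <- strand_int_sketch(c("+", "."), na = 0)
```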
Swap values of two columns based on a condition
swap_if(x, condition, ...)
x |
a tibble |
condition |
an expression to be evaluated in data context returning a TRUE/FALSE vector |
... |
the two columns between which values are to be swapped in dplyr::select-like syntax |
a tibble with conditionally swapped start and end
x <- tibble::tibble(start = c(10, 100), end = c(30, 50))
# ensure start of a range is always smaller than the end
swap_if(x, start > end, start, end)
Swap query and subject columns in a table read with read_feats() or
read_links(), for example, from blast searches. Swaps columns with
name/name2, such as 'seq_id/seq_id2', 'start/start2', ...
swap_query(x)
x |
tibble with query and subject columns |
tibble with swapped query/subject columns
feats <- tibble::tribble(
  ~seq_id, ~seq_id2, ~start, ~end, ~strand, ~start2, ~end2, ~evalue,
  "A", "B", 100, 200, "+", 10000, 10200, 1e-5
)
# make B the query
swap_query(feats)
gggenomes default theme
theme_gggenomes_clean(
  base_size = 12,
  base_family = "",
  base_line_size = base_size/30,
  base_rect_size = base_size/30
)
base_size |
base font size, given in pts. |
base_family |
base font family |
base_line_size |
base size for line elements |
base_rect_size |
base size for rect elements |
ggplot2 theme with gggenomes defaults
Named vector of track ids and types
track_ids(x, track_type, ...)
x |
A gggenomes or gggenomes_layout object |
track_type |
restrict to any combination of "seqs", "feats" and "links". |
... |
unused |
a named vector of track ids and types
Call track_info() on a gggenomes or gggenomes_layout object to return a short tibble with ids, types, index and size of the loaded tracks.
track_info(x, ...)
x |
A gggenomes or gggenomes_layout object |
... |
unused |
The short tibble contains basic information on the tracks within the given gggenomes object.
id : The original name of the input data frame (only shown when a track holds more than one data frame).
type : The track in which the data frame is present.
i (index) : The chronological order of data frames in a specific track.
n (size) : The number of objects plotted from the data frame (not the number of objects in the input data frame).
Short tibble with ids, types, index and size of loaded tracks.
gggenomes(
  seqs = emale_seqs,
  feats = list(emale_genes, emale_tirs, emale_ngaros),
  links = emale_ava
) |>
  track_info()
Unnest exons
unnest_exons(x)
x |
data |
data with unnested exons
Based on tidyselect::vars_pull. Powers track selection in pull_track(). Catches and modifies errors from vars_pull to track-relevant info.
vars_track( x, track_id, track_type = c("seqs", "feats", "links"), ignore = NULL )
x |
A gggenomes or gggenomes_layout object |
track_id |
a quoted or unquoted name, or a positive/negative integer giving the position from the left/right. |
track_type |
restrict to these types of tracks - affects position-based selection |
ignore |
names of tracks to ignore when selecting by position. |
The selected track_id as an unnamed string
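Position-based selection, as described for `track_id` and `ignore`, can be pictured with a small base R sketch. `pick_track()` and the example ids are hypothetical and only illustrate the left/right counting; the sketch assumes ignored tracks are dropped before positions are counted, and it does not reproduce the error handling or `track_type` filtering of vars_track().

```r
# Hypothetical helper (not part of gggenomes): resolve a positive or
# negative position against a vector of track ids.
pick_track <- function(track_ids, pos, ignore = NULL) {
  # assumption: ignored tracks are removed before counting positions
  ids <- setdiff(track_ids, ignore)
  # negative positions count from the right: -1 is the last track
  if (pos < 0) pos <- length(ids) + pos + 1
  if (pos < 1 || pos > length(ids)) stop("no track at that position")
  ids[[pos]]
}

ids <- c("seqs", "genes", "tirs", "links")
pick_track(ids, 1)                      # "seqs"
pick_track(ids, -1)                     # "links"
pick_track(ids, -1, ignore = "links")   # "tirs"
```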
Always returns a positive value, even if start > end. width0 is a short handle for width(..., base=0)
width(start, end, base = 1)
width0(start, end, base = 0)
start, end |
start and end of the range |
base |
the base of the coordinate system, usually 1 or 0. |
a numeric vector
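The role of `base` is easiest to see with numbers. The sketch below assumes the usual convention that a range's width is the absolute coordinate difference plus the base of the coordinate system; `range_width()` is a hypothetical stand-in written in base R, not the package's own code.

```r
# Hypothetical stand-in (not the gggenomes source): width of a range as the
# absolute coordinate difference plus the base of the coordinate system.
range_width <- function(start, end, base = 1) {
  abs(end - start) + base
}

range_width(10, 30)            # 21: 1-based, both ends included
range_width(30, 10)            # 21: the order of start and end does not matter
range_width(10, 30, base = 0)  # 20: 0-based coordinates
```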
Write a gff3 file from a tidy table
write_gff3( feats, file, seqs = NULL, type = NULL, source = ".", score = ".", strand = ".", phase = ".", id_var = "feat_id", parent_var = "parent_ids", head = "##gff-version 3", ignore_attr = c("introns", "geom_id") )
feats |
tidy feat table |
file |
name of output file |
seqs |
a tidy sequence table to generate optional |
type |
if no type column exists, use this as the default type |
source |
if no source column exists, use this as the default source |
score |
if no score column exists, use this as the default score |
strand |
if no strand column exists, use this as the default strand |
phase |
if no phase column exists, use this as the default phase |
id_var |
the name of the column to use as the GFF3 ID tag |
parent_var |
the name of the column to use as GFF3 Parent tag |
head |
additional information to add to the header section |
ignore_attr |
attributes not to be included in GFF3 tag list. Defaults to internals: introns, geom_id |
No return value, writes to file
filename <- tempfile(fileext = ".gff")
write_gff3(emale_genes, filename, emale_seqs, id_var = "feat_id")
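For orientation, a GFF3 record is a single tab-separated line with nine columns: seqid, source, type, start, end, score, strand, phase and attributes. The base R sketch below shows how a small tidy table could map onto such lines, with "." standing in for columns that have no value, as in the defaults above. The toy table, the "gene" type and the manual `paste()` are illustrative assumptions only; this is not how write_gff3() is implemented.

```r
# Toy feature table; column names follow the tidy feat conventions used above
feats <- data.frame(
  seq_id = c("A", "A"),
  start = c(100, 500),
  end = c(200, 900),
  strand = c("+", "-"),
  feat_id = c("g1", "g2"),
  stringsAsFactors = FALSE
)

# Assemble the nine GFF3 columns; "." marks missing source, score and phase,
# and "gene" is an assumed default type for this illustration
gff_lines <- paste(
  feats$seq_id, ".", "gene", feats$start, feats$end, ".", feats$strand, ".",
  paste0("ID=", feats$feat_id),
  sep = "\t"
)

out <- tempfile(fileext = ".gff")
writeLines(c("##gff-version 3", gff_lines), out)
readLines(out)
```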