Create Sequence Map Analyzes the frequency of a target sequence motif across splicing junction regions. Compares motif frequency between Retained, Excluded, and Control splicing events to identify position-specific enrichment patterns.

Usage

createSequenceMap(
  SEMATS,
  sequence,
  motif_mode = c("combined", "individual"),
  genome = NULL,
  moving_average = 40,
  WidthIntoExon = 50,
  WidthIntoIntron = 250,
  p_valueRetainedAndExclusion = 0.05,
  p_valueControls = 0.95,
  retained_IncLevelDifference = 0.1,
  exclusion_IncLevelDifference = -0.1,
  Min_Count = 50,
  groups = c("Retained", "Excluded", "Control"),
  control_multiplier = 2,
  control_iterations = 20,
  z_threshold = 1.96,
  min_consecutive = 10,
  one_sided = TRUE,
  use_fdr = TRUE,
  fdr_threshold = 0.05,
  show_significance = TRUE,
  return_data = FALSE,
  return_diagnostics = FALSE,
  verbose = TRUE,
  progress_callback = NULL,
  title = "",
  retained_col = "blue",
  excluded_col = "red",
  control_col = "black",
  line_width = 0.8,
  line_alpha = 0.7,
  ribbon_alpha = 0.3,
  title_size = 20,
  title_color = "black",
  axis_text_size = 11,
  boundary_col = "gray70",
  exon_col = "navy",
  legend_position = "bottom",
  ylab = "Frequency"
)

Arguments

SEMATS: A data frame containing SE.MATS output with columns: chr, strand, upstreamES, upstreamEE, exonStart_0base, exonEnd, downstreamES, downstreamEE, GeneID, PValue, FDR, IncLevelDifference, IJC_SAMPLE_1, SJC_SAMPLE_1, IJC_SAMPLE_2, SJC_SAMPLE_2, IncLevel1, IncLevel2
sequence: Character string or character vector of sequence motifs to search for (e.g., "YCAY" or c("YCAY", "CCCC")). Supports IUPAC ambiguity codes. When multiple motifs are provided, behaviour depends on motif_mode.
motif_mode: How to handle multiple motifs. "combined" (default) treats all motifs as a single hit set — a position counts if any motif matches there — and returns one plot. "individual" runs the full analysis independently for each motif and returns a named list of plots (one per motif). Ignored when sequence is a single motif.
genome: A BSgenome object. Default uses BSgenome.Hsapiens.UCSC.hg38.
moving_average: Integer specifying the window size for moving average smoothing. Set to NULL or 0 to disable smoothing. Default is 40.
WidthIntoExon: Integer specifying how many bp to extend into exons. Default is 50.
WidthIntoIntron: Integer specifying how many bp to extend into introns. Default is 250.
p_valueRetainedAndExclusion: P-value threshold for retained/excluded events. Default is 0.05.
p_valueControls: P-value threshold for control events. Default is 0.95.
retained_IncLevelDifference: Inclusion level difference threshold for retained events. Default is 0.1.
exclusion_IncLevelDifference: Inclusion level difference threshold for excluded events. Default is -0.1.
Min_Count: Minimum read count threshold. Default is 50.
groups: Character vector specifying which event groups to process. Options are "Retained", "Excluded", and/or "Control". Default is c("Retained", "Excluded", "Control") to process all groups.
control_multiplier: Numeric multiplier for control sample size. The number of control events sampled per iteration is (n_retained + n_excluded) * control_multiplier. Default is 2.0.
control_iterations: Integer number for sampling iterations for control sampling. The final control frequency is the mean across iterations, with standard deviation shown as a shaded band. Default is 20.
z_threshold: Z-score threshold for significance testing. Default is 1.96 Only used when use_fdr = FALSE.
min_consecutive: Minimum number of consecutive significant positions required to form a significant region. Default is 10. Helps reduce false positives from noise.
one_sided: Logical. If TRUE (default), only test for enrichment (frequency > control). If FALSE, test for both enrichment and depletion.
use_fdr: Logical. If TRUE (default), use FDR-corrected p-values (Benjamini-Hochberg) for significance testing. If FALSE, use z_threshold directly.
fdr_threshold: FDR threshold for significance when use_fdr = TRUE. Default is 0.05.
show_significance: Logical. If TRUE (default), displays colored bars above the plot indicating regions where Retained/Excluded differ significantly from Control based on z-test.
return_data: Logical. If TRUE, returns the frequency data frame instead of a plot. Default is FALSE.
return_diagnostics: Logical. If TRUE, returns a list containing the frequency data, raw bootstrap iteration results (for normality testing), and significance results. Useful for validating bootstrap assumptions. Default is FALSE.
verbose: Logical. If TRUE, prints progress messages. Default is TRUE.
progress_callback: Optional function to report progress. Called with two arguments: current iteration number and total iterations. Default is NULL.
title: Character string for the plot title. Default is "" (no title).
retained_col: Color for the Retained group line. Default is "blue".
excluded_col: Color for the Excluded group line. Default is "red".
control_col: Color for the Control group line. Default is "black".
line_width: Numeric line width for the frequency lines. Default is 0.8.
line_alpha: Numeric alpha (opacity) for the frequency lines. Default is 0.7.
ribbon_alpha: Numeric alpha for the SD ribbon around Control. Default is 0.3.
title_size: Numeric font size for the plot title. Default is 20.
title_color: Color for the plot title text. Default is "black".
axis_text_size: Numeric font size for y-axis tick labels. Default is 11.
boundary_col: Color for the dashed vertical boundary lines. Default is "gray70".
exon_col: Fill color for the skipped (middle) exon in the schematic. Default is "navy".
legend_position: Position of the legend. Default is "bottom".
ylab: Label for the y-axis. Default is "Frequency".

Value

A ggplot object showing sequence frequency across the 4 regions for Retained, Excluded, and Control groups. Significant regions (z-test vs Control) are shown as colored bars above the plot. Returns a data frame if return_data = TRUE. When motif_mode = "individual" and multiple motifs are supplied, returns a named list of ggplot objects (or data frames), one entry per motif.

Details

The function divides each splicing event into 4 regions of (WidthIntoExon + WidthIntoIntron) bp each:

Region 1: Upstream exon end to first intron
Region 2: First intron end to middle (skipped) exon start
Region 3: Middle exon end to second intron
Region 4: Second intron end to downstream exon start

Events are filtered into three groups:

Retained: Significant events (PValue < threshold) with positive inclusion
Excluded: Significant events (PValue < threshold) with negative inclusion
Control: Non-significant events

At each position, the function checks if the target sequence starts there. The frequency is calculated as: (events with motif at position) / (total events)

Examples

if (FALSE) { # \dontrun{
library(BSgenome.Hsapiens.UCSC.hg38)

# Single motif — basic usage (unchanged)
createSequenceMap(SEMATS = sample_se.mats, sequence = "YCAY")

# Multiple motifs — combined: one plot, hit if any motif matches
createSequenceMap(SEMATS = sample_se.mats,
                  sequence = c("YCAY", "CCCC"),
                  motif_mode = "combined")

# Multiple motifs — individual: named list of plots, one per motif
plots <- createSequenceMap(SEMATS = sample_se.mats,
                            sequence = c("YCAY", "CCCC", "GGGG"),
                            motif_mode = "individual")
plots[["YCAY"]]
plots[["CCCC"]]
} # }