NAME

SlidingWindowSearch - search a string for specified patterns using a sliding window


SYNOPSIS

  use SlidingWindowSearch qw(ListMatches CountMatches ParsePatternSpecFile);
  @matchList = ListMatches($seq,$pattern);
  @matchList = ListMatches($seq,$pattern,$patWidth,$patWidthExtended);
  %matchCount = CountMatches($seq,@patList);
  %matchCount = CountMatches($seq,@patList,
       $patWidth,$patWidthExtended,$method,%matchCount);
  ($iFail, $patterns_ref,$patWidth,$patWidthExtended)
       = ParsePatternSpecFile($filename,@lines);


DESCRIPTION

SlidingWindowSearch provides routines to search a string for substrings matching a specified pattern or patterns, using a sliding window approach. By default, a matching substring has exactly the length of the supplied pattern(s). Alternatively, if an "extended pattern length" is specified, any substring of that length which contains the specified pattern is considered a match.


A routine is also provided to parse text files containing specifications of the
patterns.

One application of the routines is to identify polypeptide fragments created by
the action of protease enzymes on proteins that can be recognised and displayed
by specified MHC Class I and Class II receptors.

The following main routines are provided:

@matchList = ListMatches($seq,$pattern,$patWidth,$patWidthExtended)

Returns a list of the sub-strings of $seq that match the pattern $pattern (specified as a Perl regular expression).

$patWidth specifies the width of the pattern, i.e. the number of characters of the substring to be matched. $patWidthExtended specifies the width of the sub-string sought such that a match of the pattern at any position within the sub-string should be accepted.

For the standard routine provided, $patWidth and $patWidthExtended are required only if the extended pattern matching described above is sought.

An alternative implementation, SlidingWindowSearch::ListMatches_UsingGrep, is also available. For that implementation, at least one of $patWidth and $patWidthExtended must always be specified.

Matching sub-strings are returned in the order in which they occur in the full string, with duplicates included.
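
The behaviour described above can be illustrated with a minimal, self-contained sketch. It does not call the module and is not its implementation: the helper name sketch_list_matches and the sample string are hypothetical, and the sketch assumes that the window width is $patWidthExtended when given and $patWidth otherwise.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the module's own code: slide a window along
# $seq and keep every window that matches $pattern.
sub sketch_list_matches {
    my ($seq, $pattern, $pat_width, $pat_width_ext) = @_;

    # Assumption: use the extended width when given, else the pattern width.
    my $window = defined $pat_width_ext ? $pat_width_ext : $pat_width;

    my @matches;
    for my $start (0 .. length($seq) - $window) {
        my $sub = substr($seq, $start, $window);
        # The unanchored match accepts the pattern at any position in the window.
        push @matches, $sub if $sub =~ /$pattern/;
    }
    return @matches;    # in order of occurrence, duplicates included
}

my @exact    = sketch_list_matches('AKNDAKN', 'A.N', 3);
my @extended = sketch_list_matches('AKNDAKN', 'A.N', 3, 5);
print "exact:    @exact\n";       # exact:    AKN AKN
print "extended: @extended\n";    # extended: AKNDA NDAKN
```

In the exact case the window is as wide as the pattern, so only whole-window matches are possible; in the extended case each 5-character window is accepted if the 3-character pattern occurs anywhere inside it.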

%matchCount = CountMatches($seq,@patList,$patWidth,$patWidthExtended)

Returns a hash whose keys are the sub-strings that match one or more of the supplied list of patterns @patList, and whose values are the number of those patterns that each sub-string matched. (Multiple matches to the same pattern are suppressed.)

The arguments $patWidth and $patWidthExtended have the same meanings as for ListMatches above.

Additional arguments $method and %matchCount can also be supplied. $method is a text string specifying the implementation of ListMatches to be used. Currently acceptable values are 'SlidingWindow' (the default) and 'UsingGrep' (as described above).

The optional final argument %matchCount allows an existing hash %matchCount from a previous call to the subroutine to be updated.
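
The counting rule can be sketched in the same self-contained way. Again this is only an illustration, not the module's code: sketch_count_matches is a hypothetical helper, it takes the pattern list and any earlier counts as references (an assumption made to keep the argument list unambiguous), and it handles only the exact-width case.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the module's own code: for each window of
# $width characters, count how many of the patterns it matches.
sub sketch_count_matches {
    my ($seq, $patterns_ref, $width, $counts_ref) = @_;

    # Optionally continue from the counts of an earlier call.
    my %count = $counts_ref ? %$counts_ref : ();

    for my $pattern (@$patterns_ref) {
        my %seen;    # windows already credited to this pattern
        for my $start (0 .. length($seq) - $width) {
            my $sub = substr($seq, $start, $width);
            next unless $sub =~ /$pattern/;
            next if $seen{$sub}++;    # suppress repeat matches to one pattern
            $count{$sub}++;
        }
    }
    return %count;
}

my %count = sketch_count_matches('AKNDAKN', ['^A.N$', '^.KN$', '^N..$'], 3);
print "$_ => $count{$_}\n" for sort keys %count;
# AKN => 2
# NDA => 1

# Updating the earlier counts with the results for a second string.
my %more = sketch_count_matches('NDAQ', ['^N..$'], 3, \%count);
```

Here AKN occurs twice in the string but is credited only once per pattern, so its count of 2 reflects the two distinct patterns it matches.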

ParsePatternSpecFile($filename,@lines)

Parses lines that have been read from a pattern specification file, returning a list ($iFail,$patterns_ref,$patWidth,$patWidthExtended) containing an exit state; a reference to a list of the patterns themselves, translated into Perl regular expressions; the specified width of the patterns; and any specified extended width for the patterns. (If no extended width has been specified, this entry will be undefined.)

An exit state of 0 indicates success. A non-zero exit state indicates failure, which is reported with the line number of the specification file and an error message. (See code for details.)

The variable $filename is used only for convenience, to report that processing of the file is underway.

ADDITIONAL CONVENIENCE ROUTINES

SlidingWindowSearch::SetDebug($debugLevel)
SlidingWindowSearch::GetDebug()

Set and get the internal global variable $debug.

SetDebug(1) produces more detailed progress information, including a count of the numbers of sub-strings matching each pattern.

SlidingWindowSearch::SetSilent($silentLevel)
SlidingWindowSearch::GetSilent()

Set and get the internal global variable $silent.

SetSilent(1) prevents the logging of certain information, including a listing of the patterns to be matched, and the length of the string to be analysed.

SlidingWindowSearch::SetChecks($checksLevel)
SlidingWindowSearch::GetChecks()

Set and get the internal global variable $checks.

SetChecks(0) prevents checking for blanks, control codes and other non-[A-Z] characters in the supplied string to be searched.

FORMAT FOR PATTERN SPECIFICATION FILES

Pattern specification files may specify a number of patterns, each separated by a blank line. The specifications should otherwise be contiguous: blank lines will be taken as signifying the start of a new pattern.

Lines beginning with # will be treated as comments, and ignored. Such lines may contain, e.g., the names of MHC allele binding motifs from which each pattern has been derived.

Specification lines specifying the pattern width and (optionally) the extended pattern width must appear before any patterns are specified. These specification lines start with a *, e.g.:

    * Pattern width = 9
    * Extended pattern width = 15

These lines may be repeated, but the values may not be changed.

Each pattern is specified by giving the test position, followed by the character values allowed at that position, e.g.:

    1 = A, G, P
    3 = N, D

Commas and spaces are optional; a colon may be used instead of the '='.
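
As an illustration of how such position lines could map onto a Perl regular expression, the sketch below builds one character class per constrained position and a '.' elsewhere. This is an assumed translation for illustration only: sketch_spec_to_regex is a hypothetical helper, and the regular expressions actually produced by ParsePatternSpecFile may differ in form.

```perl
use strict;
use warnings;

# Hypothetical translation, NOT the module's own code: turn position
# lines into a regex of $width items, '.' where a position is unconstrained.
sub sketch_spec_to_regex {
    my ($width, @lines) = @_;
    my @class = ('.') x $width;

    for my $line (@lines) {
        next if $line =~ /^\s*#/;    # comment line
        my ($pos, $chars) = $line =~ /^\s*(\d+)\s*[=:]\s*(.+?)\s*$/
            or die "Cannot parse line: $line\n";
        die "Position $pos outside 1..$width\n" if $pos < 1 || $pos > $width;
        $chars =~ s/[,\s]+//g;       # commas and spaces are optional
        $class[$pos - 1] = "[$chars]";
    }
    return join '', @class;
}

my $regex = sketch_spec_to_regex(9, '1 = A, G, P', '3 : ND');
print "$regex\n";    # [AGP].[ND]......
```

With a pattern width of 9, the two example lines constrain positions 1 and 3 and leave the remaining seven positions free.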


EXPORTS

No symbols are exported by default.

Routines ListMatches, CountMatches and ParsePatternSpecFile can be made available using

    use SlidingWindowSearch qw(ListMatches CountMatches ParsePatternSpecFile);

or alternatively

    use SlidingWindowSearch ':all';


SUPPORT

No support is offered.


TODO


AUTHOR

James Heald <j.heald@ucl.ac.uk>


COPYRIGHT AND LICENSE

Copyright 2012 by James Heald

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.