SlidingWindowSearch - search a string for specified patterns using a sliding window
use SlidingWindowSearch qw(ListMatches CountMatches ParsePatternSpecFile);
@matchList = ListMatches($seq,$pattern);
@matchList = ListMatches($seq,$pattern,$patWidth,$patWidthExtended);
%matchCount = CountMatches($seq,@patList);
%matchCount = CountMatches($seq,@patList,$patWidth,$patWidthExtended,$method,%matchCount);
($iFail,$patterns_ref,$patWidth,$patWidthExtended) = ParsePatternSpecFile($filename,@lines);
SlidingWindowSearch provides routines to search a string for substrings matching a specified pattern or patterns, using a sliding window approach. By default, matching substrings have exactly the length of the supplied pattern(s). Alternatively, if an "extended pattern length" is specified, any substring of that length which contains the specified pattern is considered to match.
A routine is also provided to parse text files containing specifications of the patterns.
One application of the routines is to identify polypeptide fragments, created by the action of protease enzymes on proteins, that can be recognised and displayed by specified MHC Class I and Class II receptors.
The following main routines are provided:
ListMatches($seq,$pattern,$patWidth,$patWidthExtended)
Returns a list of the sub-strings of $seq that match the pattern $pattern (specified as a Perl regular expression).
$patWidth specifies the width of the pattern, i.e. the number of characters of the sub-string to be matched.
$patWidthExtended specifies the width of the sub-string sought, such that a match of the pattern at any position within the sub-string should be accepted.
For the standard routine provided, $patWidth and $patWidthExtended are required only if the extended pattern matching described above is sought.
An alternative implementation, SlidingWindowSearch::ListMatches_UsingGrep, is also available. For that implementation, at least one of $patWidth and $patWidthExtended must always be specified.
Matching sub-strings are returned in the order in which they occur in the full string, with duplicates included.
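The behaviour described above can be illustrated with a minimal sketch. This is not the module's own code: the name list_matches_sketch is hypothetical, and the sketch assumes that $patWidth is always supplied in exact mode and that the pattern is a fixed-width regular expression.

```perl
use strict;
use warnings;

# Hypothetical stand-in for ListMatches, for illustration only.
sub list_matches_sketch {
    my ($seq, $pattern, $pat_width, $pat_width_ext) = @_;

    # The window is the extended width if one is given, else the pattern width.
    my $window = defined $pat_width_ext ? $pat_width_ext : $pat_width;

    my @matches;
    for my $start (0 .. length($seq) - $window) {
        my $sub = substr($seq, $start, $window);

        # Exact mode: the pattern must span the whole window.
        # Extended mode: the pattern may occur anywhere inside the window.
        my $hit = defined $pat_width_ext
            ? $sub =~ /$pattern/
            : $sub =~ /^(?:$pattern)$/;

        # Matches are kept in order of occurrence, duplicates included.
        push @matches, $sub if $hit;
    }
    return @matches;
}

my @exact    = list_matches_sketch('ABCABD', 'AB.', 3);      # ('ABC', 'ABD')
my @extended = list_matches_sketch('ABCABD', 'AB.', 3, 4);   # ('ABCA', 'CABD')
```

In the extended call, the window 'BCAB' is rejected because the three-character pattern does not fit wholly inside it.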
CountMatches($seq,@patList,$patWidth,$patWidthExtended)
Returns a hash containing the sub-strings that match one or more of the supplied list of patterns @patList, and how many of those patterns each sub-string matched. (Multiple matches to the same pattern are suppressed.)
The arguments $patWidth and $patWidthExtended have the same meanings as for ListMatches above.
Additional arguments $method and %matchCount can also be supplied.
$method is a text string specifying the implementation of ListMatches to be used. Currently acceptable arguments are 'SlidingWindow' (the default), and 'UsingGrep' (as described above).
The optional final argument %matchCount allows an existing hash from a previous call to the subroutine to be updated.
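The counting rule can be sketched as follows. This is an illustration under stated assumptions rather than the module's implementation: count_matches_sketch is a hypothetical name, it handles exact-width matching only, and it takes the pattern list and any existing counts as references, which differs from the documented calling convention.

```perl
use strict;
use warnings;

# Hypothetical stand-in for CountMatches, for illustration only.
sub count_matches_sketch {
    my ($seq, $patterns_ref, $width, $counts_ref) = @_;

    # Start from an existing set of counts if one was supplied.
    my %count = $counts_ref ? %$counts_ref : ();

    for my $pattern (@$patterns_ref) {
        my %seen;    # sub-strings already credited to this pattern
        for my $start (0 .. length($seq) - $width) {
            my $sub = substr($seq, $start, $width);
            next unless $sub =~ /^(?:$pattern)$/;

            # Each pattern contributes at most 1 to a given sub-string.
            $count{$sub}++ unless $seen{$sub}++;
        }
    }
    return %count;
}

# 'ABC' occurs twice but matches two of the three patterns, so its count is 2.
my %count = count_matches_sketch('ABCABC', ['AB.', '.BC', 'X..'], 3);
```

Passing \%count as the fourth argument to a later call would add the new matches to the existing totals.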
ParsePatternSpecFile($filename,@lines)
Parses lines that have been read from a pattern specification file, returning a list ($iFail,$patterns_ref,$patWidth,$patWidthExtended) containing an exit state; a reference to a list of the patterns themselves, translated into Perl regular expressions; the specified width of the patterns; and any specified extended width for the patterns. (If no such extended width has been specified, this entry will be undefined.)
An exit state of 0 indicates success. Non-zero exit indicates failure, which is reported with the line number of the specification file, and an error message. (See code for details).
The variable $filename is used only for convenience, to report that processing of the file is underway.
Set and Get internal global variable $debug.
SetDebug(1) produces more detailed progress information, including a count of the number of sub-strings matching each pattern.
Set and Get internal global variable $silent.
SetSilent(1) prevents the logging of certain information, including a listing of the patterns to be matched, and the length of the string to be analysed.
Set and Get internal global variable $checks.
SetChecks(0) prevents checking for blanks, control codes and other non-[A-Z] characters in the supplied string to be searched.
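A check of this kind might look like the sketch below. The name find_bad_chars_sketch and its return format are assumptions made for illustration; the module's own check may report problems differently.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [offset, character] pairs for every
# character of the string that is not an upper-case letter A-Z.
sub find_bad_chars_sketch {
    my ($seq) = @_;
    my @bad;
    while ($seq =~ /([^A-Z])/g) {
        # pos() is the offset just after the match, so step back by one.
        push @bad, [ pos($seq) - 1, $1 ];
    }
    return @bad;
}

# A blank at offset 3 and a newline at offset 6 are both reported.
my @bad = find_bad_chars_sketch("ACD EF\n");
```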
Pattern specification files may specify a number of patterns, each separated by a blank line. The specifications should otherwise be contiguous: blank lines will be taken as signifying the start of a new pattern.
Lines beginning with # will be treated as comments, and ignored. Such lines may contain eg the names of MHC allele binding motifs from which each pattern has been derived.
Specification lines specifying the pattern width and (optionally) the extended pattern width must appear before any patterns are specified. These specification lines start with a *, eg:
* Pattern width = 9
* Extended pattern width = 15
These lines may be repeated, but the values may not be changed.
Each pattern is specified by giving the test position, followed by the character values allowed at that position, eg:
1 = A, G, P
3 = N, D
Commas and spaces are optional; a colon may be used instead of the '='.
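To make the format concrete, here is a minimal sketch of how such a specification could be translated into fixed-width regular expressions. It is not the module's parser: parse_spec_sketch is a hypothetical name, it dies on bad input instead of returning an exit state, and it does not check repeated width lines or positions beyond the pattern width. It assumes that unspecified positions accept any character.

```perl
use strict;
use warnings;

# Hypothetical parser for the format described above, for illustration only.
sub parse_spec_sketch {
    my (@lines) = @_;
    my ($width, $ext_width, @patterns, %allowed);

    # Turn the positions gathered for one pattern into a regex.
    my $flush = sub {
        return unless %allowed;
        die "Pattern width not specified\n" unless defined $width;
        my $regex = '';
        for my $pos (1 .. $width) {
            $regex .= exists $allowed{$pos} ? "[$allowed{$pos}]" : '.';
        }
        push @patterns, $regex;
        %allowed = ();
    };

    for my $line (@lines) {
        chomp(my $text = $line);
        next if $text =~ /^#/;                          # comment line
        if ($text =~ /^\s*$/) { $flush->(); next; }     # blank line ends a pattern
        if ($text =~ /^\*\s*Extended pattern width\s*=\s*(\d+)/i) { $ext_width = $1; next; }
        if ($text =~ /^\*\s*Pattern width\s*=\s*(\d+)/i)          { $width     = $1; next; }
        if ($text =~ /^\s*(\d+)\s*[=:]\s*(.+)$/) {
            my ($pos, $chars) = ($1, $2);
            $chars =~ s/[,\s]//g;                       # commas and spaces are optional
            $allowed{$pos} = $chars;
            next;
        }
        die "Unrecognised line: $text\n";
    }
    $flush->();                                         # last pattern has no trailing blank
    return (\@patterns, $width, $ext_width);
}

my ($patterns_ref, $width, $ext_width) = parse_spec_sketch(
    '# example motif',
    '* Pattern width = 9',
    '* Extended pattern width = 15',
    '',
    '1 = A, G, P',
    '3 = N, D',
    '',
    '2 : LM',
    '9 = V',
);
# $patterns_ref now holds '[AGP].[ND]......' and '.[LM]......[V]'.
```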
No symbols are exported by default.
Routines ListMatches, CountMatches and ParsePatternSpecFile can be made available using
use SlidingWindowSearch qw(ListMatches CountMatches ParsePatternSpecFile)
or alternatively
use SlidingWindowSearch ':all'
No support is offered.
Error states should probably be signalled back to the calling program for handling, rather than the module generating fatal errors.
Is there any simple way to get Perl to work out the width of a general (fixed-width) regex from the regex itself? Haven't thought of one...
Might be sensible to pass references to $str around, rather than copying the whole string itself. But a downside is that then constant strings, eg as used in the unit tests, need special attention; and if one tries to hide the reference-ness from the user using subroutine prototypes, it then becomes rather confusing that they can't call the subroutine with a constant string...
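A small sketch of the trade-off, using a hypothetical helper count_windows_by_ref that is not part of the module. Without a prototype, the caller must take the reference explicitly, which works for variables and literals alike. With a (\$) prototype the backslash would be hidden for variables, but a call with a literal string would then fail to compile.

```perl
use strict;
use warnings;

# Hypothetical helper: receives a reference, so the string is not copied.
sub count_windows_by_ref {
    my ($str_ref, $width) = @_;
    my $n = length($$str_ref) - $width + 1;
    return $n > 0 ? $n : 0;
}

my $seq = 'ACDEFGHIK';
my $from_variable = count_windows_by_ref(\$seq, 3);          # reference to a variable
my $from_constant = count_windows_by_ref(\'ACDEFGHIK', 3);   # reference to a literal
```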
Some formatting glitches being produced by pod2html. Not sure why they're happening or how to sort them out...
James Heald <j.heald@ucl.ac.uk>
Copyright 2012 by James Heald
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.