Bioplib
Protein Structure C Library
Read a PIR sequence file.
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include "SysDefs.h"
#include "macros.h"
#include "seq.h"
Functions
int blReadPIR (FILE *fp, BOOL DoInsert, char **seqs, int maxchain, SEQINFO *seqinfo, BOOL *punct, BOOL *error)
Read a PIR sequence file.
This code is NOT IN THE PUBLIC DOMAIN, but it may be copied according to the conditions laid out in the accompanying file COPYING.DOC.
The code may be modified as required, but any modifications must be documented so that the person responsible can be identified.
The code may not be sold commercially or included as part of a commercial product except as described in the file COPYING.DOC.
This version attempts to read any PIR file following the PIR specifications. It also accepts a few non-standard features: lower case sequence, no star at end of last chain, dashes in the sequence to indicate insertions.
Definition in file ReadPIR.c.
int blReadPIR (FILE *fp, BOOL DoInsert, char **seqs, int maxchain, SEQINFO *seqinfo, BOOL *punct, BOOL *error)
[in] | *fp | File pointer |
[in] | DoInsert | TRUE: read - characters into the sequence. FALSE: skip - characters. |
[in] | maxchain | Max number of chains to read. This is the dimension of the seqs array. N.B. THIS SHOULD BE AT LEAST 1 MORE THAN THE EXPECTED MAXIMUM NUMBER OF SEQUENCES |
[out] | **seqs | Array of character pointers which will be filled in with sequence information. Memory will be allocated for any sequence length. |
[out] | *seqinfo | This structure will be filled in with extra information about the sequence. Header & title information and details of any punctuation. |
[out] | *punct | TRUE if any punctuation found. |
[out] | *error | TRUE if an error occurred (e.g. memory allocation) |
This is an all-singing, all-dancing PIR reader which should handle all legal PIR files and some (slightly) incorrect ones. The only requirements of the code are that the PIR file should have 2 title lines per entry, the first line starting with a > sign.
The routine will handle files containing multiple entries. Successive calls will return information on the next entry. The routine will return 0 when there are no more entries.
Header line: Must start with >. Will handle files which don't have the proper P1; or F1; parts of the header as well as those which do.
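The header rule above can be illustrated with a small, self-contained sketch. This is not the code in ReadPIR.c: the function name parse_pir_header and its parameters are hypothetical, and it assumes the optional type field is exactly two characters followed by a semicolon (as in P1; or F1;).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Bioplib. Copies the entry code from a
   PIR header line into `code` and the two-character type (e.g. "P1") into
   `type`, or leaves `type` empty if the type field is absent.
   Returns 1 on success, 0 if the line does not start with '>'. */
int parse_pir_header(const char *line, char type[3],
                     char *code, size_t codelen)
{
   const char *p;
   size_t      n;

   assert(line != NULL && type != NULL && code != NULL && codelen > 0);
   type[0] = '\0';
   code[0] = '\0';

   if(line[0] != '>')
      return 0;

   p = line + 1;

   /* Assumption: a standard header has a two-character type then ';' */
   if(p[0] != '\0' && p[1] != '\0' && p[2] == ';')
   {
      type[0] = p[0];
      type[1] = p[1];
      type[2] = '\0';
      p += 3;
   }

   /* Copy the code up to the end of the line, truncating if necessary */
   n = strcspn(p, "\r\n");
   if(n >= codelen)
      n = codelen - 1;
   memcpy(code, p, n);
   code[n] = '\0';

   return 1;
}
```

With this sketch, both ">P1;1ABC" and ">1ABC" yield the code 1ABC. Only the first sets the type field.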
Title line: Will read the name and source fields if correctly separated by a -, otherwise copies all information into the name.
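A minimal sketch of the title-line rule follows. It is an illustration only, not the library's implementation: split_pir_title and copy_span are hypothetical names, and the sketch assumes that a "correctly separated" title uses a dash with a space on each side.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy at most dstlen-1 characters of the n-character span src into dst */
static void copy_span(char *dst, size_t dstlen, const char *src, size_t n)
{
   if(n >= dstlen)
      n = dstlen - 1;
   memcpy(dst, src, n);
   dst[n] = '\0';
}

/* Hypothetical helper, not part of Bioplib. Splits "name - source" into
   its two fields. If no " - " separator is found, the whole line is
   copied into name and source is left empty. */
void split_pir_title(const char *line,
                     char *name,   size_t namelen,
                     char *source, size_t sourcelen)
{
   size_t      len;
   const char *sep;

   assert(line != NULL && name != NULL && source != NULL);
   assert(namelen > 0 && sourcelen > 0);

   len = strcspn(line, "\r\n");      /* length without the line ending   */
   sep = strstr(line, " - ");        /* assumed separator: space-dash-space */

   if(sep != NULL && (size_t)(sep - line) < len)
   {
      size_t namechars = (size_t)(sep - line);
      copy_span(name,   namelen,   line,    namechars);
      copy_span(source, sourcelen, sep + 3, len - namechars - 3);
   }
   else
   {
      copy_span(name, namelen, line, len);
      source[0] = '\0';
   }
}
```

The split happens at the first separator, so a source field that itself contains " - " is kept whole.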
Sequence: May contain allowed punctuation. This will set the punct flag, and information on the types found will be placed in seqinfo. White space and line breaks are ignored. Each chain should end with a *, but the routine will accept the last chain of an entry with no *. While the standard requires upper case text, this routine will handle lower case and convert it to upper case. The routine copes fairly well with last chains not terminated by a *, with one exception: a last chain ending with a / that is not followed by a * but is followed by a text line will be identified as incomplete rather than truncated. If the DoInsert flag is set, - signs in the sequence will be read as part of the sequence; otherwise they will be skipped. This handling of - signs is an addition to the PIR standard.
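The sequence rules can be sketched for a single chain as shown here. This is a simplified illustration under stated assumptions, not the code in ReadPIR.c: clean_pir_chain is a hypothetical name, the output buffer is supplied by the caller rather than allocated, and every character that is not a letter, white space, - or * is simply treated as punctuation and dropped, without recording its type.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of Bioplib. Normalises the raw text of
   one chain into out and returns the number of characters written
   (excluding the terminating NUL). Output is truncated if out is full.
   *punct is set to 1 if any punctuation was seen, and *terminated to 1
   if the chain ended with a '*'. */
size_t clean_pir_chain(const char *raw, int do_insert,
                       char *out, size_t outlen,
                       int *punct, int *terminated)
{
   size_t      n = 0;
   const char *p;

   assert(raw != NULL && out != NULL && outlen > 0);
   assert(punct != NULL && terminated != NULL);
   *punct      = 0;
   *terminated = 0;

   for(p = raw; *p != '\0'; p++)
   {
      unsigned char c = (unsigned char)*p;

      if(c == '*')                  /* end-of-chain marker              */
      {
         *terminated = 1;
         break;
      }
      if(isspace(c))                /* white space and line breaks      */
         continue;

      if(c == '-')
      {
         if(!do_insert)             /* skip dashes unless requested     */
            continue;
      }
      else if(isalpha(c))
      {
         c = (unsigned char)toupper(c);   /* accept lower case          */
      }
      else                          /* assumption: anything else is     */
      {                             /* punctuation and is not stored    */
         *punct = 1;
         continue;
      }

      if(n + 1 < outlen)
         out[n++] = (char)c;
   }
   out[n] = '\0';

   return n;
}
```

For the input "acd-ef gh*" this sketch produces ACD-EFGH when do_insert is non-zero and ACDEFGH when it is zero. A chain with no trailing * is still accepted, with *terminated left at 0.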
Text lines: Text lines after an entry (beginning with R;, C;, A;, N; or F;) are ignored.
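Recognising a text line amounts to a check on the first two characters. The sketch below shows one way to write that check. The name is_pir_text_line is hypothetical, and the test is an assumption about how such lines could be detected rather than a copy of the library code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of Bioplib. Returns 1 if the line
   begins with R;, C;, A;, N; or F; and 0 otherwise. */
int is_pir_text_line(const char *line)
{
   assert(line != NULL);

   /* Check for the empty string first, because strchr() would
      otherwise match the terminating NUL of the letter set */
   return line[0] != '\0' &&
          strchr("RCANF", line[0]) != NULL &&
          line[1] == ';';
}
```

A sequence line such as ACDEF* is not mistaken for a text line, because its second character is not a semicolon.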