Plant Transcription Factor Database v2.0
Center for Bioinformatics, Peking University, China
Pipeline to construct a comprehensive protein data set
Species with genome sequence
For species whose genomic sequences were available, protein sequences
from the genome annotation were the main source for the protein data
set. For species that are also covered by RefSeq, RefSeq annotations
were added to supplement the genome annotation. In addition, EST-based
data from PlantGDB and UniGene were used to supply annotations missing
from both the genome annotation and the RefSeq records. The following
steps were used to obtain a non-redundant protein data set:
- RGset (RefSeq + Genome):
  - Filtering out putative pseudogenes (those with a * within the protein sequence) from the genome annotations, and clustering identical sequences from the genome annotation and RefSeq (where available) by MD5 checksum. The resulting protein set is called RGset.
- PUset (PlantGDB + UniGene):
  - a) For PUTs from PlantGDB and ESTs (unique unigenes) from UniGene, identifying the coding sequence (CDS) and the corresponding peptide sequence with ESTScan (CDS length >= 150 and score >= 200).
  - b) Mapping the CDSs to the genome with BLAT (identity >= 0.95, coverage >= 0.9, no deletion in the CDS), and filtering out the CDSs that cannot be mapped to the genome.
  - c) Removing proteins encoded by the CDSs from b) that are redundant with RGset. At the protein level, cd-hit-2d was used (identity >= 0.85, coverage >= 0.9). At the nucleic acid level, cd-hit-est-2d was used (identity >= 0.95, coverage >= 0.9).
  - d) Filtering out proteins whose 'x' content was greater than 0.05.
  - e) Clustering the proteins from d) with blastclust (identity >= 0.95 and coverage >= 0.9). The resulting protein set is called PUset.
- RGset and PUset were combined into a comprehensive protein set.
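
The RGset step above (pseudogene filtering followed by MD5-based clustering of identical sequences) can be illustrated with a minimal Python sketch. This is not the pipeline's actual code: the function names `is_putative_pseudogene` and `build_rgset`, the dictionary-based input, the handling of a trailing `*` as a normal terminator rather than an internal stop, and the upper-casing before hashing are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def is_putative_pseudogene(protein: str) -> bool:
    """Return True if '*' occurs inside the sequence.

    Assumption: a trailing '*' only marks the normal end of translation,
    so it is ignored here.
    """
    return "*" in protein.rstrip("*")


def build_rgset(genome_proteins: dict, refseq_proteins: dict) -> dict:
    """Cluster identical protein sequences by MD5 checksum.

    Both inputs map a sequence id to a protein sequence (hypothetical layout).
    Returns {md5_hex: {"seq": sequence, "ids": [member ids]}}.
    """
    # The pseudogene filter is applied to genome annotations only,
    # as described in the text.
    kept_genome = {
        pid: seq
        for pid, seq in genome_proteins.items()
        if not is_putative_pseudogene(seq)
    }

    clusters = defaultdict(lambda: {"seq": None, "ids": []})
    for source in (kept_genome, refseq_proteins):
        for pid, seq in source.items():
            # Assumption: normalise case and drop the terminator so that
            # otherwise identical sequences hash to the same checksum.
            normalized = seq.rstrip("*").upper()
            digest = hashlib.md5(normalized.encode("ascii")).hexdigest()
            clusters[digest]["seq"] = normalized
            clusters[digest]["ids"].append(pid)
    return dict(clusters)
```

Each key of the returned dictionary stands for one non-redundant RGset entry, and its `ids` list records which genome and RefSeq records collapsed into it.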

Species without genome sequence
For species whose genomic sequences were not available, EST-based data
from PlantGDB and UniGene were used as the main sources to construct
the protein data set. The following steps were used to obtain a
non-redundant protein data set:
- Identifying the coding sequence (CDS) and the corresponding peptide sequence with ESTScan (CDS length >= 150 and score >= 200).
- Filtering out proteins whose 'x' content was greater than 0.05.
- Clustering the proteins with blastclust (identity >= 0.95 and coverage >= 0.9). The resulting protein set is called PUset.
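
The two threshold filters in the steps above (ESTScan CDS length and score, then 'x' content) amount to simple per-sequence checks. The following Python sketch shows one way to express them; the function names `x_fraction` and `keep_protein` are hypothetical, counting 'x' case-insensitively is an assumption, and the text does not state the unit of the CDS length, so the value is simply compared as given.

```python
def x_fraction(protein: str) -> float:
    """Fraction of residues that are 'X' (counted case-insensitively; assumption)."""
    if not protein:
        # Assumption: an empty sequence is treated as fully ambiguous
        # so that it is always rejected.
        return 1.0
    return protein.upper().count("X") / len(protein)


def keep_protein(
    cds_length: int,
    estscan_score: float,
    protein: str,
    min_cds_length: int = 150,
    min_score: float = 200,
    max_x_fraction: float = 0.05,
) -> bool:
    """Apply the thresholds quoted in the text to one predicted CDS/protein pair."""
    return (
        cds_length >= min_cds_length
        and estscan_score >= min_score
        # Proteins with 'x' content greater than the limit are filtered out.
        and x_fraction(protein) <= max_x_fraction
    )
```

Sequences that pass `keep_protein` would then go on to the blastclust clustering step, which is an external tool and is not reproduced here.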






