PlantTFDB
Plant Transcription Factor Database
v2.0
Center for Bioinformatics, Peking University, China
Pipeline to construct a comprehensive protein data set
Species with genome sequence
For species with an available genome sequence, protein sequences derived from the genome annotation were the main source for the protein data set. For species also covered by RefSeq, RefSeq annotations were added to supplement the genome annotation. In addition, EST-based data from PlantGDB and UniGene were used to supply proteins missed by both the genome annotation and the RefSeq records. The following steps were used to obtain a non-redundant protein data set:
  1. RGset (RefSeq + Genome):
    1. Filtering out putative pseudogenes (sequences containing * within the protein sequence) from the genome annotation, and clustering identical sequences from the genome annotation and RefSeq (where available) by MD5 checksum. The resulting protein set is called RGset.
  2. PUset (PlantGDB + UniGene):
    1. For PUTs from PlantGDB and ESTs (unique unigenes) from UniGene, identifying the coding sequence (CDS) and corresponding peptide sequence with ESTScan, requiring CDS length >= 150 and score >= 200.
    2. Mapping the CDSs to the genome with BLAT (identity >= 0.95, coverage >= 0.9, no deletion in the CDS), and filtering out CDSs that cannot be mapped to the genome.
    3. Removing proteins encoded by the CDSs from step 2 that are redundant with RGset. At the protein level, cd-hit-2d was used (identity >= 0.85, coverage >= 0.9). At the nucleic acid level, cd-hit-est-2d was used (identity >= 0.95, coverage >= 0.9).
    4. Filtering out proteins whose 'x' content is greater than 0.05.
    5. Clustering the proteins from step 4 with blastclust (identity >= 0.95 and coverage >= 0.9). The resulting protein set is called PUset.
  3. RGset and PUset are combined into a comprehensive protein set.
Pipeline for species with genome sequence
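The RGset step above (dropping putative pseudogenes, then clustering identical sequences by MD5 checksum) can be sketched in Python as follows. This is a minimal illustration, not the PlantTFDB code: the function name `build_rgset`, the dict-based input format, and the decision to strip a trailing `*` (treated here as a stop-codon marker) before looking for an internal `*` are all assumptions.

```python
import hashlib


def build_rgset(genome_proteins, refseq_proteins=None):
    """Cluster identical protein sequences by MD5 checksum.

    Both arguments map a sequence ID to a protein sequence string
    (hypothetical input format). Returns a dict mapping each MD5 hex
    digest to {"sequence": str, "ids": [str, ...]}.
    """
    clusters = {}
    sources = (("genome", genome_proteins), ("refseq", refseq_proteins or {}))
    for source, proteins in sources:
        for seq_id, raw in proteins.items():
            # Assumption: a trailing '*' only marks the stop codon.
            seq = raw.strip().upper().rstrip("*")
            # Putative pseudogene: '*' inside a genome-annotated protein.
            if source == "genome" and "*" in seq:
                continue
            digest = hashlib.md5(seq.encode("ascii")).hexdigest()
            entry = clusters.setdefault(digest, {"sequence": seq, "ids": []})
            entry["ids"].append(seq_id)
    return clusters


# g1 is dropped (internal '*'); g2 and r1 are identical and share one cluster.
example = build_rgset({"g1": "MKT*LV", "g2": "MKTLV*"}, {"r1": "MKTLV"})
```

Keying the clusters on the checksum means each sequence is compared only through a dictionary lookup, so exact duplicates across the two sources collapse without any pairwise alignment.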
Species without genome sequence
For species without an available genome sequence, EST-based data from PlantGDB and UniGene were used as the main sources to construct the protein data set. The following steps were used to obtain a non-redundant protein data set:
  1. Identifying the coding sequence (CDS) and corresponding peptide sequence with ESTScan, requiring CDS length >= 150 and score >= 200.
  2. Filtering out proteins whose 'x' content is greater than 0.05.
  3. Clustering the proteins with blastclust (identity >= 0.95 and coverage >= 0.9). The resulting protein set is called PUset.
Pipeline for species without genome sequence
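The 'x' content filter in the steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: 'x' is taken to mean an unknown residue written as `X` or `x` in the predicted peptide, the content is computed as a fraction of the peptide length, and the names `x_fraction` and `filter_by_x_content` are hypothetical.

```python
def x_fraction(peptide):
    """Return the fraction of residues that are 'X' (case-insensitive).

    An empty peptide is reported as 0.0 to avoid dividing by zero.
    """
    if not peptide:
        return 0.0
    return peptide.upper().count("X") / len(peptide)


def filter_by_x_content(peptides, max_x_fraction=0.05):
    """Keep peptides whose 'X' fraction is not greater than the cutoff.

    `peptides` maps a sequence ID to a peptide string (hypothetical format).
    """
    return {
        seq_id: seq
        for seq_id, seq in peptides.items()
        if x_fraction(seq) <= max_x_fraction
    }


peptides = {
    "p1": "M" + "A" * 19,    # no X
    "p2": "MX" + "A" * 18,   # 1 X in 20 residues
    "p3": "MXX" + "A" * 17,  # 2 X in 20 residues
}
kept = filter_by_x_content(peptides)
```

Because the text removes proteins whose content is "greater than 0.05", the comparison keeps a peptide that sits exactly on the cutoff.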