Genome Survey Sequencing Analysis
Description
Genome survey sequencing involves low-depth sequencing of small-fragment libraries. K-mer analysis of the resulting reads rapidly provides fundamental information such as genome size, heterozygosity, and the proportion of repeat sequences, which offers a practical basis for planning a de novo whole-genome sequencing strategy for the species.
Principle of Survey Plot Analysis
Survey plot analysis is based on k-mer methods. A k-mer is a nucleotide subsequence of length k, where k is a fixed constant. k is typically odd so that no k-mer can be identical to its own reverse complement, which avoids ambiguity between the forward and reverse strands. The method slides a window of k bases along each read one base at a time, so a read of length L yields L - k + 1 overlapping k-mers, and then counts how often each distinct k-mer occurs across all reads. This count is the frequency, or depth, of that k-mer.
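The counting step can be sketched in a few lines of Python. This is a minimal illustration, not a production k-mer counter: the function names `reverse_complement` and `count_kmers` are hypothetical, and two choices are assumptions not stated in the text above, namely that a k-mer and its reverse complement are counted as the same type (the "canonical" form) and that k-mers containing ambiguous bases such as N are skipped.

```python
from collections import Counter

# Translation table for complementing the four unambiguous bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(reads, k):
    """Count canonical k-mers across all reads.

    Assumption: a k-mer and its reverse complement are the same type,
    represented by whichever of the two sorts first alphabetically.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        # A read of length L yields L - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) - set("ACGT"):
                continue  # assumption: skip k-mers with N or other codes
            counts[min(kmer, reverse_complement(kmer))] += 1
    return counts


if __name__ == "__main__":
    print(count_kmers(["ACGTA"], 3))
```

For real sequencing data, a dedicated k-mer counting tool would normally be used instead, since a Python dictionary does not scale to billions of k-mers.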
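A frequency distribution plot is then drawn with k-mer depth on the horizontal axis and the number (or proportion) of k-mer types at that depth on the vertical axis. The highest peak of this distribution is taken as the primary peak, and its depth approximates the average k-mer depth of the genome. The genome size is then estimated as the total number of k-mers divided by the depth of the primary peak.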
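The estimate described above can be sketched as follows. The function name `estimate_genome_size` and the `min_depth` cutoff are hypothetical; discarding very low-depth k-mers (which are often sequencing errors) before locating the peak is an assumption of this sketch and is not stated in the text. The histogram is assumed to map each depth to the number of distinct k-mers observed at that depth.

```python
def estimate_genome_size(histogram, min_depth=3):
    """Estimate genome size from a k-mer depth histogram.

    histogram: dict mapping depth -> number of distinct k-mers at that depth.
    min_depth: assumed cutoff below which k-mers are treated as errors.
    Returns (estimated_size, primary_peak_depth).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Primary peak: the depth shared by the largest number of k-mer types.
    peak_depth = max(usable, key=usable.get)
    # Total number of k-mer occurrences: each type counted depth times.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


if __name__ == "__main__":
    toy = {1: 1000, 2: 100, 9: 50, 10: 100, 11: 50, 20: 10}
    size, peak = estimate_genome_size(toy)
    print(f"primary peak depth = {peak}, estimated size = {size:.0f} bp")
```

With the toy histogram, the retained k-mers total 2200 occurrences and the primary peak sits at depth 10, so the estimate is 220 bp.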
Because real genomes contain heterozygous sites and repeat sequences, the k-mer distribution curve often deviates from an ideal Poisson distribution, and additional peaks may appear on either side of the primary peak. With significant heterozygosity, a secondary peak may appear at about half the depth of the primary peak, because in a diploid genome a k-mer that spans a heterozygous site occurs on only one of the two haplotypes. Similarly, with a substantial repeat content, additional peaks may appear at integer multiples of the primary peak depth, corresponding to sequences present in two or more copies.
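The peak positions described above can be checked with a simple sketch. The names `find_peaks` and `classify_peaks`, the `min_depth` cutoff, and the `tolerance` value are all hypothetical choices for illustration. The sketch assumes an already smooth histogram; real data are noisy and would usually need smoothing or model fitting before peaks can be called reliably.

```python
def find_peaks(histogram, min_depth=3):
    """Return depths that are strict local maxima of the histogram.

    histogram: dict mapping depth -> number of distinct k-mers at that depth.
    """
    peaks = []
    for d in sorted(histogram):
        if d < min_depth:
            continue  # assumption: ignore the low-depth error region
        here = histogram[d]
        if here > histogram.get(d - 1, 0) and here > histogram.get(d + 1, 0):
            peaks.append(d)
    return peaks


def classify_peaks(histogram, min_depth=3, tolerance=0.15):
    """Label each peak relative to the primary (highest) peak."""
    peaks = find_peaks(histogram, min_depth)
    if not peaks:
        raise ValueError("no peaks found")
    primary = max(peaks, key=lambda d: histogram[d])
    labels = {}
    for d in peaks:
        ratio = d / primary
        if d == primary:
            labels[d] = "primary"
        elif abs(ratio - 0.5) <= tolerance * 0.5:
            labels[d] = "heterozygous"  # near half the primary depth
        elif ratio >= 1.5 and abs(ratio - round(ratio)) <= tolerance:
            labels[d] = "repeat"  # near an integer multiple (2x, 3x, ...)
        else:
            labels[d] = "unclassified"
    return labels


if __name__ == "__main__":
    toy = {9: 20, 10: 60, 11: 25, 19: 80, 20: 200, 21: 90,
           39: 10, 40: 30, 41: 12}
    print(classify_peaks(toy))
```

In the toy histogram the primary peak is at depth 20, so the peak at depth 10 is labeled as heterozygous and the peak at depth 40 as repeat.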
Survey Plot Analysis Content
- Assess genome size
- Evaluate genome heterozygosity
- Determine repeat sequence content
- Assess genome GC content
- Provide strategic recommendations for library construction in the subsequent detailed assembly phase
Significance of Genome Survey Plots
- A prerequisite for initiating whole-genome sequencing
- Understanding genomic differences from closely related species
- Obtaining basic information and assessing the complexity of the genome of a particular species