Find Best Matches using K-mer Spectra

The Find Best Matches using K-mer Spectra tool is inspired by Hasman2013 and Larsen2014 and enables identification of the best matching reference among a specified reference sequence list. This section intends to describe the tool if you would like to use it as such. However, we recommend to use the Type a Known Species or Type Among Multiple Species template workflows instead as described in the chapter Workflow templates.

To identify best matching bacterial genome reference, go to:

        Toolbox | Microbial Genomics Module (Image mgm_folder_closed_flat_16_h_p) | Typing and Epidemiology (beta) (Image typing_epi_folder_closed_16_h_p) | Find Best Matches using K-mer Spectra (Image find_best_kmer_spectra_16_h_p)

  1. In the first wizard window, select the sequences you want want to find a best match sequence for (figure 11.1). Click on Next.

    Image find_best_match1
    Figure 11.1: To identify best matching reference, specification of read file is the first step.

  2. In the next wizard window, a number of settings may be specified (figure 11.2).

    Image find_best_match2
    Figure 11.2: Specify reference list to search across.

    • References may be a single- or multiple list(s) of sequences. It is for example possible to use the full NCBI's bacterial genomes database, or subset(s) of it.
    • K-mer length is the fixed number (k) of DNA bases to search across.
    • Only index k-mers with prefix allows specification of the initial bases of the k-mer sequence to limit the search space.
    • Check for low quality and contamination will perform a quality check of the input data and identify potential contaminations.
    • Fraction of unmapped reads for quality check defines the contamination tolerance as the fraction of the total number of reads not mapping to the best reference.

  3. In the last wizard window, the tool provides the following output options (figure 11.3).

    Image find_best_match4
    Figure 11.3: Choose your output option before saving your results.

    • Output Best Matching Sequence is the best matching genome within the provided reference sequence list(s).
    • Output Best Matching Sequences as a List includes the best matching genomes ordered with the best matching reference sequence first. The list is capped at 100 entries. Content is the same as in the Output Report Table.
    • Output Report Table represents the best matching sequence. It lists all significantly matching references including various statistical values (as describe in Hasman2013 and Larsen2014). The list is capped at 100 entries and the column headers are defined as such:
      • Score Numbers of k-mers from the database seen in the reads.
      • Expected The expected value, i.e., what score should been for the Z-score to be 0 and thus the P-value to be 1.
      • Z Calculated Z-score.
      • P Z-score translated to two-sided P-value.
      • P, corrected P-value with Bonferroni correction.
    • Output Quality Report gives a report with some statistics on possible contaminations and coverage reports for the read mappings. This option is available if the option Check for low quality and contamination was selected in the first wizard window. This report contains the metadata:
      • Best match, % mapped Percent of reads mapping to the best matching reference.
      • Contaminating species, % mapped (taxonomy info) Percent of mapping reads and the most specific accessible taxonomy information for the most probable contaminant.
    • Output read mapping to best match gives the mapping of the reads to the best matching reference. This option is available if the option Check for low quality and contamination was selected in the first wizard window.
    • Output read mapping to contaminants if a contamination is detected, this generates the mapping of the reads (which do not map to the best reference) to the probable contaminants. This option is available if the option Check for low quality and contamination was selected in the first wizard window.

  4. Save your data by first specifying a location to save it and then click on Finish.

  5. To add the obtained best match to a Result Metadata Table, see the section Add to Result Metadata Table. Note that Best match results are added automatically to Result Metadata Table when using the template Type a Known Species and/or Type Among Multiple Species workflow(s) or their customized versions.

Note that the lists of references found given from the Output Best Matching Sequences as a List and Output Quality Report may differ. The reason is that the former list is compiled based on a "Winner takes all" based count of K-mers which attributes all uniquely found K-mers only to the reference with the highest Z-score while the latter list is produced by removing all reads mapping to the best matching reference and use the remaining reads as a basis for determining the next best match. Thus, in the second round the pool of K-mers has been altered and some K-mers, which determined the Z-score of the original second-best match, may have been removed. However, in practice there is a strong correspondence between these two lists.

Once results from the Find Best Matches using K-mer Spectra tool are added to the Result Metadata Table, extra columns are present in the table, including the taxonomy of the best matching references, and in case the quality control was activated, the percentage of reads mapping to the best reference and the most probable contaminating species (see figure 11.4).

Image find_best_match3
Figure 11.4: Taxonomy of the best matching reference and quality information is shown in the Metadata Result Table.