FAQ

Is there a Windows version?

No. The main problem is that some of the proprietary software used in LotuS2 is not Windows compatible. Therefore, we will not release a Windows version any time soon.

I’m writing my paper but I have no idea what software (and what version of the software) was used inside LotuS2. Do I need to report this at all?

Yes. LotuS2 uses a lot of external software for the analysis, and thus we strongly recommend citing and mentioning all the resources (software, databases) used in a LotuS2 run together with their versions (this is also important for reproducibility). To make this as easy as possible, each LotuS2 run produces a file “/LotuSLogS/citations.txt”. It lists all the external software used in the run, with citation and software version. The LotuS2 version is reported in “/LotuSLogS/LotuS_run.log”, along with some statistics on the pipeline run that might be useful when reporting the methods.

I need to upload my demultiplexed data to a public repository, how do I get the demultiplexed files?

Use the option "-saveDemultiplex 1".

There are a lot of options, is it necessary to change any of these?

No. LotuS2 comes with a set of robust default parameters that work very well in most cases. However, to get the most out of your data, you might want to fine-tune and customize some of these. For example, -offtargetsdb can be specified to remove host contaminants, and -saveDemultiplex to save your demultiplexed data in different formats. You can also change the cutoff length of sequences and many other settings in the sdm config file (see the sdm-related FAQs). We therefore recommend having a look at the documentation on the website.

I already have OTUs/ASVs (from a previous LotuS run or from another pipeline). Can I use LotuS2 just to assign their taxonomy with the various algorithms implemented?

Yes. Using “lotus2 -taxOnly YourOTUs.fasta”, LotuS2 will only output the tax assignments. With further arguments, e.g. “lotus2 -taxOnly YourOTUs.fasta -refDB HITdb,SLV -taxAligner lambda -LCA_cover 0.5 -LCA_frac 0.1”, you can assign your OTUs with HITdb first, then classify the remaining unknown OTUs with SILVA, and also specify the LCA (lowest common ancestor) parameters that determine the reported taxonomy.

How can I look for sequencing primers in my raw sequence input? Are they automatically removed?

It is quite important to make sure you have high-quality amplicons, and LotuS2 will automatically remove primer sequences if they are detected. There are two ways to feed LotuS2 the target primer sequences:
The easier way is to use command-line arguments, e.g. `lotus2 … -forwardPrimer TCCGGTTGATCCYGCBRG -reversePrimer GGCCATGCAMYWCCTCTC`
The second way is to define two additional columns in your mapping file with the headers “ForwardPrimer” and “ReversePrimer”, and to provide the primer in every entry of these columns (e.g. TCCGGTTGATCCYGCBRG and GGCCATGCAMYWCCTCTC in the above example).
Note that LotuS2 can also work with only the fwd or rev primer supplied, but it’s cleaner to work with both. Also, LotuS2 can recognize ambiguous bases (e.g. M, Y, W) in primer sequences.
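To illustrate what recognizing ambiguous bases means, here is a minimal Python sketch that expands IUPAC codes into a regular expression and trims a read up to the end of the primer. It is not how sdm is implemented: the function names are hypothetical, and only exact matches are handled (no mismatches or indels).

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def primer_to_regex(primer):
    """Translate a primer with IUPAC codes into a compiled regular expression."""
    return re.compile("".join(f"[{IUPAC[base]}]" for base in primer.upper()))

def strip_primer(read, primer):
    """Return the read with the primer and everything upstream removed,
    or None when the primer is not found (hypothetical helper)."""
    match = primer_to_regex(primer).search(read.upper())
    return read[match.end():] if match else None
```

For example, `strip_primer(read, "TCCGGTTGATCCYGCBRG")` accepts either C or T at the position of the Y.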

How do I remove reads that might still have a primer attached/did not get their primer removed properly?

It is recommended to remove sequencing primers via LotuS2 if they are still present in the input sequences (see the answer above). However, a substantial fraction of amplicons typically does not have primer sequences, for various technical reasons. In some cases you might decide that this is ok, but in general it is recommended to at least ensure that a fwd primer is found (and removed). This practice will lead to much cleaner OTU/ASV clusters and is an additional quality check.
How the pipeline handles reads without primers is determined in the sdm config files. Here, the entries “RejectSeqWithoutRevPrim T” and “RejectSeqWithoutFwdPrim T” instruct the pipeline to remove any amplicons that do not have both a fwd and a rev primer. Changing only “RejectSeqWithoutRevPrim F” would result in reads without a rev primer still being used in clustering, as long as a fwd primer is detected (and removed). This might be advisable if a large fraction of reads fails to have a detectable rev primer (this can sometimes be the case in MiSeq paired-end runs). Note that in the case of Illumina amplicon sequencing, it is primarily the 2nd read of a pair that is checked for the presence of rev primers.
Note that these options are different from the “*RejectSeqWithoutFwdPrim” and “*RejectSeqWithoutRevPrim” options. The starred options apply only to mid-quality reads and are set to “F” by default, so mid-quality reads are not rejected when fwd/rev primers are not detected. This is because mid-quality reads are not used in OTU/ASV generation, but only later if they match existing clusters, and here not having a fwd/rev primer is not considered essential (but you can change these options to “T” if you need to).
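The effect of the two options for high-quality reads can be sketched as a small decision function. This is only an illustration of the logic described above, not sdm code; the function and argument names are made up for this example.

```python
def keep_read(fwd_found, rev_found,
              reject_without_fwd=True, reject_without_rev=True):
    """Decide whether a read stays in the clustering input.

    reject_without_fwd / reject_without_rev mirror the config entries
    RejectSeqWithoutFwdPrim / RejectSeqWithoutRevPrim (T = True, F = False).
    """
    if reject_without_fwd and not fwd_found:
        return False
    if reject_without_rev and not rev_found:
        return False
    return True
```

With both options set to “T”, only reads with both primers pass; with “RejectSeqWithoutRevPrim F”, a read with just a fwd primer is kept.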

Generating a map for my experiment is tedious, is there a faster way?

Yes. If your reads are already demultiplexed so that each fastq file/file pair represents a single sample you can use the command “lotus2 -create_map mymap.txt -i /home/dir_with_demultiplex_fastq”.

What is the difference between sdm and LotuS2?

sdm (simple demultiplexer) is an integral part of LotuS2, responsible for quality filtering, demultiplexing, sequence format changes and seed extension. However, sdm was conceptualized as stand-alone software; e.g. I personally use sdm to quality filter sequence files before assembling bacterial genomes. To get more information on the sdm interface, execute the sdm binary without arguments ("./sdm") and a help text is displayed.

Can I modify the sdm config files?

Yes. We provide several default sdm configuration files, for example depending on the target gene or sequencing platform used. The files can be found in the configs/ folder (or in the conda folder for LotuS2 when installed via conda), and you can provide one to the pipeline with the flag “-s configs/sdm_miSeq.txt”. If you want to modify a config, make sure to copy the file first so that you do not modify the original.
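Copying and editing a config can also be scripted. The Python sketch below assumes that the config holds one whitespace-separated `key value` pair per line (as in “RejectSeqWithoutRevPrim T”); the helper names and the output file name are hypothetical, and any trailing comment on the edited line is dropped.

```python
from pathlib import Path

def set_option(config_text, key, value):
    """Return the config text with the first line for `key` set to `value`.

    Assumes one whitespace-separated `key value` pair per line.
    Starred keys such as *RejectSeqWithoutRevPrim are left alone,
    because only an exact match on the first field is replaced.
    """
    out, changed = [], False
    for line in config_text.splitlines():
        fields = line.split()
        if not changed and len(fields) >= 2 and fields[0] == key:
            line = f"{key}\t{value}"
            changed = True
        out.append(line)
    if not changed:
        raise KeyError(f"{key} not found in config")
    return "\n".join(out) + "\n"

def customise_config(original, copy, key, value):
    """Write a modified copy of a config file; the original stays untouched."""
    text = Path(original).read_text()
    Path(copy).write_text(set_option(text, key, value))
```

A call such as `customise_config("configs/sdm_miSeq.txt", "my_sdm_miSeq.txt", "RejectSeqWithoutRevPrim", "F")` would then be followed by running the pipeline with “-s my_sdm_miSeq.txt”.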

Can I use gzip compressed files?

Yes, for sequence files. Compression is not supported for config and mapping files, but all sequence files can be compressed using gzip. Just make sure the file ending is ".gz" and sdm will assume that the file is compressed. Note that on some systems, compiling sdm with the zlib library may fail; the autoinstaller attempts to detect this and then compiles sdm without zlib support.
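If you want to inspect such files yourself, the same convention (the ".gz" ending decides) is easy to mirror in a script. A minimal Python sketch, with a hypothetical helper name:

```python
import gzip

def open_sequence_file(path):
    """Open a FASTA/FASTQ file as text, decompressing it if it ends in '.gz'."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")
```

Both `open_sequence_file("reads.fq")` and `open_sequence_file("reads.fq.gz")` then return a text handle that can be read line by line.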

What part of the sequence is cut?

Everything that is a remnant of technical processes is removed, if possible. E.g. if barcodes are given and found in the sequence, all sequence upstream of the barcodes is removed (including heterogeneity spacer and Illumina primer). If fwd and rev 16S amplification primers are provided in the mapping file (and they are found in a read), everything upstream of these is removed (including barcodes, het spacer etc.).

Should I keep unclassified OTUs (-keepUnclassified option)?

In general we do not recommend this, as these sequences could be environmental sequences that are not 16S (e.g. eukaryotic genomes contain regions with distant homology to frequently used 16S primer pairs). If you assume that you might have new phyla in your sample, or species very distant from known organisms, you can leave this option activated, but I would still recommend cross-checking, e.g. with NCBI BLAST, that an unknown OTU is not a random gene. This option is activated by default, because it was confusing when a large part of reads went silently missing.

My Barcodes are reverse complemented, can I set an option to take care of this?

This should be automatically detected: sdm has a built-in algorithm that checks, in the first 5000 sequences, whether more reverse-complemented BCs can be detected, and uses this information for the rest of the file. However, BCs have to be consistent in their direction, as the direction is assumed to be the same within each file.
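The idea behind such a check can be sketched in a few lines of Python. This is an illustration under simplifying assumptions (barcodes sit at the very start of the read, only exact matches count, and the orientation with more hits wins), not the algorithm as implemented in sdm; all names are hypothetical.

```python
from itertools import islice

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def barcodes_are_reversed(reads, barcodes, sample_size=5000):
    """Guess the barcode orientation from the first `sample_size` reads."""
    forward = set(barcodes)
    reverse = {reverse_complement(bc) for bc in barcodes}
    fwd_hits = rev_hits = 0
    for read in islice(reads, sample_size):
        if any(read.startswith(bc) for bc in forward):
            fwd_hits += 1
        elif any(read.startswith(bc) for bc in reverse):
            rev_hits += 1
    # The decision is made once and then applied to the whole file.
    return rev_hits > fwd_hits
```

Because a single decision is made per file, mixed orientations within one file cannot be resolved this way.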

I do not want to use RDP assigned taxonomies, but use reference databases. Should I use the SILVA or Greengenes 16S ref databases?

Both databases include a large selection of taxa, though SILVA has a faster release cycle and is currently more up to date; the last Greengenes (GG) release was in 2013. Also, SILVA includes LSU and eukaryotic (18S/28S) sequences, whereas Greengenes can only be used for bacterial SSU (16S) sequences.

How to choose a good cutoff length of sequences?

Changing TruncateSequenceLength and minSeqLength is a matter of fine-tuning to your data set; just remember to keep these two parameters equal. As a general rule of thumb, you want reads to be as long as possible, but every read shorter than the chosen length will be excluded. Further, the accumulated error has to stay below a threshold, e.g. 0.5 (parameter maxAccumulatedError), and longer reads accumulate more errors, so you have to find a good balance between read errors and sequence length. (All parameters are in the sdm_XXX.txt option files.)
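To get a feeling for this trade-off, the following Python sketch estimates the accumulated error of a read from its Phred quality scores. It assumes that the accumulated error is the sum of the per-base error probabilities 10^(-Q/10) over the truncated read; the exact computation and order of checks in sdm may differ, and the function names are hypothetical.

```python
def accumulated_error(quals):
    """Sum of per-base error probabilities for a list of Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)

def passes_filter(quals, truncate_length, max_accumulated_error=0.5):
    """Apply a length cutoff, then an accumulated-error cutoff.

    truncate_length stands for TruncateSequenceLength = minSeqLength.
    """
    if len(quals) < truncate_length:
        return False  # read is shorter than the cutoff and is excluded
    truncated = quals[:truncate_length]
    return accumulated_error(truncated) <= max_accumulated_error
```

Under this assumption, a 200 bp read with a uniform quality of Q30 accumulates about 0.2 expected errors and passes a 0.5 threshold, while the same length at Q20 accumulates about 2 and fails.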

How to further optimize my LotuS2 run?

First of all, you need to balance the number of sequences you gain against the number of errors you allow to pass into OTU building. This is mainly done in the sdm_opt* files; the files I provide on the website are just general-purpose suggestions.
Second, choose your clustering algorithm according to your needs. UPARSE is my general recommendation; some users have reported better clusterings in the usearch7x versions. SEED clusters are very sensitive, to a point where read errors could cluster into a new OTU, but if you need pseudo-strain resolution this might work for you. cd-hit clusters are very uniform, that is, there is no dynamic adaptation of identity depending on cluster size/shape, as UPARSE and swarm do. These are plain "good old clusters".
Third, think about what taxonomic assignments you need and from which database. RDP often provides a very robust assignment at genus level, but Greengenes/SILVA can allow annotations at species level.

LotuS2 does not yield any OTUs at species level with our samples, neither with the GG nor with the SLV database (but QIIME does). Why?

This depends on the environment you work in. For the gut environment, you should get a good fraction of OTUs assigned to species level; other environments, like Arctic samples, are often not well represented at species level.
In LotuS2 we avoid best-hit assignment (unless specified with the option -useBestBlastHitOnly). LotuS2 has a lowest common ancestor (LCA) algorithm that checks whether there are several hits of similar quality to different species of the same genus/family/class etc. It then moves up to the node that captures 95% of the hits (with some additional checks, e.g. whether reference sequences have a species assignment at all). Further, if the identity of the hit is below a certain threshold (set in the lotus.cfg file, e.g. 95% similarity to the database hit), the corresponding species, genus, family etc. label is not assigned. This is to prevent falsely assigned species names, even if this means returning a lot of genera without species assignments.
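The core of such an LCA step can be sketched in Python as follows. This is a simplified illustration, not the LotuS2 implementation: it only models the "node that captures 95% of hits" rule, ignores hit quality and the identity thresholds from lotus.cfg, and uses hypothetical names and rank labels.

```python
from collections import Counter

RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]

def lca_assign(hit_lineages, min_fraction=0.95):
    """Return the deepest lineage prefix shared by >= min_fraction of hits.

    Each lineage is a list of names ordered as in RANKS. Hits that lack a
    rank (e.g. no species label) count against agreement at that rank.
    """
    assigned = []
    for depth in range(len(RANKS)):
        prefixes = Counter(
            tuple(lineage[:depth + 1])
            for lineage in hit_lineages
            if len(lineage) > depth
        )
        if not prefixes:
            break
        prefix, count = prefixes.most_common(1)[0]
        if count / len(hit_lineages) < min_fraction:
            break  # hits disagree at this rank, so stop one level above
        assigned = list(prefix)
    return assigned
```

Two equally good hits to different species of the same genus therefore yield a genus-level assignment, whereas a best-hit approach would report one of the two species.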

Where exactly are reads removed during the LotuS2 run?

1st) during quality filtering and also dereplication (these reads are later counted into the OTU matrix by similarity comparison, but not used for OTU construction). 2nd) during the chimera detection steps. 3rd) (optional) unassigned OTUs and all associated reads can be removed (option -keepUnclassified 0). Additionally, if you used the -lulu 1 and -offtargetsdb options, the associated reads will be removed once the ASVs identified as contaminants are removed. The number of reads removed by each step can be tracked in the LotuSLogS files.

How should I process samples that were sequenced in different runs for error profiling with dada2?

If you use dada2 with multiple sequencing runs, in principle you need to run dada2 separately for each run. You need to define the sequencing run for each sample in your mapping file, in a column called “SequencingRun”. Please see the documentation for further information.
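To double-check that every sample ended up in the intended run, the mapping file can be summarized with a short script. The Python sketch below assumes a tab-delimited mapping file with a header line and the sample ID in the first column; only the “SequencingRun” column name is taken from the text above, the rest is hypothetical.

```python
import csv
import io
from collections import defaultdict

def samples_by_run(map_text, run_column="SequencingRun"):
    """Group sample IDs (first column) by the value in `run_column`."""
    reader = csv.reader(io.StringIO(map_text), delimiter="\t")
    header = next(reader)
    run_idx = header.index(run_column)  # raises ValueError if the column is missing
    groups = defaultdict(list)
    for row in reader:
        if row:  # skip empty lines
            groups[row[run_idx]].append(row[0])
    return dict(groups)
```

Calling `samples_by_run(open("mymap.txt").read())` returns a dictionary with one list of sample IDs per sequencing run.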

How can I cite LotuS2 in my work?

Cite:
Bedarf, J.R., Beraza, N., Khazneh, H. et al. Much ado about nothing? Off-target amplification can lead to false-positive bacterial brain microbiome detection in healthy and Parkinson’s disease individuals. Microbiome 9, 75 (2021). https://doi.org/10.1186/s40168-021-01012-1
The paper is available at the DOI given above.