No explicit installation is required for LSC. You may copy the LSC binaries to any location as long as all the binaries (including Novoalign) are in the same directory or path.
However, Python 2.6 must be installed on your computer. The modules "numpy" and "scipy" are also required. Please see runLSC.py and run.cfg for more details.
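As a quick way to confirm the prerequisites above, the following minimal sketch tries to import each required module and reports any that are missing. The function name `missing_modules` is hypothetical and is not part of LSC; the code avoids newer language features so that it should also run on older interpreters.

```python
import sys


def missing_modules(names=("numpy", "scipy")):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Report the interpreter version and any module that failed to import.
    print("Python version: " + sys.version.split()[0])
    not_found = missing_modules()
    if not_found:
        print("Missing modules: " + ", ".join(not_found))
    else:
        print("numpy and scipy are importable.")
```

Run it with the same interpreter you intend to use for runLSC.py, since the modules must be installed for that interpreter.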
First, see the tutorial on how to use LSC with some example data.
In order to use LSC on your own data:
- Create an empty directory; this will be the working directory.
- Copy "run.cfg" from the LSC package to the working directory.
- Edit run.cfg to include paths to your data files and the paths of the temp folder and the output folder. You may also want to configure the default settings.
- Execute "/home/user/LSC_path/runLSC.py run.cfg" while in your working directory, or execute "runLSC.py run.cfg" if all LSC executable files are in your default bin directory.
- When execution finishes, you can find the results in the "output" directory.
"runLSC.py" is the main program in the LSC package. It calls the other modules to run the full error correction on your data. Output is written to the "output" folder. Details of the output are described in file formats. Its options are described in run.cfg. You only need to run "runLSC.py" with a configuration file "run.cfg", for example "/home/user/LSC_path/runLSC.py run.cfg":
- run.cfg: the .cfg file which defines the run parameters. For details, see .cfg format.
where "/home/user/LSC_path/" is the path of your LSC package. If you have put all LSC executable files in the default path, then you just need to run "runLSC.py run.cfg".
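The steps for running LSC on your own data can be sketched in Python as below. This is only an illustration under stated assumptions: the helper names `prepare_workdir` and `run_lsc` are hypothetical, "/home/user/LSC_path" is the placeholder path used in this document, and you still have to edit the copied run.cfg by hand before starting the run.

```python
import os
import shutil
import subprocess


def prepare_workdir(lsc_dir, workdir):
    """Create the working directory and copy run.cfg into it.

    An existing run.cfg in the working directory is left untouched so that
    manual edits are not overwritten. Returns the path of the copy.
    """
    os.makedirs(workdir, exist_ok=True)
    cfg_copy = os.path.join(workdir, "run.cfg")
    if not os.path.exists(cfg_copy):
        shutil.copy(os.path.join(lsc_dir, "run.cfg"), cfg_copy)
    return cfg_copy


def run_lsc(lsc_dir, workdir):
    """Run runLSC.py with run.cfg from inside the working directory."""
    # Use an absolute path because the command is started with cwd=workdir.
    script = os.path.join(os.path.abspath(lsc_dir), "runLSC.py")
    return subprocess.run([script, "run.cfg"], cwd=workdir, check=True)
```

A typical sequence would be `prepare_workdir("/home/user/LSC_path", "my_run")`, then editing "my_run/run.cfg", then `run_lsc("/home/user/LSC_path", "my_run")`. Keeping the copy step separate from the run step leaves room for the manual configuration in between.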
The output folder contains three files (corrected_LR_SR.map.fa, full_LR_SR.map.fa, and uncorrected_LR_SR.map.fa):
- corrected_LR_SR.map.fa: As long as there are short reads (SR) mapped to a long read, this long read can be corrected at the SR-covered regions (see the paper for more details). The sequence from the left-most SR-covered base to the right-most SR-covered base is written to this file.
- full_LR_SR.map.fa: Although the terminal sequences are uncorrected, they are concatenated with the corrected sequence (from corrected_LR_SR.map.fa) to form a "full" sequence. This sequence therefore covers the equivalent extent of the raw read.
- uncorrected_LR_SR.map.fa: This is the negative control. It contains the region from the left-most SR-covered base to the right-most SR-covered base (the region equivalent to corrected_LR_SR.map.fa) but without error correction. It therefore consists of fragments of the raw reads.
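To get a first impression of the three output files, a small FASTA summary can help. The sketch below counts records and bases in a FASTA file; the names `read_fasta` and `summarize` are hypothetical, and nothing here assumes a particular header layout in the LSC output.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(lines):
    """Return (number of records, total number of bases)."""
    records = 0
    bases = 0
    for _, sequence in read_fasta(lines):
        records += 1
        bases += len(sequence)
    return records, bases
```

For example, `with open("output/corrected_LR_SR.map.fa") as fh: print(summarize(fh))` prints the record and base counts for the corrected reads, assuming the output folder is named "output" relative to the current directory. Running the same call on all three files gives a rough comparison of how much sequence falls inside the SR-covered regions versus the full reads.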
1. Map the output reads to the reference genome or annotation (for RNA-seq analysis). Then, filter the reads by mapping score or percentage of base match (e.g. "identity" in BLAT).
2. If there is no reference genome or annotation, then map the short reads back to the output (corrected_LR_SR.map.fa or full_LR_SR.map.fa). Select the reads with high SR coverage.
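The second option, selecting reads by SR coverage, can be sketched as follows once you have the positions of the short reads mapped back to each output read. This is an assumption-laden illustration: the function names, the input layout (0-based, end-exclusive intervals per read), and the 0.9 threshold are all hypothetical and are not taken from LSC or the paper.

```python
def covered_fraction(read_length, intervals):
    """Fraction of a read covered by at least one interval.

    `intervals` holds (start, end) pairs, 0-based and end-exclusive.
    Overlapping intervals are counted only once.
    """
    if read_length <= 0:
        return 0.0
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        start = max(start, current_end, 0)
        end = min(end, read_length)
        if end > start:
            covered += end - start
            current_end = end
    return covered / float(read_length)


def select_reads(reads, min_fraction=0.9):
    """Return sorted names of reads whose covered fraction meets the threshold.

    `reads` maps a read name to a (read_length, intervals) pair.
    """
    return sorted(
        name
        for name, (length, intervals) in reads.items()
        if covered_fraction(length, intervals) >= min_fraction
    )
```

The threshold is a choice you would tune to your data; a stricter value keeps fewer but better supported reads.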
You need two Novoalign modules: novoalign and novoindex. We recommend the version "Novoalign V2.07.10".
Note that Novoalign is proprietary software, so we cannot distribute it with LSC. However, if you are licensed to use Novoalign, contact us and we can email a copy to you.
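To check that both Novoalign modules can be found before starting a long run, a lookup along these lines may be useful. The helper name `find_tools` is hypothetical; it relies on the standard library function `shutil.which`, which needs Python 3.3 or later, so it is meant as a standalone check rather than something to run under Python 2.6.

```python
import shutil


def find_tools(names=("novoalign", "novoindex"), directory=None):
    """Map each tool name to its resolved path, or None if it is not found.

    With `directory` set, only that directory is searched; otherwise the
    PATH environment variable is used.
    """
    return {name: shutil.which(name, path=directory) for name in names}


if __name__ == "__main__":
    for name, location in sorted(find_tools().items()):
        print(name + ": " + (location or "NOT FOUND"))
```

Passing `directory="/home/user/LSC_path"` checks the case where the Novoalign binaries were copied next to the LSC binaries instead of being placed on the path.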
The following execution times are rough estimates based on the running times on our servers with eight threads. These figures will differ greatly depending on your system configuration.
- 200,000 PacBio long reads X 64 million 75bp Illumina short reads - 10 hours
We expect this to be faster than similar tools.