add groovy function to validate blastn database format and busco line… #204
base: dev
Warning: Newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.2.1. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.
muffato
left a comment
Really nice use of Groovy!
Usually the .nf files contain:
- imports
- workflow
- functions
Could you move the functions after the workflow block?
subworkflows/local/input_check.nf
Outdated
    // Direct file provided - validate it's a .nal file
    if (path_file.name.endsWith('.nal')) {
        log.info "Using directly specified BLAST database: ${path_file}"
        return path_file
Returning the path here leads to some problems downstream because other processes expect a directory.
For instance, if you run the test profile with --blastn /data/tol/resources/nt/latest/nt.nal, generate_config.py will fail:
Command error:
  Traceback (most recent call last):
    File "/nfs/users/nfs_m/mm49/workspace/tol-it/nextflow/sanger-tol/blobtoolkit_param/bin/generate_config.py", line 399, in <module>
      sys.exit(main())
    File "/nfs/users/nfs_m/mm49/workspace/tol-it/nextflow/sanger-tol/blobtoolkit_param/bin/generate_config.py", line 377, in main
      taxon_id = adjust_taxon_id(args.nt, taxon_info)
    File "/nfs/users/nfs_m/mm49/workspace/tol-it/nextflow/sanger-tol/blobtoolkit_param/bin/generate_config.py", line 236, in adjust_taxon_id
      con = sqlite3.connect(os.path.join(nt, "taxonomy4blast.sqlite3"))
  sqlite3.OperationalError: unable to open database file
generate_config.py could be modified to look for taxonomy4blast.sqlite3 in the directory that contains nt.nal, but I suspect other things will fail too.
In particular, blastn needs the entire directory to be staged in. I fear it won't work if the channel only has the file path.
OK, so the function should accept direct .nal file paths and return the parent directory, to ensure all associated files are available. I have added validation logic to check that the file exists before proceeding. Testing now.
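To make the intended behaviour concrete, here is a minimal sketch in plain Groovy, outside Nextflow. The helper name `resolveBlastnDir` is hypothetical and is not the function from this PR; it only illustrates "accept a directory or a .nal file, check it exists, return the directory to stage".

```groovy
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.Paths

// Hypothetical helper (not the PR's function): accepts either a database
// directory or a direct .nal file and returns the directory to stage.
Path resolveBlastnDir(String dbPath) {
    Path p = Paths.get(dbPath)
    // Fail early if the path does not exist at all
    if (!Files.exists(p)) {
        throw new IllegalArgumentException("BLAST database path does not exist: " + dbPath)
    }
    // Directory input: returned unchanged (backwards-compatible behaviour)
    if (Files.isDirectory(p)) {
        return p
    }
    // Direct .nal file: return its parent so sibling files are staged too
    if (p.fileName.toString().endsWith('.nal')) {
        return p.toAbsolutePath().parent
    }
    throw new IllegalArgumentException("Expected a directory or a .nal file: " + dbPath)
}
```

With this shape, downstream steps that expect a directory (such as the taxonomy4blast.sqlite3 lookup in the traceback above) would keep receiving one.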
     * Function to validate and resolve BLAST nucleotide database paths
     * Handles both directory paths (for backwards compatibility) and direct .nal file paths
     */
    def validateBlastnDatabase(db_path) {
There's still a problem with the way the function is used. When people mix multiple Blast databases under the same directory (like in #184), yes, they can now refer to one .nal specifically, but if only the parent directory makes it to the blastn job, the module will then still pick up all the .nal files.
I think there should be another parameter coming out of the function, to record the name of the .nal file, and pass it down to the module.
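One possible shape for this, as a sketch only: the function returns both the directory and the database name, so the name can be passed down to the module. The helper name `resolveBlastnDatabase`, the map keys `dir` and `name`, and the convention of a null name for directory input are all assumptions made for illustration.

```groovy
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.Paths

// Hypothetical helper: returns the directory to stage plus the database name.
// For a directory input the name is left null (assumed convention), so the
// caller can keep its existing "pick up whatever is there" behaviour.
Map resolveBlastnDatabase(String dbPath) {
    Path p = Paths.get(dbPath).toAbsolutePath()
    if (Files.isDirectory(p)) {
        return [dir: p, name: null]
    }
    String fileName = p.fileName.toString()
    if (Files.isRegularFile(p) && fileName.endsWith('.nal')) {
        // Strip the .nal suffix to get the database name, e.g. nt.nal -> nt
        return [dir: p.parent, name: fileName.replaceAll('\\.nal$', '')]
    }
    throw new IllegalArgumentException("Not a directory or .nal file: " + dbPath)
}
```

The module could then build its database argument from `dir` and `name` instead of scanning the directory for .nal files, which is what avoids picking up sibling databases.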
The db_name parameter is now implemented and flows from the validation function to the BLAST module.
muffato
left a comment
How did you test the BLAST db symlinking? I'm getting this error in my tests:
ERROR ~ Unknown method invocation `createLink` on UnixPath type
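The error suggests the path object has no `createLink` method. One alternative that does exist in the JDK is `java.nio.file.Files.createSymbolicLink(link, target)`. The sketch below uses it to link the files of a single database into a target directory. The helper name `linkDatabaseFiles` is hypothetical, and so is the assumption that every file of a database is named `<dbName>.<something>`; shared files such as taxonomy4blast.sqlite3 would need separate handling.

```groovy
import java.nio.file.Files
import java.nio.file.LinkOption
import java.nio.file.Path

// Hypothetical helper: symlink every file of one database into targetDir.
// Assumes all files of a database are named "<dbName>.<something>".
List<Path> linkDatabaseFiles(Path sourceDir, String dbName, Path targetDir) {
    // Throws an IOException if the target directory cannot be created
    Files.createDirectories(targetDir)
    List<Path> links = []
    // listFiles() returns null if sourceDir is not a readable directory
    File[] files = sourceDir.toFile().listFiles() ?: new File[0]
    files.each { File f ->
        if (f.isFile() && f.name.startsWith(dbName + '.')) {
            Path link = targetDir.resolve(f.name)
            // Do not follow links here, so a stale link is detected as existing
            if (!Files.exists(link, LinkOption.NOFOLLOW_LINKS)) {
                Files.createSymbolicLink(link, f.toPath().toAbsolutePath())
            }
            links << link
        }
    }
    return links
}
```

This only addresses the missing method. Whether the links are resolved when the directory is staged into a task is a separate question that the sketch does not answer.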
    def db_name = path_file.name.replaceAll('\\.nal$', '')

    // Create a temporary directory with symlinks to only the specified database files
    def temp_dir = file("${parent_dir}/.btk_isolated_${db_name}")
Unfortunately we may not always have write access to that directory (for instance, in production we don't – /data/tol/resources/nt).
Maybe we can use the work dir (I think there's a Nextflow variable that tells what the work dir is). But then, my concern is that the symlinks may not be resolved by Nextflow, and the files may not be mounted in the container.
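As a sketch of the first half of that idea, the isolated directory could be derived from a base directory that is checked for write access up front, rather than from the database's parent directory. The helper name `isolatedDirFor` is hypothetical. In Nextflow the base would presumably be the work directory (I believe that is `workflow.workDir`, but treat that as an assumption); here it is just a parameter.

```groovy
import java.nio.file.Files
import java.nio.file.Path

// Hypothetical helper: derive the isolated directory under a base directory
// that must already exist and be writable (for example the work directory).
Path isolatedDirFor(Path baseDir, String dbName) {
    if (!Files.isWritable(baseDir)) {
        throw new IllegalStateException("No write access to " + baseDir)
    }
    // Same naming scheme as in the diff, but rooted at baseDir
    return baseDir.resolve('.btk_isolated_' + dbName)
}
```

This does not address the second concern: whether symlinks pointing outside the work directory are resolved and mounted in the container would still need to be tested.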
    // Create a temporary directory with symlinks to only the specified database files
    def temp_dir = file("${parent_dir}/.btk_isolated_${db_name}")
    if (!temp_dir.exists()) {
        temp_dir.mkdirs()
Oddly, this doesn't seem to complain if the directory can't be created. It just carries on (and the pipeline fails later):
$ nextflow run -profile sanger,singularity,test --blastn /data/tol/resources/nt/latest/nt.nal
(...)
Direct BLAST database file specified: /data/tol/resources/nt/latest/nt.nal
Database name: nt
Created isolated directory: /data/tol/resources/nt/latest/.btk_isolated_nt
This ensures only the specified database is available to BLAST
(...)
Execution cancelled -- Finishing pending tasks before exit
-[sanger-tol/blobtoolkit] Pipeline completed with errors-
(...)
Command error:
(...)
File "/nfs/users/nfs_m/mm49/workspace/tol-it/nextflow/sanger-tol/blobtoolkit_param/bin/generate_config.py", line 230, in adjust_taxon_id
con = sqlite3.connect(os.path.join(nt, "taxonomy4blast.sqlite3"))
sqlite3.OperationalError: unable to open database file
(...)
$ ls -ld /data/tol/resources/nt/latest/.btk_isolated_nt
ls: cannot access '/data/tol/resources/nt/latest/.btk_isolated_nt': No such file or directory
Co-authored-by: Matthieu Muffato <[email protected]>
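A likely explanation for the silent failure reported in the review comment above is that `mkdirs()` signals failure through its boolean return value rather than an exception. That is how `java.io.File.mkdirs()` behaves, and I assume the method used in the diff behaves the same way. A minimal sketch of a stricter alternative, using a hypothetical helper `createDirOrFail` built on `Files.createDirectories`, which throws:

```groovy
import java.nio.file.Files
import java.nio.file.Path

// Hypothetical helper: create a directory (and any missing parents) and
// fail immediately, with a clear message, if that is not possible.
Path createDirOrFail(Path dir) {
    try {
        // Unlike File.mkdirs(), this throws instead of returning false
        return Files.createDirectories(dir)
    } catch (IOException e) {
        throw new IllegalStateException("Cannot create directory " + dir + ": " + e.message, e)
    }
}
```

Checking the return value of `mkdirs()` before logging "Created isolated directory" would be the smaller change; either way the pipeline would stop at the point of failure instead of later in generate_config.py.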
…ages db
PR checklist
- nf-core pipelines lint
- nextflow run . -profile test,docker --outdir <OUTDIR>
- nextflow run . -profile debug,test,docker --outdir <OUTDIR>
- docs/usage.md is updated.
- docs/output.md is updated.
- CHANGELOG.md is updated.
- README.md is updated (including new tool citations and authors/contributors).