baskerville.scripts package
Submodules
baskerville.scripts.hound_data module
- baskerville.scripts.hound_data.curate_peaks(targets_df, out_dir, pool_width, crop_bp)[source]
Merge all peaks, round to nearest pool_width, and add cropped bp.
- baskerville.scripts.hound_data.divide_contigs_chr(contigs, test_chrs, valid_chrs)[source]
Divide list of contigs into train/valid/test lists by chromosome.
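A minimal sketch of a chromosome-based split, assuming each contig exposes a `chr` attribute. The `Contig` tuple and the `split_by_chr` name are hypothetical stand-ins for illustration, not the package's own implementation.

```python
from collections import namedtuple

# Hypothetical contig record; the package defines its own contig type.
Contig = namedtuple("Contig", ["chr", "start", "end"])


def split_by_chr(contigs, test_chrs, valid_chrs):
    """Assign each contig to train/valid/test by its chromosome name."""
    test_chrs, valid_chrs = set(test_chrs), set(valid_chrs)
    train, valid, test = [], [], []
    for ctg in contigs:
        if ctg.chr in test_chrs:
            test.append(ctg)
        elif ctg.chr in valid_chrs:
            valid.append(ctg)
        else:
            # Any chromosome not held out goes to training.
            train.append(ctg)
    return train, valid, test


contigs = [Contig("chr1", 0, 1000), Contig("chr8", 0, 500), Contig("chr9", 0, 700)]
train, valid, test = split_by_chr(contigs, test_chrs=["chr8"], valid_chrs=["chr9"])
```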
- baskerville.scripts.hound_data.divide_contigs_folds(contigs, folds)[source]
Divide list of contigs into cross fold lists.
- baskerville.scripts.hound_data.divide_contigs_pct(contigs, test_pct, valid_pct, pct_abstain=0.2)[source]
Divide list of contigs into train/valid/test lists, aiming for the specified nucleotide percentages.
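One way to aim for target nucleotide percentages is a greedy assignment: visit contigs from largest to smallest and give each one to the set that is furthest below its target. The sketch below is an illustration under that assumption only. The `split_by_pct` name is hypothetical, it works on plain contig lengths, and it does not model the `pct_abstain` parameter.

```python
def split_by_pct(contig_lengths, test_pct, valid_pct):
    """Greedily assign contig indexes to train/valid/test by nucleotide count."""
    total = sum(contig_lengths)
    targets = {
        "train": total * (1 - test_pct - valid_pct),
        "valid": total * valid_pct,
        "test": total * test_pct,
    }
    sizes = {name: 0 for name in targets}
    assigned = {name: [] for name in targets}

    # Visit contigs from largest to smallest.
    order = sorted(range(len(contig_lengths)), key=lambda i: -contig_lengths[i])
    for i in order:
        # Pick the set with the largest remaining nucleotide deficit.
        name = max(targets, key=lambda k: targets[k] - sizes[k])
        assigned[name].append(i)
        sizes[name] += contig_lengths[i]
    return assigned, sizes


lengths = [500, 300, 100, 50, 50]
assigned, sizes = split_by_pct(lengths, test_pct=0.1, valid_pct=0.1)
```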
baskerville.scripts.hound_data_align module
- class baskerville.scripts.hound_data_align.GraphSeq(genome, net, chr, start, end)
Bases: tuple
- chr
Alias for field number 2
- end
Alias for field number 4
- genome
Alias for field number 0
- net
Alias for field number 1
- start
Alias for field number 3
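The field aliases above are the standard behavior of a named tuple: each field can be read by name or by position. The following sketch rebuilds an equivalent type with `collections.namedtuple`; the field values are illustrative assumptions, since the expected types are not documented here.

```python
from collections import namedtuple

# Equivalent definition, reconstructed from the documented field order.
GraphSeq = namedtuple("GraphSeq", ["genome", "net", "chr", "start", "end"])

# Illustrative values only; the real value types are an assumption.
node = GraphSeq(genome=0, net=False, chr="chr1", start=1000, end=2000)

# Name-based and index-based access refer to the same fields.
same_chr = node.chr == node[2]
length = node.end - node.start
```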
- baskerville.scripts.hound_data_align.break_large_contigs(contigs, break_t, verbose=False)[source]
Break large contigs in half until all contigs are under the size threshold.
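The halving procedure can be sketched as a simple work list: pop a contig, split it at its midpoint if it is too long, and push both halves back. The `Contig` tuple and `break_in_half` name are hypothetical, and this sketch treats a contig of exactly `break_t` as acceptable, which may differ from the package's behavior.

```python
from collections import namedtuple

# Hypothetical contig record; the package defines its own contig type.
Contig = namedtuple("Contig", ["chr", "start", "end"])


def break_in_half(contigs, break_t):
    """Split contigs at their midpoints until none is longer than break_t."""
    if break_t < 1:
        raise ValueError("break_t must be at least 1")
    finished = []
    pending = list(contigs)
    while pending:
        ctg = pending.pop()
        length = ctg.end - ctg.start
        if length > break_t:
            mid = ctg.start + length // 2
            # Re-queue both halves; they may need further splitting.
            pending.append(Contig(ctg.chr, ctg.start, mid))
            pending.append(Contig(ctg.chr, mid, ctg.end))
        else:
            finished.append(ctg)
    return sorted(finished)


pieces = break_in_half([Contig("chr1", 0, 1000)], break_t=300)
```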
- baskerville.scripts.hound_data_align.connect_contigs(contigs, align_net_file, net_fill_min, net_olap_min, out_dir, genome_out_dirs)[source]
Connect contigs across genomes by forming a graph that includes net format aligning regions and contigs. Compute contig components as connected components of that graph.
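The idea of grouping contigs by connected components can be illustrated with a breadth-first search over an undirected graph. This is a sketch only: the node labels, the edge list, and the `contig_components` name are hypothetical, and the package may well rely on a graph library rather than hand-written traversal.

```python
from collections import defaultdict, deque


def contig_components(contigs, edges):
    """Group contigs that are linked, directly or through alignment nodes."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    contig_set = set(contigs)
    seen = set()
    components = []
    for ctg in contigs:
        if ctg in seen:
            continue
        seen.add(ctg)
        queue = deque([ctg])
        members = []
        while queue:
            node = queue.popleft()
            # Keep only contig nodes; alignment nodes just carry connectivity.
            if node in contig_set:
                members.append(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        components.append(sorted(members))
    return components


# Two contigs from different genomes joined through an aligned net region.
contigs = ["hg38:chr1:0-1000", "mm10:chr4:0-800", "hg38:chr2:0-500"]
edges = [
    ("hg38:chr1:0-1000", "net1_hg38"),
    ("net1_hg38", "net1_mm10"),
    ("net1_mm10", "mm10:chr4:0-800"),
]
components = contig_components(contigs, edges)
```

Contigs that share a component would then be kept in the same train/valid/test set, so that aligned sequence does not cross set boundaries.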
- baskerville.scripts.hound_data_align.contig_stats_genome(contigs)[source]
Compute contig statistics within each genome.
- baskerville.scripts.hound_data_align.divide_components_folds(contig_components, folds)[source]
Divide contig connected components into cross fold lists.
- baskerville.scripts.hound_data_align.divide_components_pct(contig_components, test_pct, valid_pct, pct_abstain=0.5)[source]
Divide contig connected components into train/valid/test lists, aiming for the specified nucleotide percentages.
- baskerville.scripts.hound_data_align.intersect_contigs_nets(graph_contigs_nets, genome_i, out_dir, genome_out_dir, min_olap=128)[source]
Intersect the contigs and nets from genome_i, adding the overlaps as edges to graph_contigs_nets.
- baskerville.scripts.hound_data_align.make_net_graph(align_net_file, net_fill_min, out_dir)[source]
Construct a Graph with aligned net intervals connected by edges.
- baskerville.scripts.hound_data_align.quantify_leakage(align_net_file, train_contigs, valid_contigs, test_contigs, out_dir)[source]
Quantify the leakage across sequence sets.
baskerville.scripts.hound_data_read module
baskerville.scripts.hound_data_write module
- baskerville.scripts.hound_data_write.feature_bytes(values)[source]
Convert numpy arrays to bytes features.
- baskerville.scripts.hound_data_write.feature_floats(values)[source]
Convert numpy arrays to float features. Requires more space than bytes for float16 arrays.
- baskerville.scripts.hound_data_write.fetch_dna(fasta_open, chrm, start, end)[source]
Fetch DNA when start/end may reach beyond chromosomes.
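A common way to handle coordinates that run past a chromosome boundary is to pad the missing positions with `N`. The sketch below shows that approach on a plain string. The `fetch_padded` name is hypothetical, and both the padding character and the use of a string instead of an open FASTA handle are assumptions made for illustration.

```python
def fetch_padded(chrom_seq, start, end):
    """Return chrom_seq[start:end], padding out-of-range positions with N."""
    n = len(chrom_seq)
    # Positions before the chromosome start.
    left_pad = max(0, min(end, 0) - start)
    # Positions past the chromosome end.
    right_pad = max(0, end - max(start, n))
    # In-range part of the request (empty if entirely out of range).
    core = chrom_seq[max(start, 0):max(min(end, n), 0)]
    return "N" * left_pad + core + "N" * right_pad


chrom = "ACGTACGTAC"  # toy 10 bp chromosome
left = fetch_padded(chrom, -2, 3)
right = fetch_padded(chrom, 8, 12)
```

With this scheme the returned sequence always has length `end - start`, which keeps downstream one-hot encoding shapes fixed.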
baskerville.scripts.hound_eval module
baskerville.scripts.hound_eval_genes module
baskerville.scripts.hound_eval_spec module
baskerville.scripts.hound_isd_bed module
- baskerville.scripts.hound_isd_bed.main()[source]
hound_isd_bed.py
Perform an in silico deletion mutagenesis of sequences in a BED file, where predictions are centered on the variant and SED/logSED scores can be calculated. Outputs a separate .h5 file for each .bed entry.
- Usage:
hound_isd_bed.py [options] <params_file> <model_file> <bed_file>
- Options:
- -f <genome_fasta>
Genome FASTA for sequences
- -g <genes_gtf>
GTF for gene definition
- -o <out_dir>
Output directory
- -p <processes>
Number of processes, passed by multi script
- --rc
Ensemble forward and reverse complement predictions
- --shifts
Ensemble prediction shifts
- --span
Aggregate entire gene span
- --stats
Comma-separated list of stats to save.
- -s <del_size>
Deletion size for ISD
- --target_genes
List of target genes in .tsv format, length must match input bed entries
- -t <targets_file>
File specifying target indexes and labels in table format
- --untransform_old
Untransform old models
- baskerville.scripts.hound_isd_bed.make_del_bedt(coords, seq_len: int, del_size: int)[source]
Make a BedTool object for all SNP sequences, where seq_len considers cropping.
- baskerville.scripts.hound_isd_bed.map_delseq_genes(coords, seq_len: int, del_size: int, transcriptome, model_stride: int, span: bool, majority_overlap: bool = True, intron1: bool = False)[source]
Intersect SNP sequences with gene exons, constructing a list mapping sequence indexes to dictionaries of gene_ids to their exon-overlapping positions in the sequence.
- Parameters:
snps ([bvcf.SNP]) – SNP list.
seq_len (int) – Sequence length, after model cropping.
transcriptome (Transcriptome) – Transcriptome.
model_stride (int) – Model stride.
span (bool) – If True, use gene span instead of exons.
majority_overlap (bool) – If True, only consider bins for which the majority of the space overlaps an exon.
intron1 (bool) – If True, include intron bins adjacent to junctions.
baskerville.scripts.hound_ism_bed module
baskerville.scripts.hound_ism_snp module
baskerville.scripts.hound_predbed module
- baskerville.scripts.hound_predbed.bigwig_open(bw_file, genome_file)[source]
Open the bigwig file for writing and write the header.
- baskerville.scripts.hound_predbed.bigwig_write(signal, seq_coords, bw_file, genome_file, seq_crop=0)[source]
Write a signal track to a BigWig file over the region specified by seq_coords.
- Parameters:
signal – Sequences x Length signal array.
seq_coords – (chr, start, end).
bw_file – BigWig filename.
genome_file – Chromosome lengths file.
seq_crop – Sequence length cropped from each side of the sequence.