baskerville.scripts package

Submodules

baskerville.scripts.hound_data module

baskerville.scripts.hound_data.curate_peaks(targets_df, out_dir, pool_width, crop_bp)[source]: Merge all peaks, round to nearest pool_width, and add cropped bp.

baskerville.scripts.hound_data.divide_contigs_chr(contigs, test_chrs, valid_chrs)[source]: Divide list of contigs into train/valid/test lists by chromosome.
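A chromosome-based split like the one described above can be sketched in a few lines. This is an illustration only, not the package's code: the `Contig` record and the `split_by_chromosome` name are hypothetical stand-ins, and the assumption is that any chromosome not listed for test or validation falls into the training set.

```python
from collections import namedtuple

# Hypothetical stand-in for the package's contig records.
Contig = namedtuple("Contig", ["chr", "start", "end"])


def split_by_chromosome(contigs, test_chrs, valid_chrs):
    """Assign each contig to train/valid/test based on its chromosome."""
    train, valid, test = [], [], []
    for ctg in contigs:
        if ctg.chr in test_chrs:
            test.append(ctg)
        elif ctg.chr in valid_chrs:
            valid.append(ctg)
        else:
            # Assumption: everything not held out is used for training.
            train.append(ctg)
    return train, valid, test


contigs = [
    Contig("chr1", 0, 1000),
    Contig("chr8", 0, 500),
    Contig("chr9", 0, 700),
]
train, valid, test = split_by_chromosome(
    contigs, test_chrs={"chr8"}, valid_chrs={"chr9"}
)
```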

baskerville.scripts.hound_data.divide_contigs_folds(contigs, folds)[source]: Divide list of contigs into cross fold lists.

baskerville.scripts.hound_data.divide_contigs_pct(contigs, test_pct, valid_pct, pct_abstain=0.2)[source]: Divide list of contigs into train/valid/test lists, aiming for the specified nucleotide percentages.
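To make "aiming for the specified nucleotide percentages" concrete, here is a minimal greedy sketch that works on contig lengths alone. It is not the package's algorithm: the function name is hypothetical, it ignores `pct_abstain` entirely, and it simply hands each contig (largest first) to whichever set is furthest below its nucleotide target.

```python
def split_by_fraction(lengths, test_pct, valid_pct):
    """Greedily assign contig indexes to sets by remaining nucleotide deficit."""
    total = sum(lengths)
    targets = {
        "test": test_pct * total,
        "valid": valid_pct * total,
        "train": (1 - test_pct - valid_pct) * total,
    }
    filled = {name: 0 for name in targets}
    assignment = {name: [] for name in targets}

    # Visit contigs from longest to shortest so large ones are placed first.
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    for i in order:
        # Pick the set that is currently furthest below its target.
        name = max(targets, key=lambda k: targets[k] - filled[k])
        assignment[name].append(i)
        filled[name] += lengths[i]
    return assignment


assignment = split_by_fraction([800, 100, 100], test_pct=0.1, valid_pct=0.1)
```

Because whole contigs are assigned, the achieved percentages only approximate the requested ones; the approximation improves as contigs get smaller relative to the genome.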

baskerville.scripts.hound_data.limit_contigs(contigs, filter_bed)[source]

Limit to contigs overlapping the given BED.

Args:
contigs: list of Contigs
filter_bed: BED file to filter by

Returns:
fcontigs: list of Contigs

baskerville.scripts.hound_data.main()[source]

baskerville.scripts.hound_data_align module

class baskerville.scripts.hound_data_align.GraphSeq(genome, net, chr, start, end)

Bases: tuple

chr: Alias for field number 2

end: Alias for field number 4

genome: Alias for field number 0

net: Alias for field number 1

start: Alias for field number 3

baskerville.scripts.hound_data_align.break_large_contigs(contigs, break_t, verbose=False)[source]: Break large contigs in half until all contigs are under the size threshold.
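The halving strategy can be sketched as follows. This is an illustration under assumptions, not the package's implementation: the `Contig` record and function name are hypothetical, and a contig is treated as small enough when its length is at most `max_len`.

```python
from collections import namedtuple

# Hypothetical stand-in for the package's contig records.
Contig = namedtuple("Contig", ["chr", "start", "end"])


def halve_until_small(contigs, max_len):
    """Split contigs at their midpoint until none is longer than max_len."""
    pending = list(contigs)
    done = []
    while pending:
        ctg = pending.pop()
        if ctg.end - ctg.start <= max_len:
            done.append(ctg)
        else:
            mid = (ctg.start + ctg.end) // 2
            # Both halves go back on the work list; they may need more splits.
            pending.append(Contig(ctg.chr, ctg.start, mid))
            pending.append(Contig(ctg.chr, mid, ctg.end))
    return sorted(done)


pieces = halve_until_small([Contig("chr1", 0, 1000)], max_len=300)
```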

baskerville.scripts.hound_data_align.connect_contigs(contigs, align_net_file, net_fill_min, net_olap_min, out_dir, genome_out_dirs)[source]: Connect contigs across genomes by forming a graph that includes net format aligning regions and contigs. Compute contig components as connected components of that graph.
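The connected-components step can be illustrated with a small union-find over a toy graph. The node labels and edges below are invented, and the helper name is hypothetical; the package builds its graph from net-format alignments and may well rely on a graph library rather than code like this.

```python
def connected_components(nodes, edges):
    """Group nodes into connected components using union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the root, compressing the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(group) for group in groups.values())


# Toy example: one human contig aligns to one mouse contig.
nodes = ["human_ctg1", "human_ctg2", "mouse_ctg1", "mouse_ctg2"]
edges = [("human_ctg1", "mouse_ctg1")]
components = connected_components(nodes, edges)
```

Keeping each component together when dividing data is what prevents aligned, homologous sequence from landing in both training and held-out sets.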

baskerville.scripts.hound_data_align.contig_stats_genome(contigs)[source]: Compute contig statistics within each genome.

baskerville.scripts.hound_data_align.divide_components_folds(contig_components, folds)[source]: Divide contig connected components into cross fold lists.

baskerville.scripts.hound_data_align.divide_components_pct(contig_components, test_pct, valid_pct, pct_abstain=0.5)[source]: Divide contig connected components into train/valid/test sets, aiming for the specified nucleotide percentages.

baskerville.scripts.hound_data_align.intersect_contigs_nets(graph_contigs_nets, genome_i, out_dir, genome_out_dir, min_olap=128)[source]: Intersect the contigs and nets from genome_i, adding the overlaps as edges to graph_contigs_nets.

baskerville.scripts.hound_data_align.main()[source]

baskerville.scripts.hound_data_align.make_net_graph(align_net_file, net_fill_min, out_dir)[source]: Construct a Graph with aligned net intervals connected by edges.

baskerville.scripts.hound_data_align.quantify_leakage(align_net_file, train_contigs, valid_contigs, test_contigs, out_dir)[source]: Quantify the leakage across sequence sets.

baskerville.scripts.hound_data_align.report_divide_stats(fold_contigs)[source]: Report genome-specific statistics about the division of contigs between sets.

baskerville.scripts.hound_data_align.report_divide_stats_v1(train_contigs, valid_contigs, test_contigs)[source]: Report genome-specific statistics about the division of contigs between train/valid/test sets.

baskerville.scripts.hound_data_read module

class baskerville.scripts.hound_data_read.CovFace(cov_file)[source]

Bases: object

close()[source]

preprocess_bed()[source]

read(chrm, start, end)[source]

baskerville.scripts.hound_data_read.interp_nan(x, kind='linear')[source]: Linearly interpolate to fill NaN.
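For readers unfamiliar with NaN filling, here is a dependency-free sketch of linear interpolation over a list of floats. It is not the package's code: the function name is hypothetical, only the linear case is covered, and NaNs before the first or after the last known value are filled with the nearest known value, which is an assumption about edge handling.

```python
import math


def fill_nan_linear(values):
    """Replace NaN entries by linear interpolation between known neighbors."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if not math.isnan(v)]
    if not known:
        return filled  # Nothing to interpolate from.

    for i, v in enumerate(values):
        if not math.isnan(v):
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]  # Assumed edge rule: hold nearest value.
        elif right is None:
            filled[i] = values[left]
        else:
            frac = (i - left) / (right - left)
            filled[i] = values[left] + frac * (values[right] - values[left])
    return filled


nan = float("nan")
smoothed = fill_nan_linear([1.0, nan, 3.0, nan, nan, 6.0])
```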

baskerville.scripts.hound_data_read.main()[source]

baskerville.scripts.hound_data_read.read_blacklist(blacklist_bed, black_buffer=20)[source]: Construct interval trees of blacklist regions for each chromosome.

baskerville.scripts.hound_data_write module

baskerville.scripts.hound_data_write.feature_bytes(values)[source]: Convert numpy arrays to bytes features.

baskerville.scripts.hound_data_write.feature_floats(values)[source]: Convert numpy arrays to floats features. Requires more space than bytes for float16.

baskerville.scripts.hound_data_write.fetch_dna(fasta_open, chrm, start, end)[source]: Fetch DNA when start/end may reach beyond chromosomes.
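One common way to handle coordinates that run past a chromosome's ends is to pad the missing positions with `N`. The sketch below shows that idea on a plain string; it is an assumption about the behavior rather than a description of the package's code, and the function name and string-based input are hypothetical (the real function takes an open FASTA handle).

```python
def fetch_padded(chrom_seq, start, end):
    """Return chrom_seq[start:end], using 'N' for out-of-range positions."""
    bases = []
    for pos in range(start, end):
        if 0 <= pos < len(chrom_seq):
            bases.append(chrom_seq[pos])
        else:
            # Position lies before the start or past the end of the chromosome.
            bases.append("N")
    return "".join(bases)


left_overhang = fetch_padded("ACGT", -2, 3)
right_overhang = fetch_padded("ACGT", 2, 6)
```

The returned sequence always has length `end - start`, so downstream code can rely on fixed-size inputs.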

baskerville.scripts.hound_data_write.main()[source]

baskerville.scripts.hound_data_write.rround(a, decimals)[source]: Round to the specified number of decimals, randomly sampling the last digit according to a bernoulli RV.

baskerville.scripts.hound_data_write.tround(a, decimals)[source]: Truncate to the specified number of decimals.
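The difference between the two rounding helpers above can be shown with scalar stand-ins. These are sketches, not the package's functions (which operate on arrays): the names are hypothetical, and truncation toward zero is an assumption. In stochastic rounding, the value is rounded up with probability equal to the discarded fraction, so the result is unbiased on average.

```python
import math
import random


def stochastic_round(x, decimals, rng=random):
    """Round x to `decimals` places, rounding up with probability = remainder."""
    scale = 10 ** decimals
    scaled = x * scale
    floor = math.floor(scaled)
    frac = scaled - floor  # Bernoulli success probability for rounding up.
    bump = 1 if rng.random() < frac else 0
    return (floor + bump) / scale


def truncate(x, decimals):
    """Drop digits beyond `decimals` places (assumed: toward zero)."""
    scale = 10 ** decimals
    return math.trunc(x * scale) / scale


always_down = truncate(1.29, 1)
either_way = stochastic_round(1.23, 1)  # 1.2 about 70% of the time, else 1.3
```

Truncation systematically underestimates positive values, whereas the stochastic version preserves the mean across many values at the cost of added noise.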

baskerville.scripts.hound_eval module

baskerville.scripts.hound_eval.main()[source]

baskerville.scripts.hound_eval_genes module

baskerville.scripts.hound_eval_spec module

baskerville.scripts.hound_eval_spec.main()[source]

baskerville.scripts.hound_isd_bed module

baskerville.scripts.hound_isd_bed.clip_float(x, dtype=<class 'numpy.float16'>)[source]

baskerville.scripts.hound_isd_bed.main()[source]

hound_isd_sed.py

Perform an in silico deletion mutagenesis of sequences in a BED file, where predictions are centered on the variant and SED/logSED scores can be calculated. Outputs a separate .h5 file for each .bed entry.

Usage:

hound_isd_sed.py [options] <params_file> <model_file> <bed_file>

Options:

-f <genome_fasta>: Genome FASTA for sequences
-g <genes_gtf>: GTF for gene definition
-o <out_dir>: Output directory
-p <processes>: Number of processes, passed by multi script
--rc: Ensemble forward and reverse complement predictions
--shifts: Ensemble prediction shifts
--span: Aggregate entire gene span
--stats: Comma-separated list of stats to save.
-s <del_size>: Deletion size for ISD
--target_genes: List of target genes in .tsv format, length must match input bed entries
-t <targets_file>: File specifying target indexes and labels in table format
--untransform_old: Untransform old models

baskerville.scripts.hound_isd_bed.make_del_bedt(coords, seq_len: int, del_size: int)[source]: Make a BedTool object for all SNP sequences, where seq_len considers cropping.

baskerville.scripts.hound_isd_bed.map_delseq_genes(coords, seq_len: int, del_size: int, transcriptome, model_stride: int, span: bool, majority_overlap: bool = True, intron1: bool = False)[source]

Intersect SNP sequences with gene exons, constructing a list mapping sequence indexes to dictionaries of gene_ids to their exon-overlapping positions in the sequence.

Parameters:

snps ([bvcf.SNP]) – SNP list.
seq_len (int) – Sequence length, after model cropping.
transcriptome (Transcriptome) – Transcriptome.
model_stride (int) – Model stride.
span (bool) – If True, use gene span instead of exons.
majority_overlap (bool) – If True, only consider bins for which the majority of the space overlaps an exon.
intron1 (bool) – If True, include intron bins adjacent to junctions.

baskerville.scripts.hound_ism_bed module

baskerville.scripts.hound_ism_bed.main()[source]

baskerville.scripts.hound_ism_snp module

baskerville.scripts.hound_ism_snp.main()[source]

baskerville.scripts.hound_predbed module

baskerville.scripts.hound_predbed.bigwig_open(bw_file, genome_file)[source]: Open the bigwig file for writing and write the header.

baskerville.scripts.hound_predbed.bigwig_write(signal, seq_coords, bw_file, genome_file, seq_crop=0)[source]

Write a signal track to a BigWig file over the region specified by seq_coords.

Args:
signal: Sequences x Length signal array
seq_coords: (chr,start,end)
bw_file: BigWig filename
genome_file: Chromosome lengths file
seq_crop: Sequence length cropped from each side of the sequence.

baskerville.scripts.hound_predbed.main()[source]

baskerville.scripts.hound_snp module

baskerville.scripts.hound_snp.main()[source]

baskerville.scripts.hound_snp_slurm module

baskerville.scripts.hound_snpgene module

baskerville.scripts.hound_snpgene.main()[source]

baskerville.scripts.hound_train module

baskerville.scripts.hound_train.main()[source]

baskerville.scripts.hound_transfer module

baskerville.scripts.hound_transfer.main()[source]