baskerville.scripts package

Submodules

baskerville.scripts.hound_data module

baskerville.scripts.hound_data.curate_peaks(targets_df, out_dir, pool_width, crop_bp)[source]

Merge all peaks, round to nearest pool_width, and add cropped bp.

baskerville.scripts.hound_data.divide_contigs_chr(contigs, test_chrs, valid_chrs)[source]

Divide list of contigs into train/valid/test lists by chromosome.
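As a rough illustration of chromosome-based splitting, the sketch below assigns each contig to a set by looking up its chromosome name. The `Contig` tuple and the helper name `divide_by_chr` are hypothetical stand-ins, not the package's own definitions, and the return order is an assumption taken from the docstring.

```python
from collections import namedtuple

# Hypothetical stand-in for the package's contig record.
Contig = namedtuple("Contig", ["chr", "start", "end"])


def divide_by_chr(contigs, test_chrs, valid_chrs):
    """Assign each contig to train/valid/test by its chromosome name."""
    train, valid, test = [], [], []
    for ctg in contigs:
        if ctg.chr in test_chrs:
            test.append(ctg)
        elif ctg.chr in valid_chrs:
            valid.append(ctg)
        else:
            # Anything not explicitly held out goes to training.
            train.append(ctg)
    return train, valid, test


contigs = [
    Contig("chr1", 0, 1000),
    Contig("chr8", 0, 500),
    Contig("chr9", 100, 900),
]
train, valid, test = divide_by_chr(contigs, test_chrs={"chr8"}, valid_chrs={"chr9"})
```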

baskerville.scripts.hound_data.divide_contigs_folds(contigs, folds)[source]

Divide list of contigs into cross-validation fold lists.

baskerville.scripts.hound_data.divide_contigs_pct(contigs, test_pct, valid_pct, pct_abstain=0.2)[source]

Divide list of contigs into train/valid/test lists, aiming for the specified nucleotide percentages.
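One simple way to aim for nucleotide percentages is a greedy pass over contigs from largest to smallest, giving each one to whichever set is furthest below its target. The sketch below shows that idea only. It is not the library's algorithm, it does not model `pct_abstain`, and the helper name `divide_by_pct` is hypothetical.

```python
def divide_by_pct(lengths, test_pct, valid_pct):
    """Greedily split contig indices so set sizes approach the targets.

    lengths: contig lengths in nucleotides.
    Returns a dict mapping set name to a list of contig indices.
    """
    total = sum(lengths)
    targets = {
        "train": total * (1 - test_pct - valid_pct),
        "valid": total * valid_pct,
        "test": total * test_pct,
    }
    filled = {name: 0 for name in targets}
    sets = {name: [] for name in targets}

    # Place the largest contigs first; small ones fill the remaining gaps.
    for i in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        # Pick the set with the largest remaining nucleotide deficit.
        name = max(targets, key=lambda k: targets[k] - filled[k])
        sets[name].append(i)
        filled[name] += lengths[i]
    return sets


lengths = [500, 300, 100, 50, 50]
split = divide_by_pct(lengths, test_pct=0.1, valid_pct=0.1)
```

A randomized assignment would avoid always sending the largest contigs to the training set; the deterministic version is used here only to keep the example reproducible.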

baskerville.scripts.hound_data.limit_contigs(contigs, filter_bed)[source]

Limit to contigs overlapping the given BED.

Args

contigs: list of Contigs

filter_bed: BED file to filter by

Returns:

fcontigs: list of Contigs

baskerville.scripts.hound_data.main()[source]

baskerville.scripts.hound_data_align module

class baskerville.scripts.hound_data_align.GraphSeq(genome, net, chr, start, end)

Bases: tuple

chr

Alias for field number 2

end

Alias for field number 4

genome

Alias for field number 0

net

Alias for field number 1

start

Alias for field number 3
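The field numbers above correspond to a five-field named tuple, so each field can be read by name or by position. The snippet below rebuilds an equivalent tuple locally for illustration; the field values are made up, and the meaning assigned to `net` here is an assumption.

```python
from collections import namedtuple

# Local re-creation with the same field order as documented above.
GraphSeq = namedtuple("GraphSeq", ["genome", "net", "chr", "start", "end"])

# Illustrative values only: genome index 0, not a net node, chr1:1000-2000.
node = GraphSeq(genome=0, net=False, chr="chr1", start=1000, end=2000)

# Positional and named access refer to the same fields.
length = node.end - node.start
same_field = node[2] == node.chr
```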

baskerville.scripts.hound_data_align.break_large_contigs(contigs, break_t, verbose=False)[source]

Break large contigs in half until all contigs are under the size threshold.
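The halving procedure can be sketched with a work stack: pop a contig, split it at its midpoint if it is too long, and push both halves back. Contigs are plain `(chr, start, end)` tuples here, the helper name `break_large` is hypothetical, and whether the threshold comparison is inclusive in the real function is an assumption.

```python
def break_large(contigs, break_t):
    """Halve contigs repeatedly until none is longer than break_t."""
    stack = list(contigs)
    out = []
    while stack:
        chrom, start, end = stack.pop()
        if end - start > break_t:
            # Split at the midpoint and re-examine both halves.
            mid = start + (end - start) // 2
            stack.append((chrom, mid, end))
            stack.append((chrom, start, mid))
        else:
            out.append((chrom, start, end))
    return sorted(out)


pieces = break_large([("chr1", 0, 1000)], break_t=300)
```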

baskerville.scripts.hound_data_align.connect_contigs(contigs, align_net_file, net_fill_min, net_olap_min, out_dir, genome_out_dirs)[source]

Connect contigs across genomes by forming a graph that includes net format aligning regions and contigs. Compute contig components as connected components of that graph.
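To make the graph idea concrete, the sketch below links contigs to aligned net intervals, links net intervals across genomes, and then collects connected components with a breadth-first search. Node names, the `net:` prefix, and the helper `connected_components` are all hypothetical; the real function may rely on a graph library and handles files, overlap thresholds, and output directories that are omitted here.

```python
from collections import defaultdict, deque


def connected_components(nodes, edges):
    """Return connected components of an undirected graph as lists of nodes."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    seen = set()
    components = []
    for node in nodes:
        if node in seen:
            continue
        # Breadth-first search from each unvisited node.
        component = []
        queue = deque([node])
        seen.add(node)
        while queue:
            u = queue.popleft()
            component.append(u)
            for v in sorted(adj[u]):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(component)
    return components


# Two genomes: contig A overlaps net interval h1, which aligns to m1,
# which overlaps contig B. Contig C has no aligned partner.
nodes = ["hg38:ctgA", "mm10:ctgB", "hg38:ctgC", "net:h1", "net:m1"]
edges = [
    ("hg38:ctgA", "net:h1"),
    ("net:h1", "net:m1"),
    ("net:m1", "mm10:ctgB"),
]

# Keep only contig nodes in each component; net nodes are just connectors.
contig_components = [
    [n for n in comp if not n.startswith("net:")]
    for comp in connected_components(nodes, edges)
]
contig_components = [comp for comp in contig_components if comp]
```

Assigning whole components to a single train/valid/test set keeps aligned sequences from different genomes on the same side of the split.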

baskerville.scripts.hound_data_align.contig_stats_genome(contigs)[source]

Compute contig statistics within each genome.

baskerville.scripts.hound_data_align.divide_components_folds(contig_components, folds)[source]

Divide contig connected components into cross-validation fold lists.

baskerville.scripts.hound_data_align.divide_components_pct(contig_components, test_pct, valid_pct, pct_abstain=0.5)[source]

Divide contig connected components into train/valid/test sets, aiming for the specified nucleotide percentages.

baskerville.scripts.hound_data_align.intersect_contigs_nets(graph_contigs_nets, genome_i, out_dir, genome_out_dir, min_olap=128)[source]

Intersect the contigs and nets from genome_i, adding the overlaps as edges to graph_contigs_nets.

baskerville.scripts.hound_data_align.main()[source]

baskerville.scripts.hound_data_align.make_net_graph(align_net_file, net_fill_min, out_dir)[source]

Construct a Graph with aligned net intervals connected by edges.

baskerville.scripts.hound_data_align.quantify_leakage(align_net_file, train_contigs, valid_contigs, test_contigs, out_dir)[source]

Quantify the leakage across sequence sets.

baskerville.scripts.hound_data_align.report_divide_stats(fold_contigs)[source]

Report genome-specific statistics about the division of contigs between sets.

baskerville.scripts.hound_data_align.report_divide_stats_v1(train_contigs, valid_contigs, test_contigs)[source]

Report genome-specific statistics about the division of contigs between train/valid/test sets.

baskerville.scripts.hound_data_read module

class baskerville.scripts.hound_data_read.CovFace(cov_file)[source]

Bases: object

close()[source]

preprocess_bed()[source]

read(chrm, start, end)[source]

baskerville.scripts.hound_data_read.interp_nan(x, kind='linear')[source]

Interpolate to fill NaN values (linearly by default, per kind='linear').
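A minimal sketch of NaN filling by linear interpolation, using `numpy.interp` over the indices of the valid entries. The helper name `fill_nan_linear` is hypothetical, and the edge handling (leading or trailing NaNs take the nearest valid value) is a property of `numpy.interp` that may differ from the package's function.

```python
import numpy as np


def fill_nan_linear(x):
    """Replace NaN entries with values linearly interpolated from neighbors."""
    x = np.asarray(x, dtype=float).copy()
    nan_mask = np.isnan(x)
    # Nothing to do if there are no NaNs, or nothing to interpolate from.
    if not nan_mask.any() or nan_mask.all():
        return x
    idx = np.arange(len(x))
    x[nan_mask] = np.interp(idx[nan_mask], idx[~nan_mask], x[~nan_mask])
    return x


filled = fill_nan_linear([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])
edges = fill_nan_linear([np.nan, 2.0, np.nan])
```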

baskerville.scripts.hound_data_read.main()[source]

baskerville.scripts.hound_data_read.read_blacklist(blacklist_bed, black_buffer=20)[source]

Construct interval trees of blacklist regions for each chromosome.

baskerville.scripts.hound_data_write module

baskerville.scripts.hound_data_write.feature_bytes(values)[source]

Convert numpy arrays to bytes features.

baskerville.scripts.hound_data_write.feature_floats(values)[source]

Convert numpy arrays to float features. Requires more space than bytes features for float16 data.

baskerville.scripts.hound_data_write.fetch_dna(fasta_open, chrm, start, end)[source]

Fetch DNA when start/end may reach beyond chromosomes.
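One common way to handle coordinates that run off a chromosome is to pad the missing positions with `N`. The sketch below assumes that behavior and uses a plain dict of strings in place of an open FASTA handle; the helper name `fetch_padded` is hypothetical.

```python
def fetch_padded(genome, chrm, start, end):
    """Return genome[chrm][start:end], padding out-of-range positions with N."""
    chrom_len = len(genome[chrm])

    # Clamp the requested interval to the chromosome bounds.
    lo = max(0, min(start, chrom_len))
    hi = max(0, min(end, chrom_len))
    seq = genome[chrm][lo:hi]

    # Count requested positions before position 0 and past the chromosome end.
    left_pad = max(0, min(end, 0) - start)
    right_pad = max(0, end - max(start, chrom_len))
    return "N" * left_pad + seq + "N" * right_pad


genome = {"chr1": "ACGTACGTAC"}  # toy 10 bp chromosome
left = fetch_padded(genome, "chr1", -3, 5)
right = fetch_padded(genome, "chr1", 8, 13)
```

The returned string always has length `end - start`, which keeps downstream one-hot encoding shapes fixed regardless of where the window falls.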

baskerville.scripts.hound_data_write.main()[source]

baskerville.scripts.hound_data_write.rround(a, decimals)[source]

Round to the specified number of decimals, randomly sampling the last digit according to a Bernoulli random variable.
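This description matches what is usually called stochastic rounding: round up with probability equal to the discarded fraction, so the result is unbiased in expectation. The sketch below is one way to write that with NumPy; the helper name `stochastic_round` and the explicit `rng` argument are assumptions, not the package's signature.

```python
import numpy as np


def stochastic_round(a, decimals, rng):
    """Round to `decimals` places, rounding up with probability equal
    to the fractional remainder at that precision."""
    scale = 10 ** decimals
    scaled = np.asarray(a, dtype=float) * scale
    floor = np.floor(scaled)
    frac = scaled - floor
    # Bernoulli draw per element: True means round up.
    round_up = rng.random(scaled.shape) < frac
    return (floor + round_up) / scale


rng = np.random.default_rng(0)
values = np.full(100_000, 0.123)
rounded = stochastic_round(values, 2, rng)
```

Each output is either 0.12 or 0.13, and the mean over many draws stays close to 0.123, which plain rounding to 0.12 would not preserve.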

baskerville.scripts.hound_data_write.tround(a, decimals)[source]

Truncate to the specified number of decimals.
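For comparison, truncation can be sketched by scaling, dropping the fractional part, and scaling back. The helper name `truncate` is hypothetical, and truncating toward zero (rather than flooring) is an assumption about how negative values are treated.

```python
import numpy as np


def truncate(a, decimals):
    """Drop digits beyond `decimals` places, truncating toward zero."""
    scale = 10 ** decimals
    # Note: binary floating point can place a value just below the
    # intended digit (e.g. 0.29 * 100), so results may be one unit lower.
    return np.trunc(np.asarray(a, dtype=float) * scale) / scale


truncated = truncate([1.2345, 2.999, -1.239], 2)
```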

baskerville.scripts.hound_eval module

baskerville.scripts.hound_eval.main()[source]

baskerville.scripts.hound_eval_spec module

baskerville.scripts.hound_eval_spec.main()[source]

baskerville.scripts.hound_ism_bed module

baskerville.scripts.hound_ism_bed.main()[source]

baskerville.scripts.hound_ism_snp module

baskerville.scripts.hound_ism_snp.main()[source]

baskerville.scripts.hound_predbed module

baskerville.scripts.hound_predbed.bigwig_open(bw_file, genome_file)[source]

Open the bigwig file for writing and write the header.

baskerville.scripts.hound_predbed.bigwig_write(signal, seq_coords, bw_file, genome_file, seq_crop=0)[source]

Write a signal track to a BigWig file over the region specified by seq_coords.

Args

signal: Sequences x Length signal array

seq_coords: (chr,start,end)

bw_file: BigWig filename

genome_file: Chromosome lengths file

seq_crop: Sequence length cropped from each side of the sequence.

baskerville.scripts.hound_predbed.main()[source]

baskerville.scripts.hound_snp module

baskerville.scripts.hound_snp.main()[source]

baskerville.scripts.hound_snp_slurm module

baskerville.scripts.hound_snpgene module

baskerville.scripts.hound_snpgene.main()[source]

baskerville.scripts.hound_train module

baskerville.scripts.hound_train.main()[source]

Module contents