Comp Bio Exam 3

created 6 years ago by srigot55
updated 6 years ago by srigot55

5 measures of a genome and how to measure them computationally

card image

genome size

  • use genome assembly to measure a genome size
  • In prokaryotes the number of genes is linearly correlated with genome length; this correlation does not hold in eukaryotes
  • C-Value paradox= genome size does NOT correlate with organismal complexity
  • some taxa vary more than others
  • TE and other repetitive sequences seem to be biggest cause of variation

C- value paradox

  • c-value = total amount of DNA in a haploid genome
  • paradox= genome size does not correlate with organismal complexity

why are some genomes so big?

  1. main reason: transposable elements (TE) (DNA sequences that insert themselves in a host genome by moving or copying themselves)
    • the number of TEs can correlate with total genome size
    • main contributor to total genome size variation, but not the only source (introns are also important, miscellaneous sequences, other repetitive DNA, etc)
  2. polyploidy and whole genome duplication (especially common in plants)
  3. protection against viruses (if you have more useless DNA, then you are less likely to lose important DNA to a virus)

measuring the number of shared genes

  • use gene annotation to label the function/name of each gene
    • use RNAs to find genes in genomes
  • compare shared gene content between 2 different genomes
    • make a phylogenetic tree to find evolutionary relationships
    • many shared genes = evolutionarily similar = small phylogenetic distance
    • % of genes shared = distance metric
  • most genes are shared in eukaryotes (sequences for core cell functions)
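The %-shared distance metric above can be sketched in a few lines of Python (the gene sets below are toy examples, not real annotations):

```python
# Toy gene sets for two genomes (made-up gene names)
genome_a = {"rpoB", "gyrA", "recA", "dnaK", "ftsZ"}
genome_b = {"rpoB", "gyrA", "recA", "lacZ"}

shared = genome_a & genome_b
# fraction of genes shared, relative to the smaller gene set (one possible convention)
pct_shared = len(shared) / min(len(genome_a), len(genome_b))
distance = 1 - pct_shared   # more shared genes -> smaller phylogenetic distance

print(len(shared), round(distance, 2))  # 3 shared genes, distance 0.25
```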

gene families and copy number variation

  • gene families= group of genes that are repeatedly duplicated and have similar structure and/or function
    • genes duplicated, then the function diverged via subfunctionalization
  • some genomes have more copies of a particular gene than others
    • = copy number variation
    • ex. human olfactory receptor genes- huge variation in copy number across people (we don't know why)
    • ex. whole genome duplication (especially common in plants)


subfunctionalization

  • how a duplicated gene acquires a new, but complementary, function after it is duplicated
  • natural selection does not affect the duplicated copy as much
  • evidence comes from divergent protein sequences, but correlated patterns of gene expression


synteny

card image
  • when 2 genomes have similar linear arrangements of genes along chromosomes
  • decays due to chromosomal inversions, translocations, etc.
    • visible on a dot plot
    • in above picture- more blocks and straight lines = more synteny

dot plot

card image
  • visible measure of synteny between 2 genomes
  • a positive or negative slope depends on the direction the sequences were laid out
  • a= perfect synteny
    • other dots surrounding lines= TEs and other irregularities
  • b= duplications
  • c= inversion
  • d= partial inversion
  • e & f= repetition in both sequences
  • g= homologous sequences with the number of gaps increasing with mutation rate and time
  • h= partial deletion in a sequence
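A dot plot like the cases above can be computed directly; this Python sketch marks a cell wherever a window of one sequence matches a window of the other (window size is a free parameter):

```python
def dot_matrix(seq1, seq2, window=1):
    """Cell (i, j) is 1 when seq1[i:i+window] == seq2[j:j+window]."""
    return [[int(seq1[i:i + window] == seq2[j:j + window])
             for j in range(len(seq2) - window + 1)]
            for i in range(len(seq1) - window + 1)]

# identical sequences -> an unbroken main diagonal (perfect synteny, case a)
for row in dot_matrix("GATTACA", "GATTACA"):
    print("".join(".#"[c] for c in row))
```

Larger windows suppress the stray off-diagonal dots caused by short repeats.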

gene function

  • measure by determining gene ontology
  • ex. find human genome has many more genes for defense and immunity than other organisms
    • due to: adaptive immune system, major histocompatibility complex, immunoglobulin superfamilies, etc.

gene ontology (GO)

  • controlled hierarchical vocabulary for describing and categorizing gene function
  • tree used to fit each gene into a category

protein structure determination methods

  1. x-ray crystallography: about 90% of the entries in the protein data bank (PDB)
  2. nuclear magnetic resonance: about 9% of PDB entries
  3. computational approaches: predicting structures since the above methods are difficult and expensive

Machine learning

way to recognize patterns in large sets of data

  • hidden markov models (HMM)
  • artificial neural networks (ANN)

levels of protein structure

card image
  • primary (1o)- amino acid sequence, linear arrangement
  • secondary (2o)- first layer of folding, alpha helix, beta pleated sheets, loops
  • tertiary (3o)- secondary structures interact among themselves
  • quaternary (4o)- interaction among tertiary structures, protein complexes

ab initio

  • protein prediction methods that use the physical and chemical properties of the 1o structure (amino acids) to predict the 2o structure
  • long, successful history

homology modeling

  • approach to protein modeling that depends on databases of known protein structures to generate a predicted structure
  • uses a database of 1o sequences with known 2o structures to predict new 2o structure
  • more successful in practice than ab initio prediction
  • ex: hidden markov models, machine learning ("training")

Markov chains

  • describes the probability of a "system" being in some "state" after some period of time
  • process of transitioning from one state to another
  • has an associated transition matrix
  • a posterior probability distribution can be obtained from the Monte Carlo method
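A sketch of how a transition matrix pushes a state distribution forward in time (Python; the two states and their probabilities are invented for illustration):

```python
# Toy transition matrix: rows are the current state, columns the next state
# (each row sums to 1)
T = {"A": {"A": 0.9, "B": 0.1},
     "B": {"A": 0.5, "B": 0.5}}

def evolve(dist, T, n):
    """Probability of being in each state after n steps of the chain."""
    states = list(T)
    for _ in range(n):
        dist = {s: sum(dist[r] * T[r][s] for r in states) for s in states}
    return dist

print(evolve({"A": 1.0, "B": 0.0}, T, 1))  # {'A': 0.9, 'B': 0.1}
```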

(1st order) hidden markov models (HMM)

card image
  • type of homology modeling
  • assume you know something about which patterns of amino acids will result in a particular structure
  • "states" are hidden (not known) and each state corresponds to a 2o structure of a protein sequence
    • ex. k=3 states: a loop (L), alpha-helical (α), or beta-pleated sheet (β) regions
  • uses frequencies of adjacent pairs of amino acids in a sequence
  • choose a pathway that maximizes likelihood (lines connecting max's at each red dot, x=AA from 1-N)
  • challenging with large sets of AAs and hidden states (a brute-force search is on the order of k^N)

2 parameters needed for HMM

  1. transition probability for the states
  2. emission probability
    • probability that a given state emits a particular AA

HMM transition probabilities

card image

correlation of amino acid (AA) a and AA b in a sequence

n_ab = number of times b follows a

n_a = number of times a occurs

transition probability: r_ab = n_ab / n_a

can be modified to account for position-specific amino acid frequencies


Emission probability

card image

the probability that AA a is in state k: e_k(a) = n_ka / n_k

n_ka = number of times AA a is in state k

n_k = number of times any AA is in state k

can be used to calculate likelihood
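Both parameters can be estimated by counting over a sequence whose states are known, as in this Python sketch (the state path and amino-acid string are invented; L/H/S stand in for loop/helix/sheet):

```python
from collections import Counter

states = "LLHHHHLLSSSL"   # toy hidden-state path (L=loop, H=helix, S=sheet)
seq    = "GAVLIMPWFYKT"   # toy amino acids aligned to the states above

# transition probability r_ab = n_ab / n_a (n_ab = times state b follows state a)
n_a  = Counter(states[:-1])
n_ab = Counter(zip(states, states[1:]))
r = {(a, b): n_ab[a, b] / n_a[a] for (a, b) in n_ab}

# emission probability e_k(a) = n_ka / n_k (n_ka = times AA a appears in state k)
n_k  = Counter(states)
n_ka = Counter(zip(states, seq))
e = {(k, a): n_ka[k, a] / n_k[k] for (k, a) in n_ka}

print(r["L", "L"], e["H", "V"])  # 0.5 0.25
```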


likelihood of HMM

card image
  • Likelihood of state B= probability of B * transition probability (r)* emission probability (e) for each amino acid
  • find the max likelihood at each state, choose a pathway that maximizes likelihood
  • in a 1st order HMM, the probability of an AA only depends upon the AA directly preceding it in the sequence
  • k = # of states, N = number of AAs
    • k*N possible states
    • k^N possible pathways

HMM "Training"

  • running the HMM algorithm on a known protein structure to obtain realistic estimates of both the emission and transition probabilities
  • Viterbi algorithm= efficient way of training

Viterbi algorithm

  • calculates the most likely next step in the pathway that we should choose
  • "training" for HMM algorithm
  • reduces the number of computations from the order of k^N to k*N
  • only considers the highest likelihood at each step in the sequence (greedy)
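A minimal Viterbi sketch for a 2-state toy model (all probabilities are invented; log-probabilities are used to avoid underflow):

```python
import math

def viterbi(seq, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path, keeping only the best predecessor
    at each step instead of scoring all k^N pathways."""
    V = [{s: (math.log(start_p[s] * emit_p[s][seq[0]]), None) for s in states}]
    for x in seq[1:]:
        row = {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p][0] + math.log(trans_p[p][s]))
            row[s] = (V[-1][prev][0] + math.log(trans_p[prev][s] * emit_p[s][x]), prev)
        V.append(row)
    state = max(V[-1], key=lambda s: V[-1][s][0])   # best final state
    path = [state]
    for row in reversed(V[1:]):                      # backtrace
        state = row[state][1]
        path.append(state)
    return "".join(reversed(path))

# toy 2-state model: H = helix, L = loop, emitting amino acids A and G
start_p = {"H": 0.5, "L": 0.5}
trans_p = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.1, "L": 0.9}}
emit_p  = {"H": {"A": 0.8, "G": 0.2}, "L": {"A": 0.2, "G": 0.8}}
print(viterbi("AAGG", "HL", start_p, trans_p, emit_p))  # HHLL
```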

posterior probability for Markov chain

  • determining the probability of the time the system spends in each state after gaining information from the HMM analysis
  • gaining more info about the Markov chain after doing the HMM
  • determined from repeatedly choosing random pathways via the Monte Carlo method

Monte Carlo

randomly choose pathways in an HMM (x times) and obtain a posterior probability distribution of the markov chain


2nd order HMM

useful for searching for coding regions in a DNA sequence

assign emission probabilities for codon triplets

increasing the order of the HMM algorithm often increases its accuracy (but requires bigger database)


artificial neural networks (ANNs)

  • general way to do both supervised and unsupervised machine learning
  • widely used for handwriting/speech/facial recognition, bioinformatics, and signal processing
  • used a lot in computer programming since it simulates how the brain thinks
  • used to predict protein secondary structure and analyze patterns of gene expression
  • simple ANN: perceptron


card image

perceptron

simple artificial neural network

single "neuron" that takes some amount of input and decides whether or not to "fire"

binary input and output

behavior is governed by an activation function


feed forward network (FF)

card image
  • perceptron receives weighted input signals (usually biological data), processes them, and sends an output signal (a "guess" as to what the pattern might be)

4 steps:

  1. receive input signal (xi)
    1. typically this is a binary (1 or -1)
  2. Weight each input signal by factor of wi
  3. sum across all input signals (C_o)
    1. C_o = sum(x_i * w_i)
  4. use activation function to emit output (-1 or 1)

perceptron "guess" error

  • If the perceptron guesses the correct answer, the error is 0.
  • If the correct answer is -1 and we guess +1, then the error is -2.
  • If the correct answer is +1 and we guess -1, then the error is +2

perceptron backpropagation of error

readjust the weights on the inputs based on the output error to work backwards and minimize the error

new weight= old weight + error * input
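The 4 feed-forward steps and the weight-update rule above can be sketched together in Python (the training data is a toy AND-style problem; the leading input of 1 is an assumed bias term):

```python
def predict(x, w):
    """Steps 1-4: weight the inputs, sum them, then emit -1 or +1."""
    total = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if total > 0 else -1

def train(samples, w, epochs=10):
    """Backpropagation of error: new weight = old weight + error * input."""
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(x, w)        # 0, +2, or -2
            w = [wi + error * xi for wi, xi in zip(w, x)]
    return w

# toy linearly separable data; the first input is a constant bias of 1 (assumption)
samples = [((1, -1, -1), -1), ((1, -1, 1), -1), ((1, 1, -1), -1), ((1, 1, 1), 1)]
w = train(samples, [0.0, 0.0, 0.0])
print([predict(x, w) for x, _ in samples])  # [-1, -1, -1, 1]
```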


neural network structure

  • single perceptron can only solve linearly separable problems -> network for more complicated problems
  • typical FF ANN model has >2 layers (input layer, output layer, and hidden layers)

hidden layers (ANN)

  • size and number of hidden layers is arbitrary
  • more hidden layers= more specific, but slower output
  • layers have no inherent meaning
  • can recognize more complex patterns

biological uses for ANN

  • used to predict protein secondary structure and analyze patterns of gene expression
  • Jones (1999) found ANN methods accurately predicted protein secondary structure 77% of the time
    • Peterson (2000) reported 80%
  • gene expression for cancer research:
    • input= transcription for a gene is on or off
    • output= some phenotype (ex. signaling pathway is activated)
    • O'Neill and Song (2003) could predict survival of patients with B-cell lymphoma based on the expression of 4,026 genes


ANNs vs HMMs

  • ANN= more arbitrary, general, flexible, powerful, sensitive (can recognize more complex patterns), and more accurate (77-80% accuracy at predicting protein secondary structure)
    • HMM= easier to "train" and implement

methods for predicting RNA structure

  • secondary structure can be determined using computational methods and the primary sequence
  • tertiary structures must be determined by x-ray crystallography

types of RNA secondary structure

card image

loops usually need at least 3 bases

  • internal loop
  • multi-branched loop
  • hairpin loop
  • bulges

Most stable bonds: purines (A&G) with pyrimidines (U&C)

G-C pairing is the best predictor of stability in an RNA structure


RNA structure models (i-j k-l base pairs)

  • bases numbered 1 to N, with i and j complementary (i < j) and k and l complementary (k < l)
  • pairs are compatible if:
    • i < j < k < l (side-by-side configuration)
    • i < k < l < j (nested/looped configuration)
    • i < k < j < l (pseudoknot- observed rarely in real RNA structures and generally excluded from algorithms)

RNA structure algorithm

  • find a structure that maximizes the number of base pairs in a sequence
    • minimizes free energy (=E, thermodynamically stable)
    • makes a matrix of energies for each pairing to find the min E (recursive)
  • no bases within 3 bases of each other can pair
  • use matrix of pointers to keep track of the min E (like backtrace matrix in sequence alignment)
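The recursive matrix of energies can be sketched with the classic base-pair-maximization recursion (a simplification: this counts pairs rather than summing measured free energies; the min-loop rule keeps paired bases more than 3 positions apart):

```python
def max_pairs(seq, min_loop=3):
    """M[i][j] = max base pairs in seq[i..j]; filled from short spans to long."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    M = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = M[i + 1][j]                      # base i left unpaired
            for k in range(i + min_loop + 1, j + 1):
                if (seq[i], seq[k]) in pairs:       # pair i with k, split the rest
                    best = max(best, 1 + M[i + 1][k - 1]
                                       + (M[k + 1][j] if k + 1 <= j else 0))
            M[i][j] = best
    return M[0][n - 1]

print(max_pairs("GGGAAACCC"))  # 3 (a hairpin with three G-C pairs)
```

A real implementation would also keep a matrix of pointers so the winning pairing can be traced back, as in sequence alignment.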

compensatory and other mutations in RNA

card image
  • compensatory mutations- if there is a mutation at one position in the RNA structure, then there is usually a mutation at its complementary position to maintain its secondary structure
  • mismatches are rare
  • G-C bonds are highly conserved (mutability (tendency to mutate) is low)

RNA secondary structure implications on phylogenetics

  • bases do not evolve independently because of compensatory mutations
    • conflicts with neutral theory
  • accounting for RNA secondary structure greatly increases the accuracy of phylogenetic reconstruction

ribozymes and the origin of life

  • ribozymes= RNA structures with catalytic functions
  • origin of life: ribozymes may have started life because they contain hereditary material and could catalyze their own replication (unlike DNA or proteins alone)
  • hammerhead ribozymes: RNAs with conserved secondary structure that is autocatalytic


transcriptome

totality of mRNA in a cell at any given time


microarray definition

a common high-throughput method for quantifying the transcriptome (amount of mRNA in a cell)

depends critically on bioinformatics methods


types of RNAs

  • mRNA- translated into protein and has an open reading frame (ORF)
    • ORFs- can predict if a sequence is going to be translated into a protein (but some ORFs exist that do not make proteins- can be misleading)
  • interfering RNAs: post-transcription gene silencing (suppress translation of the RNA "message"), short sequences
  • long non-coding RNAs (lnc RNAs): mostly no ORFs, thought to be junk but might have a function in chromosome remodeling

types of gene expression data

  • microarrays: hybridization of RNA to slides (sequence RNA)
  • ChIP-Seq: chromatin immunoprecipitation and sequencing
  • RNAseq: "whole-transcriptome" shotgun sequencing (sequence RNA without a template)

microarrays steps

card image
  1. isolate and purify the mRNA from samples (usually 1 control and 1 variable)
  2. reverse transcribe and label the mRNA
    • amplify via PCR then produce a complementary DNA (cDNA) strand with different fluorescent dyes (usually red and green) for the control and variable
  3. hybridize the labeled target in the microarray
  4. scan the microarray and quantitate the signal
    • amount of target sequence bound to each probe (amount of each color of fluorescence) correlates to the level of expression of the control or variable genes

analyzing microarray results

card image
  • ratio of (red)/(green) indicates up/down regulation of gene expression of red (up if >1, down if <1)
  • take the log2 of the ratio to tell real differences in samples (M = log2(R/G))
    • if M = 1 then doubling of expression of red; if M = -1 then doubling of expression of green; M = 0 means no difference
  • normalize for different positions on a slide -> take the average intensity of a spot (A = 1/2 * log2(R*G))
  • bias in intensity across a slide -> plot M vs A
    • use LOWESS (locally weighted scatterplot smoothing) to normalize the bias
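The M and A values can be computed directly (Python; the red/green intensities are made-up numbers, and log base 2 is assumed so that M = 1 means a doubling):

```python
import math

spots = [(2000.0, 1000.0), (500.0, 500.0), (300.0, 1200.0)]  # toy (R, G) pairs

M = [math.log2(r / g) for r, g in spots]        # >0: up in red; <0: up in green
A = [0.5 * math.log2(r * g) for r, g in spots]  # average intensity of each spot

print(M)  # [1.0, 0.0, -2.0]: doubled, unchanged, 4x higher in green
```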

detecting changes in gene expression in microarrays

card image
  • assumes that random variation in gene expression is normally distributed
  • normalize the data using a z-score
    • | z scores | > 1.96 -> significantly different
  • also alternative statistical methods

pitfalls of microarrays

  • difficult to obtain "controls" for microarray experiments
  • transcriptome may vary randomly over: time, cells, tissue types, etc.
    • high variation= big issue
  • do differences in gene expression actually produce biological differences in phenotypes?
    • good for "scans"- gives you ideas that can be validated with more experiments

clustering distances for microarrays

  • can be used to detect large-scale patterns in microarray data
  • ex. define a matrix with columns are arrays from different samples and rows are intensity ratios from different genes
    • clustering by column: relationship between gene expression in different samples
    • clustering by rows: relationship between expression in different genes
  • different distance measurements produce different results -> post-hoc analysis


cistrome

catalog of all cis-regulatory elements in a genome (elements that regulate the genes adjacent to them)

the cistrome of DNA-associated proteins can be identified using ChIP-Seq


ChIP-Seq definition

  • used to analyze protein-DNA interactions
  • combines chromatin immunoprecipitation with massively parallel DNA sequencing to identify the cistrome of DNA-associated proteins
  • once you've sequenced all the DNA for a protein, you can use BLAST to find the regions of the genome where that transcription factor is binding
  • greater number of bound sequences = greater "signal" = measure of how strongly the protein binds that sequence
    • build a map of where in a genome the transcription factor (initiate transcription and control gene expression) is binding

ChIP-Seq Steps

card image
  1. isolate the DNA then chop the genome into fragmented chromatin
  2. immunoprecipitate-use an antibody to bind to specific proteins (transcription factors)
  3. DNA is recovered, sequenced, and aligned to a reference genome to determine which sequence is bound to each protein

problem with ChIP-Seq

  • if you find a regulatory sequence, it is hard to tell which gene it is regulating
    • assume it's the gene next to the regulatory sequence (acts in cis)
    • problem: 2 genes can be co-regulated from the same regulatory sequence at different places on the genome (due to DNA folding)


RNAseq

method that can be used to:

  • map gene and exon boundaries
  • study gene expression in specific tissues
  • characterize transcriptome complexity or assay expression from unannotated genomic regions
  • precisely measure transcript abundance
    • number of reads matched to a gene (= level of gene expression)

1st generation (Sanger) sequencing

  1. DNA is fragmented
  2. cloned into a plasmid vector
  3. cyclic sequencing reaction
  4. separation by electrophoresis
  5. readout with fluorescent tags

very expensive and inefficient

has been replaced with "next generation" sequencing


parallelized sequencing

card image

sequencing DNA

  1. DNA is fragmented
  2. adaptors are ligated to fragments
  3. PCR colonies
    • sequence all DNA that hybridizes to a plate
  4. add an enzymatic extension with a fluorescent-tagged nucleotide
  5. cyclic readout by imaging fluorescence (microarrays that are imaged at each cycle)

can run millions of fragments at a time -> much more efficient and precise than Sanger (1st gen sequencing)


next-generation platforms of DNA sequencing

card image
  • Roche 454-based on “emulsion PCR”
  • Illumina/Solexa-based on “bridge PCR”
  • ABI SOLiD-based on “emulsion PCR”
  • Pacific Biosciences

Emulsion PCR

card image

used by 454, polonator, and SOLiD next-gen sequencing

  • DNA fragments (with adapters) are PCR amplified within a water drop in oil to make "beads"
  • one primer is attached to the surface of a bead
  • pyrosequenced afterwards


pyrosequencing

  • DNA "sequencing by synthesis"
  • if you have one strand of DNA, you can find the complementary strand by reading the flash of light emitted as each base is incorporated (chemiluminescent signals determine the sequence)
  • used after emulsion PCR to reassemble the sequence

454 DNA sequencing platform

  • uses emulsion PCR and pyrosequencing
  • reads are a good length (~600 base pairs for Drosophila)
  • expensive, but not as bad as Sanger
  • uses standard flowgram format (SFF)

Bridge PCR

card image
  1. DNA fragments are flanked with adapters
  2. surface is coated with 2 types of primers (corresponding to adaptors)
  3. cyclic amplification process results in multiplied fragments all standing up on adapters:
    1. each strand bends over and attaches to a neighboring primer to form a "bridge"
    2. a polymerase reaction copies the bridge into 2 strands
    3. denaturation detaches the strands, which stand back up


Illumina/Solexa sequencing

  • next-gen DNA sequencing by bridge PCR; reads are generated by determining the first base, imaging the first base, determining the 2nd base, imaging the 2nd base, etc.
  • uses Fastq file format and phred quality codes
  • truncating data may be necessary because quality scores decline in the last bit of each read
  • produces shorter reads

Fastq file format

used in Illumina/Solexa sequencing

4 lines:

  1. Line 1: '@ sequence identifier #1'
  2. Line 2: sequence
  3. Line 3: '+ sequence identifier #2'
  4. Line 4: Phred Quality codes (how likely it is that the base is an error)

Very fast and computationally easy, but hard to index since '@' and '+' are identifiers and also used in the quality characters


Phred quality code

card image
  • used to indicate how likely it is that the base was called in error in DNA sequencing (Fastq file format)
  • readable characters starting at ASCII 33
  • quality score: Q = ASCII - 33
  • p = probability of error -> smaller is better
    • error rate: p = 10^-(Q/10) (ex. Q of 30 -> likelihood of error = 10^-3)
  • average scores are 30-40
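The conversion from a quality character to an error probability is mechanical (Python sketch):

```python
def phred(ch):
    """FASTQ (Phred+33) quality character -> (Q score, error probability)."""
    q = ord(ch) - 33          # readable characters start at ASCII 33 ('!')
    p = 10 ** (-q / 10)       # p = 10^-(Q/10); smaller is better
    return q, p

q, p = phred("I")
print(q, p)  # ord('I') = 73 -> Q = 40, error probability 10^-4
```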

paired-end reads

to more accurately align sequences (particularly shorter fragments and in repetitive regions)

break the genome into large fragments (>100 base pairs) then do a bridge PCR to get sequences from both ends that are a known distance apart


third generation DNA sequencing

  • nanopore- nucleic acids are fed through a nanopore, and the change in conductance is recognizable for each base
  • real-time monitoring of polymerase activity- zero-mode waveguides allow direct observation of the polymerase incorporating fluorescent-labeled nucleotides

genome assembly

de novo or with template (reference)

paired-ends reads greatly increase the accuracy of assemblies


template-based genome assembly

  • requires a well-curated reference genome sequence with >94% sequence identity
  • methods either index (break into smaller seeds) the genome reference sequence or the read sequences
  • match seeds as efficiently as possible (since there are so many reads to place on the reference genome)

De novo genome assembly

does not use a template DNA

De Bruijn graph is widely used

usually results in a large number of contigs which can be assembled into scaffolds that use distance information

ideally want as few scaffolds as possible, each with long lengths


SAM/BAM assembly format

file format that is easy to work with once you understand it

indexes well

used in template-based sequence alignment

can read Fastq files


De Bruijn graph

card image

de novo genome assembly method

  1. break each sequence read into overlapping fragments of size k (k-mers)
  2. form the graph such that each (k-1)-mer is a node (adjacent nodes overlap by k-2 bases)
  3. an edge exists between nodes a and b if there is a k-mer with prefix a and suffix b

need to prune as you go to avoid a huge graph

no unique solution is guaranteed
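The graph construction can be sketched in Python. One common formulation (assumed here) makes each (k-1)-mer a node and each k-mer an edge from its prefix to its suffix:

```python
def de_bruijn_edges(reads, k):
    """Each k-mer contributes one edge: (k-1)-mer prefix -> (k-1)-mer suffix."""
    edges = []
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.append((kmer[:-1], kmer[1:]))
    return edges

edges = de_bruijn_edges(["GATTACA"], 3)
print(edges)  # [('GA', 'AT'), ('AT', 'TT'), ('TT', 'TA'), ('TA', 'AC'), ('AC', 'CA')]
```

Walking a path through these edges spells the read back out; with many overlapping reads the graph merges shared k-mers, which is why pruning is needed.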



contigs

  • contiguous stretches of assembled sequence of variable length
  • de novo genome assembly usually results in large number of contigs
  • can be assembled using scaffolds

De bruijn graph alignments

card image
  1. spur: caused by a sequencing error
  2. repetitive- usually 2 sequences connected by a TE in the middle
  3. loops- tandem repeats (multiple pathways)
  4. bubble- caused by misaligned sequences

use paired-end reads to resolve repetitive sequences and misalignment


measuring protein-protein interactions (PPIs)

  • Tandem affinity purification with mass spectrometry (TAP-MS)
  • put a "bait" protein into a cell, wait, then pull it out and see what is attached
    • then get profiles through mass spectrometry

graph theory and adjacency matrix for PPIs

card image
  • use linear algebra to calculate properties of PPIs
  • adjacency matrix: each row and column is a protein and 1=bound, 0= not bound
  • degree of node= sum of row or column for that protein (number of proteins its bound to)
  • Dijkstra's algorithm: finds the shortest path between 2 nodes
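A toy adjacency matrix makes these definitions concrete (Python; the proteins and interactions are invented, and since the edges are unweighted, Dijkstra's algorithm reduces to breadth-first search):

```python
from collections import deque

proteins = ["P1", "P2", "P3", "P4"]
A = [[0, 1, 1, 0],        # symmetric adjacency matrix: 1 = the two proteins bind
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]

# degree of a node = row (or column) sum = number of interaction partners
degrees = {p: sum(row) for p, row in zip(proteins, A)}

def shortest_path_len(A, s, t):
    """BFS shortest path between nodes s and t (Dijkstra's with unit weights)."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return dist[u]
        for v, linked in enumerate(A[u]):
            if linked and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None

print(degrees, shortest_path_len(A, 0, 3))  # P3 is the hub; P1 -> P4 takes 2 steps
```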


betweenness centrality

describes the centrality of a given node in the biological network

frequency with which a node is located on the shortest path between all other nodes

centrality = measure of pleiotropy


3 types of biological networks (just names)

  1. random
  2. scale-free
  3. hierarchical

random biological network

card image

real networks look nothing like this

  • node degree follows a Poisson distribution with link probability p
    • a few nodes are lowly or highly linked, but most have close to the average number of links
  • clustering size (c) is constant

scale-free biological network

card image

Networks have been reported to be scale-free, but this has not been statistically proven

  • The probability that a node has k links is k^-γ, where γ is the degree exponent
    • degree distribution follows a power law (heavy-tailed: most nodes are lowly linked and few are highly linked)
  • probability (p) of linking decreases with increasing number of links (k)
  • clustering size (c) is constant

hierarchical biological network

card image

modular sub-networks that are very structured

most commonly seen network in nature

  • probability (p) of linking decreases with number of links
  • high clustering (c) of nodes (decreases with node degree)

3 types of mutations in PPI networks

card image

rate of protein mutations (with respect to PPIs)

  • initially thought that proteins with fewer interactions evolve more rapidly than those with more interactions
  • found a strong correlation between slower rates of change and proteins with more interactions
    • more PPIs = more conserved
    • Fisher's theory (incorrect): "cost of complexity"- the more complex a protein is (more pleiotropy and centrality), the less the protein can adapt
    • Truth: proteins with large pleiotropy can still have adaptive mutations, but they must be followed by compensatory mutations


pleiotropy

  • the "effect" of the mutation
  • larger pleiotropy = larger effect on the organism/protein
    • = larger centrality (measure of pleiotropy)
  • Fisher's theory (incorrect): "cost of complexity"- the more complex a protein is (more pleiotropy and centrality), the less the protein can adapt
  • Truth: proteins with large pleiotropy can still have adaptive mutations, but they must be followed by compensatory mutations


pathway alignment

card image
  • pathway alignment algorithm for identifying conserved pathways in 2 PPI networks
  • finds gaps and mismatches
  • finds best pathway based on the probability of homology between paired proteins (using regular BLAST) and the probability of a false-positive in the PPI assay (TAP-MS)
    • best path= highest alignment scores
  • results were not very reproducible and didn't really tell us anything we didn't already know (ex. some paths are more conserved than others)


metagenomics

  • Metagenomics: analysis of DNA sequences that were sampled from some environment
  • Bioinformatics is central to the practice of metagenomics
  • good way to discover new proteins/genes in nature
    • use conserved primers in a sample and see what you recover (since most proteins/DNA will bind to a conserved primer)
  • led to the study of microbial communities
  • started with Sanger sequencing 16S rRNA genes from a sample of sea water
    • 16S rDNA is still used for high-level taxonomic classification

bioinformatics pipeline

  • computational analysis protocol that can be standardized across studies and data sets
  • goal: automate as much of a pipeline as possible
  • detailed explanation of how your analysis process works (like a flowchart)

importance of databases in metagenomics

  • to identify something as "novel" you need to search a huge database to make sure it is actually new
    • check to make sure it doesn't just seem new because it isn't contained in your database

Clusters of orthologous groups (COGs)

  • database that attempts to annotate functional protein sequences from a wide variety of genomes
    • purpose: functional categorization/prediction
  • reciprocal BLAST
  • if you perform BLAST and 3 genes each return each other as the best hits, then they are a COG
    • assume all 3 have similar functions
    • works in trios
  • use COGs to determine the function of a new protein
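The reciprocal-best-hit test for a trio can be sketched once the BLAST searches are done; here `best_hit` is a hypothetical stand-in for the best BLAST hit of each gene in the next genome (gene names are made up):

```python
# hypothetical best BLAST hits between three genomes (toy data)
best_hit = {
    ("genomeA", "g1"): ("genomeB", "g2"),
    ("genomeB", "g2"): ("genomeC", "g3"),
    ("genomeC", "g3"): ("genomeA", "g1"),
}

def is_cog_trio(a, b, c):
    """Simplified test: the three genes form a closed cycle of best hits."""
    return best_hit[a] == b and best_hit[b] == c and best_hit[c] == a

print(is_cog_trio(("genomeA", "g1"), ("genomeB", "g2"), ("genomeC", "g3")))  # True
```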


ancient DNA sequencing

  • sequencing preserved biological remains
  • uses bioinformatics techniques inherited from metagenomics
  • environment the organism lived in affects your results
    • your sample will contain your organism of interest + viruses/bacteria, other species, environmental sequences, etc.
    • need to filter out that other stuff

length of reads in metagenomics

  • longer reads are better than short for taxonomic assignment and gene discovery
    • tells you how many of those reads are in the genome (frequency of reads)
  • short is better than long for taxonomic diversity
    • higher sampling density
  • long reads could be from 454 pyrosequencing and short from Illumina
  • use sampling techniques that complement each other

environmental genomics

richness of biotic environments (protein diversity) mostly levels off after you sample enough of an environment

functional metagenomics: can compare function of protein found in different environments with clustering algorithms



human microbiome

  • human body has 10x more microbial cells than human cells
  • variation in microbial communities between individuals and between different parts of the body
    • which types of bacteria are doing which function may vary, but the function always gets done
    • 3 main communities of bacteria: skin, mouth and gut
  • infant microbiome starts off fairly dynamic, but it converges by adulthood
  • obese people have less bacterial diversity, but they have an increased capacity to harvest energy