Towards Constructing Physical Maps by Optical Mapping:
An Effective, Simple, Combinatorial Approach
(Extended Abstract)
S. Muthukrishnan*    Laxmi Parida†
Abstract
We initiate the complexity study of physical mapping
with the emerging technology of Optical Mapping
(OM) pioneered by the team led by David Schwartz
at the W. M. Keck Laboratory for Biomolecular Imaging,
Dept. of Chemistry, NYU. In currently popular
electrophoretic approaches, information about the relative
ordering of the fragments comprising the DNA
molecule is lost, thus leading to difficult computational
problems of composing the fragments into a physical
map depicting their relative order. In contrast, the
relative ordering of the pieces is readily obtained in
OM. However, OM faces serious technological challenges
as it has low resolution and is fault-prone.
We take a combinatorial approach to the problem
of constructing physical maps from the erroneous data
generated by OM. We identify two abstract problems
in this context, namely, the Exclusive Binary Flip-Cut
and Exclusive Weighted Flip-Cut problems. For both,
we present polynomial time approximation schemes.
However, our main contribution here is an extremely
simple heuristic algorithm that rapidly and accurately
(within 3% error) constructs the physical map from
input data with immense experimental errors and imprecision
(even with only 10% expression of a restriction
site in the molecules).
Our strong experimental results, while being preliminary,
seem to indicate that although OM has immense
experimental imprecision, the errors appear to
be "local" and hence more easily manageable than
the ones in other approaches where the errors appear
"global". Also, although OM may not be suitable for
producing physical maps at the resolution of a few base
pairs, our results indicate that it may be appropriate
for rapidly generating accurate physical maps at the
resolution of a few hundred base pairs.

*Bell Labs, Lucent Technologies. Work partly done while
at Dept. of Computer Science, Univ. of Warwick, UK, and
while visiting DIMACS partly supported by the NATO grant
CRG 960215.
†Dept. of Computer Science, NYU, USA,
parida@cs.nyu.edu.

Permission to make digital/hard copies of all or part of this material for
personal or classroom use is granted without fee provided that the copies
are not made or distributed for profit or commercial advantage, the copyright
notice, the title of the publication and its date appear, and notice is
given that copyright is by permission of the ACM, Inc. To copy otherwise,
to republish, to post on servers or to redistribute to lists, requires specific
permission and/or fee.
RECOMB 97, Santa Fe, New Mexico, USA
Copyright 1997 ACM 0-89791-882-8/97/01 ..$3.50
1 Introduction
A step towards the ultimate goal of many efforts
in Molecular Biology (including the Human Genome
Project), namely to determine the entire sequence of
Human DNA and to extract the genetic information
from it, is to build physical maps of portions of the
DNA [9, 5]. A physical map merely specifies the location
of some identifiable markers (restriction sites of
up to 20 base pairs) along a DNA molecule. Physical
maps provide useful information about the arrangement
of the DNA, and they serve as recognizable posts
to help search it. In this paper, we propose and study
the complexity of a combinatorial approach to constructing
physical maps of medium sized molecules
(20K - 40K base pairs long) using an emerging technology,
called Optical Mapping [15].
There are several known technological approaches
to building physical maps with their associated computational
problems [16, 1, 10, 12, 13, 7]; most of
these use restriction enzymes. A restriction enzyme
is an enzyme that recognizes a unique sequence of
nucleotides and it cleaves every occurrence (called a
restriction site) of that sequence in a DNA molecule.
In a well-established approach to physical mapping, a
restriction enzyme is applied to cleave the molecule
at these restriction sites producing pieces of the
molecule. In this process, the information about their
relative positioning is lost. Thus we are faced with the
problem of assembling these pieces into their relative
order: this leads to difficult combinatorial and computational
problems (such as the partial digest problem,
probed-partial digest problem, etc.) most of which
are NP-hard, and many of which have been extensively
studied from the point of view of workable heuristics
(see [9, 7, 1] etc. and Section 3 of [14] for several
open problems in this area).
An alternative approach to physical mapping is
based on a new technology pioneered by David
Schwartz at the W. M. Keck Laboratory for Biomolecular
Imaging, Dept. of Chemistry, NYU, called the
Optical Mapping (OM) technology [8, 11, 15]. At a
very high level, here is an overview of that method.
A single strand of a DNA molecule is attached to the
surface of a slide by electrostatic forces. Then it is
treated in a controlled manner with a restriction enzyme.
The molecule still remains attached to the slide
although the restriction sites get digested by the enzyme.
Now by applying appropriate fluorescent dyes,
the molecule may be viewed under a microscope or
recorded by a camera as an image on a computer.
For a more detailed description of this complex process,
see [8, 11, 15].
As is clear from our overview of OM, the relative
order of the pieces is not lost. In fact, the image
itself is a physical map (although perhaps not at desirable
levels of resolution, and not in a form compatible
with genomic data we handle now). In this sense, this
technology seems to cut through the Gordian Knot of
physical mapping¹ described above faced by current
technologies, such as gel electrophoresis. However,
OM too faces severe difficulties: at the core, the technological
process is highly error-prone. Some such
issues are: (i) poor digestion of restriction sites and
physical factors such as the coiling of DNA and fragments
getting washed away, leading to a high rate of
false negatives and false positives, (ii) noise and lack
of precision in capturing and processing images, and
(iii) crude measures of parameters such as intensity,
length, etc. Thus the problem of physical mapping is
not immediately solved by OM. Nevertheless, it is a
promising technology that is being made more robust
(see the second generation versions in [11]).
In this paper, we consider the computational problem
of constructing physical maps from the OM technology.
For exposition in this section, consider the
following idealized version of the problem. The image
processing software, after analyzing the image obtained
from OM, generates a discretized binary string
of the molecule indicating the presence of restriction
sites along it. This resolution is not at the level of
base pairs (bps)². If the technology were perfect, that
would suffice as a physical map (modulo the resolution).
However, because of poor digestion rates, not all sites
are represented in that string. In order to get all
the sites, several experiments (hundreds) are done on the
same molecule (but with different sample molecules)
and the same restriction enzyme. Thus the restriction
sites will be those obtained by consensus from these
experiments. Of course, there are now basic technological
problems in getting the discretized strings with reasonably
consistent alignment of the string positions.
However, the major conceptual problem is that the
different samples are not necessarily laid down along
the same direction on the slide. Specifically, each sample
is laid down along one of two anti-parallel directions.
There are exponentially many alignments of
the string positions depending on how each sample is
laid, and informally the problem we study here is to
decode the direction for each molecule and isolate the
consensus restriction sites. Formally, we study an optimization
version of this problem which we call the
Binary Flip-Cut (BFC) problem. Handling real data
is considerably harder. In particular, the positioning
of the restriction sites as reported by the imaging
software may not be accurate. For this case, we
generalize the version above to the Weighted Flip-Cut
(WFC) problem and study that as well. (See Sections
2 and 3 for the precise definition of the problems.)

¹It was a knot tied by Gordius, king of Phrygia, held to
be capable of being untied only by the future ruler of Asia;
it was unceremoniously cut by Alexander the Great with his
sword! Now the phrase "cut the Gordian Knot" is used to
mean solving an intricate problem in a surprisingly different,
highly effective manner.
²For a molecule of 20000 base pairs, the discretized string
usually has 200 positions.
Our contributions are as follows. First, we initiate
the study of the computational complexity of physical
mapping by OM. In particular, we take a combinatorial
approach and formulate two novel problems,
namely, the BFC and WFC problems. In solving
these problems, we reduce them to certain dense, hard
optimization versions that we call the exclusive BFC
and exclusive WFC problems respectively. Our main
technical contribution consists of theoretical and, more importantly,
efficient practical results for the exclusive versions
of the BFC and WFC problems. Our theoretical result is
a strong approximation result: a polynomial time approximation
scheme for them (that is, a polynomial
time algorithm that, for any fixed fraction $\epsilon$, produces
a solution that is at least $1 - \epsilon$ of the maximum (optimal)
solution).
The bulk of what we consider our contribution
comes from our simple heuristic algorithm for the exclusive
BFC and WFC problems (the core of BFC
and WFC problems that is hard). It is an appropriate
greedy algorithm that may be viewed as doing
limited backtracking; as primitives, it merely uses
sorting and bookkeeping. We do not prove anything
nontrivial for this heuristic (it is a 0.5 approximation
algorithm but that is trivial for exclusive BFC and
WFC problems). But this algorithm is extremely
accurate in predicting the direction of each sample
molecule and the consensus restriction sites. See Section 4 for
detailed descriptions and figures. To sum up, we claim:
our simple heuristic, running on a Sun SPARCstation
2, rapidly (< 1 min) and accurately (gross overestimate
of 3% or 1000 bps error) computes the physical
map of medium sized molecules (40K) from real data
with immense experimental and image processing error
(most restriction sites having only 10% expression
in molecules).³

³As a digression, consider the compromise in the quality
due to the error in our algorithm. The gross upper bound of
3% error scales to an error in placement of a restriction site
of roughly 1000 base pairs (bps). This segment of ambiguity
is well within the limits of current sequencing technologies, so
if we needed more refined physical maps, we can do so by additional
conventional sequencing guided by the output of our
algorithm.

We discuss three further points. First, why does
our simple heuristic algorithm perform as well as it
does? (In contrast, for physical mapping arising from
other technological approaches, sophisticated heuristics
such as the Lin-Kernighan heuristic, or Hamming
metric TSP were used.) We believe the explanation
lies in the strength of the OM technique.
Although it measures lengths and other parameters
coarsely (in contrast to, say, gel electrophoresis) the
errors are local, such as in the boundary of the restriction
sites etc. (in contrast, in currently prevalent approaches
to physical mapping, interaction between
clones far apart can affect the quality of data and
therefore errors appear "global"). Therefore, local
search methods such as ours will tend to work well.
Second, does our result bring any insight to OM
technology? Following from the point above, perhaps
it is true that although OM is more error-prone,
the errors are computationally more manageable since
they have a "local" nature. Also, from our experimental
evidence, we believe that the data from OM can
be rapidly analyzed to obtain fairly accurate physical
maps although not refined to the level of bps. Combined
with the potential for automating the entire
process, this might be the strength of OM (as opposed
to generating data for very high quality physical
maps given more computational resources). David
Schwartz, the pioneer of OM, expressed this intuition
in a personal communication, before we began work
on this problem.
Finally, how far is the goal of physical mapping
by OM resolved by our work? There are several other
combinatorial formulations and cost functions we can
envisage. Non-combinatorial approaches (e.g., probabilistic,
maximum likelihood) are relevant as well,
and some of these are currently under investigation
[2]. It remains to be seen how these formulations and
solutions compare with ours. Also, OM is an evolving
technology. Therefore, new technical problems arise
with changes in the laboratory procedures. In this
paper, we have tackled only one version for which we
obtained the real data from the lab.
Map. In Section 2, we describe our results for the
BFC problem. We sketch the modifications to handle
real data in Section 3 using WFC. In Section 4,
we present a small sample of our experimental results
with the real data.
2 The BFC problem
In this section, we consider the binary flip-cut problem
(BFC) informally stated below. Given n binary
molecules each with m sites, determine a subset of
sites (called the cuts) and an assignment of flip or no-flip
to each of the molecules so that the number of consensus
cut sites is minimized; a cut site is a consensus
one under an assignment of flips to the molecules if at
least cn 1's line up on that site when the molecules are
flipped accordingly, for some small constant parameter
c. A flip of a molecule is its reversal. In reality, c
depends on various experimental parameters such as
the false positive and false negative rates, enzyme digestion
rate etc. Although there is no inherent reason
to look for minimizing consensus cut sites, in the absence
of additional discriminatory evidence, seeking
such a "minimal" explanation for the input data seems
suitable. Throughout the paper, the conjugate of column
i is the column m+1-i; we denote the conjugate
of i by $\bar{i}$.
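The definitions above (flip, conjugate, consensus cut) can be made concrete with a small sketch. It assumes molecules are given as equal-length lists of 0/1 values; the names `flip`, `conjugate`, and `consensus_cuts` are illustrative and do not come from the paper.

```python
from typing import List, Sequence


def flip(molecule: Sequence[int]) -> List[int]:
    """A flip of a molecule is its reversal."""
    return list(reversed(molecule))


def conjugate(i: int, m: int) -> int:
    """Conjugate of the 1-based column i among m columns: m + 1 - i."""
    return m + 1 - i


def consensus_cuts(molecules: Sequence[Sequence[int]],
                   flips: Sequence[bool],
                   c: float) -> List[int]:
    """Return the 1-based sites where at least c*n 1's line up
    after reversing every molecule k with flips[k] == True."""
    n = len(molecules)
    m = len(molecules[0])
    oriented = [flip(mol) if f else list(mol)
                for mol, f in zip(molecules, flips)]
    column_sums = [sum(row[j] for row in oriented) for j in range(m)]
    return [j + 1 for j, s in enumerate(column_sums) if s >= c * n]
```

With three molecules `[1,0,0,0]`, `[0,0,0,1]`, `[1,0,0,0]`, flipping only the second one lines up all three 1's on site 1, whereas leaving every molecule as-is yields no consensus site at c = 0.9.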
Even though we formalized the problem combinatorially
as above, in our approach to its solution we
kept the spirit of the underlying problem in mind.
Specifically, our approach to solving this problem is
the following two step process. In the first step, called
the elimination step, we eliminate sites and their conjugates
in pairs as described below. Eliminating the
sites in conjugate pairs means that positions which
might map on to each other because of flips continue
to be able to do so. Thus this elimination is a reduction
that does not affect the optimization criteria on
the remaining sites owing to the molecule flips. In the
second step, we solve a more specific problem, namely
the exclusive BFC problem, which is the original BFC
problem except that for each conjugate pair i, $\bar{i}$, precisely
one of them may be a cut site. For a collection
of molecules, this fixes the number of cut sites and
therefore we need alternate optimization criteria for
this problem. We chose the total number of 1's in the
cut sites as this measure.
Therefore, formally, the exclusive BFC problem is
as follows. Given n binary molecules of m sites each,
determine the flip for each molecule and an assignment
of either i or $\bar{i}$ as a cut (but not both) for i,
$1 \le i \le m/2$, such that the total number of 1's in
the cut sites is maximized. Note that we can assume
without loss of generality that m is even since otherwise,
we can remove the middle site, that is, the site
(m + 1)/2, and the problem remains unchanged.
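As a minimal sketch of the exclusive BFC objective just defined, the following evaluates the total number of 1's in the chosen cut sites for a given flip and cut assignment, and drops the middle site when m is odd. The function names and the boolean encoding of the cut choice (`cut_at_i[j]` is True when site j+1 is the cut and False when its conjugate is) are assumptions made for illustration.

```python
from typing import List, Sequence


def drop_middle_site(molecules: Sequence[Sequence[int]]) -> List[List[int]]:
    """If m is odd, remove site (m+1)/2 so that m becomes even."""
    m = len(molecules[0])
    if m % 2 == 0:
        return [list(mol) for mol in molecules]
    mid = m // 2  # 0-based index of the 1-based site (m+1)/2
    return [list(mol[:mid]) + list(mol[mid + 1:]) for mol in molecules]


def exclusive_bfc_score(molecules: Sequence[Sequence[int]],
                        flips: Sequence[bool],
                        cut_at_i: Sequence[bool]) -> int:
    """Total number of 1's in the cut sites.

    flips[k]    -- True if molecule k is reversed.
    cut_at_i[j] -- True if site j+1 is the cut, False if its conjugate is.
    """
    m = len(molecules[0])
    total = 0
    for mol, flipped in zip(molecules, flips):
        row = mol[::-1] if flipped else mol
        for j, at_i in enumerate(cut_at_i):
            col = j if at_i else m - 1 - j  # 0-based column of the chosen cut
            total += row[col]
    return total
```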
Step 1. In the elimination step, we remove two types
of sites (in conjugate pairs) from consideration. First,
we remove those sites i and $\bar{i}$ that have fewer than $n r_l$
1's each; here $r_l$ (say, 1/50) is a parameter we set
from the knowledge of the error parameters in the experimental
setup. We look upon these as sites where
there is no underlying cut, but some molecules display
the cut owing to false positive errors. Second,
we remove all those sites i and $\bar{i}$ where the sum of
the number of 1's in them exceeds $n r_u$, for a parameter
$r_u$ (say, 1/10), again set from the knowledge of
the error parameters in the experimental setup. We
look upon these as sites where there are cuts at both i and
$\bar{i}$. Now we hypothesize that the remaining sites have
the property that precisely one of i or its conjugate
$\bar{i}$ will be a cut, and that reduces the problem to the
exclusive BFC problem above.
We remark that the description above is only conceptual
and that implementation details differ. For
instance, we do not explicitly set $r_l$ a priori. We consider
the sites in order of decreasing "suitability" for
the exclusive BFC problem and we discard trailing
sites, which has the effect we state above. Also, the
precise values for the r's have to be carefully set to filter
the two types of site. For instance, assume all
the molecules are parallel and i is a cut while $\bar{i}$ is not.
Then, i has several 1's as determined by the false negative
errors and $\bar{i}$ has a few 1's due to false positive
errors. But in reality the molecules are not all parallel
and a substantial fraction of them are in a flip state.
In that case, the 1's in their site i appear as 1's on $\bar{i}$ in
the input. Thus a cut site might have a number of 1's
anywhere between the one we expect from false positive
rates and that from false negative rates because
of molecule flips. For this reason, we found that it was
effective to set the thresholds in terms of not merely
the number of 1's in the columns, but also in terms
of the 1's common to a column and its conjugate.
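The conceptual elimination step can be sketched as a simple filter over conjugate pairs. This is a literal reading of the two rules as stated, not the authors' implementation, which by their own remark differs (for instance, it also uses the 1's common to a column and its conjugate, which is not modeled here). The function name `eliminate_pairs` and the parameter names `r_l` and `r_u` are illustrative.

```python
from typing import List, Sequence, Tuple


def eliminate_pairs(molecules: Sequence[Sequence[int]],
                    r_l: float = 1 / 50,
                    r_u: float = 1 / 10) -> List[Tuple[int, int]]:
    """Return the surviving conjugate pairs (i, conjugate of i), 1-based.

    A pair is dropped if both columns have fewer than n*r_l 1's
    (treated as false positives only), or if the two columns together
    have more than n*r_u 1's (treated as cuts at both sites).
    """
    n = len(molecules)
    m = len(molecules[0])
    ones = [sum(mol[j] for mol in molecules) for j in range(m)]
    kept = []
    for j in range(m // 2):
        jbar = m - 1 - j
        a, b = ones[j], ones[jbar]
        if a < r_l * n and b < r_l * n:
            continue  # first type: no underlying cut at either site
        if a + b > r_u * n:
            continue  # second type: cuts at both sites
        kept.append((j + 1, jbar + 1))
    return kept
```

The surviving pairs are the ones hypothesized to contain exactly one cut, which is the input to the exclusive BFC problem.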
Step 2. This is the technical crux. We make some
observations about the structure in the exclusive BFC
problem.
Theorem 1 Given an assignment of flips to the
molecules, we can determine the assignment of cuts
at i or at $\bar{i}$ (but not both) for each i, such that the
number of 1's in the consensus cuts is maximum, in
O(nm) time in all. Similarly, given an assignment
of exclusive consensus cuts amongst i and $\bar{i}$ for each
i, $1 \le i \le m/2$, we can determine an assignment of
flips to the molecules, so that the number of 1's in the
consensus cuts is maximum, again in O(nm) time.
We omit the algorithms for both the parts of the
theorem above; in both cases, a simple greedy approach
works. That theorem is useful since any solution we
find for the exclusive BFC problem may be postprocessed
by retaining one of the two sets of answers
(namely the flip assignment or the cut assignment)
and optimizing for the other on that basis and thereby
hope to improve local non-optimal solutions.
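The paper omits the two greedy procedures behind Theorem 1. One plausible reading, offered here only as a sketch under that assumption, is: with flips fixed, pick for each conjugate pair the column with more 1's; with cuts fixed, orient each molecule whichever way puts more of its 1's on the cut sites. Both run in O(nm) time. The function names and the boolean encoding of cuts are hypothetical.

```python
from typing import List, Sequence


def best_cuts_given_flips(molecules: Sequence[Sequence[int]],
                          flips: Sequence[bool]) -> List[bool]:
    """For each pair, True means site j+1 is the cut, False its conjugate."""
    m = len(molecules[0])
    oriented = [mol[::-1] if f else mol for mol, f in zip(molecules, flips)]
    ones = [sum(row[j] for row in oriented) for j in range(m)]
    # Ties are broken in favour of the site itself (an arbitrary choice).
    return [ones[j] >= ones[m - 1 - j] for j in range(m // 2)]


def best_flips_given_cuts(molecules: Sequence[Sequence[int]],
                          cut_at_i: Sequence[bool]) -> List[bool]:
    """For each molecule, True means it should be reversed."""
    m = len(molecules[0])
    cut_cols = [j if at_i else m - 1 - j for j, at_i in enumerate(cut_at_i)]
    flips = []
    for mol in molecules:
        as_is = sum(mol[c] for c in cut_cols)
        # After reversal, column c shows what was at column m-1-c.
        reversed_score = sum(mol[m - 1 - c] for c in cut_cols)
        flips.append(reversed_score > as_is)
    return flips
```

Alternating the two calls on an existing solution is one way to do the post-processing described above, since each call can only keep or increase the number of 1's on the cut sites.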
In what follows, we present a theoretical approximation
algorithm for the problem, and a simple
heuristic that is highly effective in practice.
2.1 Exclusive BFC Problem: Theoretical Solution
Here we provide a polynomial time approximation
scheme (PTAS) for the exclusive binary flip-cut problem.
For simplicity, we consider only the case n =
O(m) and leave the general scenario for the final version.
Theorem 2 For any fixed $\epsilon < 1$, there is a polynomial
time algorithm that finds flips and cuts for the
exclusive BFC problem with total weight at least $1 - \epsilon$
of the maximum weight.
Proof. We only show the sketch. We formulate a
quadratic optimization problem from the given flip-cut
problem. Let $Y_i$ be the indicator variable for site
i, $1 \le i \le m/2$. Then $Y_i = 1$ if i is a cut and it is
0 otherwise (that is, $\bar{i}$ is a cut). Let $X_i$ be the indicator
variable for the molecule i. Then $X_i = 1$ implies
it appears as-is, that is, without being flipped;
$X_i = 0$ implies it is flipped from the input. Also, $M_{ij}$
is the site j in molecule i and $\bar{M}_{ij} = M_{i(m+1-j)}$. The
quadratic optimization problem is:
$$\max \sum_{i=1}^{n} \sum_{j=1}^{m/2} \Big[ Y_j \big( M_{ij} X_i + (1 - X_i) \bar{M}_{ij} \big) + (1 - Y_j) \big( X_i \bar{M}_{ij} + (1 - X_i) M_{ij} \big) \Big]$$
$$Y_j \in \{0, 1\}; \quad X_i \in \{0, 1\}$$
Rearranging terms, the objective function becomes:
$$\max \sum_{i=1}^{n} \sum_{j=1}^{m/2} \Big[ Y_j X_i (M_{ij} - \bar{M}_{ij}) + Y_j \bar{M}_{ij} - Y_j X_i (\bar{M}_{ij} - M_{ij}) - Y_j M_{ij} + X_i (\bar{M}_{ij} - M_{ij}) + M_{ij} \Big]$$
Collecting terms, the objective function becomes:
$$\max \sum_{i=1}^{n} \sum_{j=1}^{m/2} \Big[ 2 Y_j X_i (M_{ij} - \bar{M}_{ij}) + Y_j (\bar{M}_{ij} - M_{ij}) + X_i (\bar{M}_{ij} - M_{ij}) + M_{ij} \Big]$$
Let W' be the maximum solution for the above.
We claim without providing the proof that we can
now use recent techniques due to Arora et al. [4] to
conclude the following: for every $\epsilon < 1$, there is a
polynomial time algorithm which solves the above and
returns a solution W such that $W \ge W' - \epsilon n^2$.
We also claim a lower bound on W', namely
$W' = \Omega(n^2)$. This is because, consider the flip which
provides the optimal value W'. For each i, clearly we
can choose the column i or its conjugate whichever
has more l’s than the other. That way W’ 2 t?
where the first term on the right is the number of cuts
and the second term is the lower bound on the average
of the number of 1's in a conjugate pair of cuts.
Since $r_l$ is a constant, it follows that $W' = \Omega(n^2)$.
Combining both, we get a PTAS for the exclusive
BFC problem. □
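The algebra in the proof sketch can be checked mechanically. Since every quantity in a single summand is binary, it suffices to compare the original per-term objective with the collected form over all 16 assignments of $Y_j$, $X_i$, $M_{ij}$ and $\bar{M}_{ij}$. The function names below are illustrative.

```python
from itertools import product


def original_term(y: int, x: int, m: int, mbar: int) -> int:
    """Summand of the objective as first stated."""
    return y * (m * x + (1 - x) * mbar) + (1 - y) * (x * mbar + (1 - x) * m)


def collected_term(y: int, x: int, m: int, mbar: int) -> int:
    """Summand after collecting terms."""
    return 2 * y * x * (m - mbar) + y * (mbar - m) + x * (mbar - m) + m


def identity_holds() -> bool:
    """True if both forms agree on every binary assignment."""
    return all(original_term(*v) == collected_term(*v)
               for v in product((0, 1), repeat=4))
```

For example, with $Y_j = 1$ and $X_i = 1$ both forms reduce to $M_{ij}$, matching the intended meaning: site j is the cut and the unflipped molecule contributes its own entry at j.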
2.2 Exclusive BFC Problem: Practical Solution
In this section, we describe an extremely simple algorithm
for the exclusive BFC problem. We do not
guarantee any approximate or exact performance for
this algorithm (except that it is a 1/2 approximation,
but that is trivial). However as our experimental results
show, this algorithm is remarkably accurate in
predicting the consensus cuts on both synthetic data
and on real data.
In what follows, we specify the main ingredients
of the algorithm. Our algorithm resembles a greedy
algorithm at the high level. However, there are two
orthogonal manners to be greedy about (namely by
fixing the consensus cuts or by fixing molecule flips).
We try to attain as many 1's as possible in candidate
cut sites by greedily flipping the molecules appropriately
to the extent we can. However, as we accumulate
cut sites, existing molecule orders hinder procuring
additional potential 1’s. In that case, we reverse
the flips of the molecules selectively before proceeding
further. This may be thought of as limited backtracking
on the choice of molecule flips. Although one can
envisage situations where more levels of backtracking
will improve the solution (we can construct such data
sets easily), our experimental experience suggests that
such a limited backtracking suffices. We also experimented
with incorporating limited backtracking on
the cut sites, and although there are cases where it
improves the performance, our strong experimental
intuition is that it is not crucial.
In what follows, we only provide the sketch of the
algorithm. We experimented with a number of variations
of this algorithm differing in details and the
experimental results based on those variations and
their comparisons will be presented in the full version
of this paper. In this paper, we present experimental
results based only on an implementation that is
closely related to the description below.
Short Algorithm Sketch. We first calculate for
each site, its potential for getting 1's. That is, for
each site i, $1 \le i \le m/2$, we calculate $C_i$, the maximum
number of 1's that can be made to align at
that site by flipping the molecules. That is,
$C_i = |\{ j \mid M_{ji} = 1 \text{ OR } \bar{M}_{ji} = 1 \}|$.
Then we consider sites
in the decreasing order of their potential. For this,
we sort the $C_i$'s in decreasing order and process them
in that order, for each deciding whether i or its conjugate
$\bar{i}$ should be designated as a cut site. Say we
have processed j of these. For each molecule i, we
keep $dif_i$, which is the number of cut sites it has a 1
in, minus the number of the conjugates of these cut
sites it has a 1 in. For a subset of the molecules,
we would have assigned flip directions (called touched
molecules) and others are untouched. All molecules i
with flip directions will satisfy $dif_i > 0$, and all
untouched molecules have $dif_i = 0$.
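The bookkeeping described so far (the potentials $C_i$, the processing order, and the per-molecule difference $dif_i$) can be sketched as follows. This covers only the quantities defined above, not the step that adds the next site or the limited backtracking. The function names are hypothetical, and breaking ties in favour of the smaller site index is an assumption.

```python
from typing import List, Sequence


def site_potentials(molecules: Sequence[Sequence[int]]) -> List[int]:
    """C_i for i = 1..m/2: molecules with a 1 at site i or at its conjugate."""
    m = len(molecules[0])
    return [sum(1 for mol in molecules if mol[j] == 1 or mol[m - 1 - j] == 1)
            for j in range(m // 2)]


def processing_order(potentials: Sequence[int]) -> List[int]:
    """1-based site indices sorted by decreasing potential.

    sorted() is stable, so ties keep the smaller index first.
    """
    return [j + 1 for j in sorted(range(len(potentials)),
                                  key=lambda j: -potentials[j])]


def dif(molecule: Sequence[int], cut_cols: Sequence[int]) -> int:
    """Number of cut sites with a 1 minus number of their conjugates with a 1.

    cut_cols holds 0-based columns of the cut sites chosen so far, and
    molecule is taken in its current orientation.
    """
    m = len(molecule)
    return sum(molecule[c] - molecule[m - 1 - c] for c in cut_cols)
```

A negative `dif` for a molecule indicates that reversing it would place more of its 1's on the cut sites chosen so far.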
Now we show how we add the (j+1)th of the sites
 