News aggregator
More DPH
ANNOUNCE: Data.IVar 0.1
Network.URI bug? (2.2.0.0)
Creighton Hogg: Working through Genetic Programming
import Random import Control.Monad import Control.Monad.Random import Data.Ratio import Data.List (sortBy)
So I really did accomplish my reading of the first three chapters of Koza's first book on Genetic Programming the week I said I would, but I sat on this post for a good bit afterwards. There wasn't a lot of meat in those initial chapters except for a few salient points:
- Genetic programming can give you good answer if you're willing to accept that the answer is not 'correct', merely accurate and if you're willing to lay aside preconceived notions of beauty in your solutions.
- Due to the schema theorem, a genetic recombination/fitness based search can effectively cover a large swath of the solution space with a small population.
The majority of Ch. 3 of the book is devoted to Old School genetic algorithms and the schema theorem. I'm not going to bother going through fully fledged examples of the schema theorem, but I think the point can be understood fairly intuitively: when you proportionally select the fittest specimen to be in the mating pool of each generation, you are making judgements about the fitness of patterns of genes and thus gather information about not just the specimen you have but all the possible specimen that have overlapping genetic structure.
What I want to do in the rest of this post is just write my own tiny, not terribly well-featured, genetic algorithm implementation. I'm hoping that this will provide not only more practice for me, but also be instructive to others. I base the implementation more upon the heuristic description that Koza gives than on anything else and make no claim that this is any kind of optimal construction.
First, since I hope to express my code at least somewhat generically between genetic algorithms and genetic programs, I'm going to define a typeclass to capture the basic behavior required of a "genetic" structure. Also, it is worth noting that in this post I'm going to make gratuitious use of the Control.Monad.Random library just because it feels somehow appropriate to be doing these calculations in a monad.
class Genome a where mutation :: MonadRandom m => a -> m a recombination :: MonadRandom m => a -> a -> m (a,a)
Oddly enough, that didn't seem too terrible. We have no notion of how often something like mutation in the typeclass, because that should be controlled at the level of the actual implementation.
Now, though, we'll define an instance of this class for lists as that's probably the easiest thing to do. I do, however, make the implicit assumption that in a given simulation all the specimen are of the same length.
instance Random a => Genome [a] where mutation xs = do i - getRandomR (0,length xs - 1) x' - getRandom return $ replaceAt i xs x' recombination xs xs' = do i - getRandomR (1,length xs - 2) let (xs1,xs2) = splitAt i xs (xs1',xs2') = splitAt i xs' return (xs1++xs2',xs1'++xs2) replaceAt i xs x = xs1++[x]++(drop 1 xs2) where (xs1,xs2) = splitAt i xs
At this point we can start putting together the real framework for a genetic algorithm run. The first, and relatively easiest function to compute will be the one to choose the mating pool for the next generation given a fitness function and a list of specimen. I assume a constant population size for the course of the simulation, so the outgoing list will have just as many members as the in going list. We also do our selection based entirely upon the relative fitness between the specimen of the population, an idiom captured quite well in the fromList function in the Control.Monad.Random library.
chooseMatingPool :: MonadRandom m => (a -> Integer) -> [a] -> m [a] chooseMatingPool eval pop = replicateM l $ fromList weighted where weighted = map (\p -> (p, eval p % 1)) pop l = length pop
Now I'll admit that I'm limiting the fitness function a little by requiring it to return Integer values, but I honestly feel like that's a generality that I'm willing to lose for the purposes of this post and for the convenience of using fromList.
Next will be the handler that sexually recombines the mating pool. I cheat a little here and presume that since the order of the mating pool was selected randomly that I don't need to shuffle them any more when I perform the selection and thus I can just take pairs of specimen, breed them, then put them back in the list. If there's an odd man out, or as I like to call him "Creighton in High School", then we don't breed him with anyone and just put him back in the pool.
The function to implement mutation in the population is also extremely trivial so I'll throw it in below without much ado.
breed :: (MonadRandom m, Genome a) => [a] -> m [a] breed [] = return [] breed (x:[]) = return [x] breed (x:y:xs) = do (x',y') - recombination x y xs' - breed xs return $ x:y:xs' mutate :: (MonadRandom m, Genome a) => Double -> [a] -> m [a] mutate chance pop = mapM (mutate' chance) pop where mutate' c a = do result - getRandomR (0,1) if result c then mutation a else return a
Alright, so the next minor hurdle is going to be generating the initial population we need. Now, my cheating with lists has painted us into a slight corner where we can't be as generic as we'd like. The problem is that there are two conflicting imperatives at work here. First, that lists are a very convenient way to deal with sets of genes. Second, that the lists in a single simulation must all be of the same size. The result is that I can't write any generic instance such as
instance (Random a) => Random [a] where
...
useful in all instances, since I want to be able to use different size lists, or arrays for that matter, for different simulations. What would be the slickest solution? Depedently typed arrays with the size as a part of the type. However, in a world without dependent types we can still write something that works even if the lack of clean semantics annoys me.
In any case we can now start wrapping the whole thing up a bit more into larger control structures. I think these are fairly straight forward, but the main function is runSimulation. It takes in the number of runs to execute, the performance threshold for declaring an early end, the probability of mutation, the evaluation function, and the initial population. That's a lot of parameters!
randomList :: (Random a,MonadRandom m) => Int -> m [a] randomList i = liftM (take i) getRandoms initialListPopulation :: (Random a, MonadRandom m) => Int -> Int -> m [[a]] initialListPopulation s l = sequence $ replicate s $ randomList l simulationStep :: (MonadRandom m, Genome a) => Double -> (a -> Integer) -> [a] -> m [a] simulationStep mut_chance eval pop = chooseMatingPool eval pop >>= breed >>= mutate mut_chance maxInPopulation :: (a -> Integer) -> [a] -> (a,Integer) maxInPopulation f = head . reverse . sortBy (\(_,w) (_,w') -> compare w w') . map (\p -> (p,f p)) runSimulation :: (MonadRandom m, Genome a) => Int -> Integer -> Double -> (a -> Integer) -> [a] -> m a runSimulation 0 _ _ eval pop = return $ fst $ maxInPopulation eval pop runSimulation runs threshold mut_chance eval pop = do if snd (maxInPopulation eval pop) >= threshold then return $ fst (maxInPopulation eval pop) else do newpop - simulationStep mut_chance eval pop let finalpop = takeBest eval pop newpop runSimulation (runs-1) threshold mut_chance eval finalpop takeBest :: (a -> Integer) -> [a] -> [a] -> [a] takeBest eval xs xs' = take (length xs) . reverse . sortBy (\p p' -> compare (eval p) (eval p')) $ xs++xs'
Now we have enough framework in place that we can introduce an actual, executable, example. In accordance with a tutorial I saw once, we'll evolve the string of all 'True's as a quick test.
evalOnes :: [Bool] -> Integer evalOnes = foldr (\b s -> let v=if b then 1 else 0 in v+s) 0 main = do p - initialListPopulation 50 20 final - runSimulation 200 20 0.01 evalOnes p print final
I've done some preliminary testing and found that for this size of problem it still quite often finds the list of 20 Trues within the 200 generations. I'm sure if I played a bit more with how the population was chosen I could get it to be much more optimal, but I plan to worry more about that when I tweak my code to accommodate monotypic genetic programming instead of just fixed length genetic algorithms.
Random Thoughts and Idle Speculation
So first off, I found this code extremely easy to write. Now that I'm pretty much using Haskell for all of my personal projects and toys, I've come to appreciate just how effortless writing code starts to feel. It's just a language that stays out of my way when I'm trying to figure out a problem, and that's about the best compliment I think I can give a tool.
Also, as I've continued reading Koza's book I'm starting to wonder about what exactly makes genetic programming so much more powerful than straight up genetic algorithsm. I'm guessing it really comes down to the richness and varying size of the structure being evolved rather than the fact that you're evolving a parse tree. There's some ideas tickling at the back of my head about what about qualities a data structure, a functor, needs to satisify to be "evolvable". Hopefully I'll be able to clarify some thoughts about that as I continue reading the book.
Well-Typed.Com: Some ideas for the Future of Cabal
I presented a “Tech Talk” today at Galois on some ideas relating to Cabal
- Slides (pdf)
We discussed two topics. Here’s the abstract:
A language for build systemsBuild systems are easy to start but hard to get right. We’ll take the view of a language designer and look at where our current tools fall down in terms of safety/correctness and expressiveness.
We’ll then consider some very early ideas about what a build system language should look like and what properties it should have. Currently this takes the form of a design for a build DSL embedded in Haskell.
Constraint solving problems in package deploymentWe are all familiar, at least peripherally, with package systems. Every Linux distribution has a notion of packages and most have high level tools to automate the installation of packages and all their dependencies. What is not immediately obvious is that the problem of resolving a consistent set of dependencies is hard, indeed it is NP-complete. It is possible to encode 3-SAT or Sudoku as a query on a specially crafted package repository.
We will look at this problem in a bit more detail and ask if the right approach might be to apply our knowledge about constraint solving rather than the current ad-hoc solvers that most real systems use. My hope is to provoke a discussion about the problem.
I’ve covered similar ground to the first topic before in a previous blog post on make. This time we looked in a bit more detail at what a solution might look like and what properties it should have. In particular we discussed what the correctness properties of a build system might look like.
Johan Jeuring: Finding palindromes
When I started as a PhD student in Computer Science in 1988, my professor gave the following problem assignment to me. Construct an algorithm that, when given a string, finds the longest palindrome substring occurring in the string. For example, given the string "yabadabadoo", the algorithm should return the substring "abadaba". The algorithm should not only return the longest palindrome substring, it should also return it fast. The requirement was that the algorithm should only use a number of steps linear in the length of the argument string. This was a difficult problem, which gave me severe headaches in the months to follow. I spent four months on the problem, and solved it. The one comment I got from my professor was "The best thing about this problem and its solution is that they have absolutely no practical relevance."
He was wrong.
Since 1988 I have found that palindromes play an important role in the world around us. Palindromes appear in music, in genomes, in art, architecture and design, mathematics, physics, chemistry, etc.
Particularly interesting is the fact that palindromes occur in the male genome: of the about 20,000,000 characters representing the male-specific region of the Y chromosome, 5,700,000 characters form together eight large palindromes (off by a couple of characters). I don't think there is any proven explanation for the presence of palindromes in the male genome, but I suspect it has something to do with repairing genomes. If you know more about it: please let me know.
As you may or may not have noticed when working on repairing Endo's DNA, the cloud is corrupted. It turns out that the cloud's genome also is a palindrome (off by a couple of characters), and that the corruption can be repaired by taking the mirror image of the palindrome. To do so, you have to find the palindrome. In the case of Endo, this is pretty easy: `duolc' follows cloud immediately in the symbol table. Had this information not been in the symbol table, you would have to find this palindrome in the DNA by means of an algorithm for finding palindromes. Actually, the DNA for the cloud is only 6,500 acids long, and using a naive quadratic-time algorithm for finding palindromes is not a real problem if you have a sufficiently fast machine. A quadratic-time algorithm for determining palindromes in human male DNA isn't going to work: it would take a long time to find the palindromes.
This blog message explains how to find palindrome substrings in linear time. I will develop a program in Haskell for finding exact palindromes. The problem of finding approximate palindromes is left as an exercise to the reader :-) Interestingly, the algorithms for finding palindromes developed by computational biologists generally use suffix trees, and both space and time consumption seem to be worse compared with the algorithm I give in this blog message. But descriptions of the algorithm given here have been published in the eighties (by Galil and others, described in RAM-code) and nineties (by me, described in Squiggol, which is probably even harder to decypher than RAM-code) of the previous century, which is a long time ago of course.
This blog message is a literate Haskell file, which can be interpreted with ghci.
The type of a function for finding palindromes
I want to find the longest palindrome substring in a string. The first attempt at a type for a function longestPalindrome is hence:
longestPalindrome :: String -> String
To determine whether or not a string is a palindrome, I have to compare characters at different positions in a list. It would be helpful if I had random access into the string. For that purpose, I'm going to change the input type of the longestPalindrome function to an array. Furthermore, the longestPalindrome algorithm can also be used to find the longest palindrome in a list of integers, or any other type on which I can define an equality function. So here is the second attempt at a type for a function for finding the longest palindrome:
longestPalindrome :: Eq a => Array Int a -> Array Int a
To find the longest Palindrome in an array, I have to calculate the longest palindrome around each position of the array, where a position in an array is either on an element, or before or after an element. For example, the array corresponding to [1,2,3] has 7 positions: before the 1, on the 1, before the 2, ..., until after the 3. So I want to express the function longestPalindrome in terms of a function longestPalindromes of type
longestPalindromes :: Eq a => Array Int a -> [Int]
where the result list is the list of the lengths of the longest palindrome around each position in the argument array. For example,
?longestPalindromes (arrayList (0,2) [1,2,2])
[0,1,0,1,2,1,0]
I will omit the definition of longestPalindrome in terms of longestpalindromes, and define function longestPalindromes below.
A naive algorithm for finding palindromes
A naive algorithm for finding palindromes calculates all positions of an input array, and for each of those positions, calculates the length of the longest palindrome around that position.
module Palindromes where import Data.Array longestPalindromesQ :: Eq a => Array Int a -> [Int] longestPalindromesQ a = let (afirst,alast) = bounds a positions = [0 .. 2*(alast-afirst+1)] in map (lengthPalindromeAround a) positions
Function lengthPalindromeAround takes an array and a position, and calculates the length of the longest palindrome around that position.
lengthPalindromeAround :: Eq a => Array Int a -> Int -> Int lengthPalindromeAround a position | even position = extendPalindromeAround (afirst+pos-1) (afirst+pos) | odd position = extendPalindromeAround (afirst+pos-1) (afirst+pos+1) where pos = div position 2 (afirst,alast) = bounds a extendPalindromeAround start end = if start 0 || end > alast-afirst || a!start /= a!end then end-start-1 else extendPalindromeAround (start-1) (end+1)
For each position, this function may take an amount of steps linear in the length of the array, so this is a quadratic-time algorithm.
A linear-time algorithm for finding palindromes
I now describe a linear-time algorithm for finding palindromes. Although the program is only about 15 lines long, it is rather intricate. I guess that you need to experiment a bit with to find out how and why it works.
The algorithm processes the input array from left to right. It maintains the length of the current longest tail palindrome, and a list of lengths of longest palindromes around positions before the center of the current longest tail palindrome, in reverse order. Function longestPalindromes is expressed in terms of function extendTail.
longestPalindromes :: Eq a => Array Int a -> [Int] longestPalindromes a = let (afirst,alast) = bounds a in extendTail a afirst 0 []
Function extendTail takes an array as argument, the current position in the array, the length of the current longest tail, and a list of lengths of longest palindromes around positions before the center of the current longest tail palindrome, in reverse order. There are four cases to be considered in function extendTail. If the current position is after the end of the array, we only have to add the final palindromes in the array to the list of longest palindromes, for which we use the function finalCentres. If the current longest tail palindrome extends to the start of the array, we extend the list of lengths of longest palindromes by means of function extendCentres. If the element at the current position in the array equals the element before the longest tail palindrome we extend the longest tail palindrome. If these elements are not equal, we extend the list of length of longest palindromes.
extendTail a n currentTail centres | n > alast = -- reached the end of the array finalCentres currentTail centres (currentTail:centres) | n-currentTail == afirst = -- the current longest tail palindrome -- extends to the start of the array extendCentres a n (currentTail:centres) centres currentTail | a!n == a!(n-currentTail-1) = -- the current longest tail palindrome -- can be extended extendTail a (n+1) (currentTail+2) centres | otherwise = -- the current longest tail palindrome -- cannot be extended extendCentres a n (currentTail:centres) centres currentTail where (afirst,alast) = bounds a
Function extendCentres adds palindromes to the list of lengths of longest palindromes. It takes the array as argument, the current position in the array, the list of palindromes to be extended, and the list of palindromes around centres before the centre of the current longest palindrome tail. It uses the mirror property of palindromes to calculate longest palindromes around centres after the centre of the current longest palindrome tail. If the last centre is on the last element, we call extendTail with a longest tail palindrome of length 1, and the position moved tot he right. If the previous element in the centre list reaches exactly to the end of the last tail palindrome, use the mirror property of palindromes to find the longest tail palindrome. In the other cases, we've found the longest palindrome around a centre, and add that to the list of length of longest palindromes. We proceed by moving the centres one position, and calling extendCentres again.
extendCentres a n centres tcentres centreDistance | centreDistance == 0 = -- the last centre is on the last element: -- try to extend the tail of length 1 extendTail a (n+1) 1 centres | centreDistance-1 == head tcentres = -- the previous element in the centre list -- reaches exactly to the end of the last -- tail palindrome use the mirror property -- of palindromes to find the longest tail -- palindrome extendTail a n (head tcentres) centres | otherwise = -- move the centres one step -- add the length of the longest palindrome -- to the centres extendCentres a n (min (head tcentres) (centreDistance-1):centres) (tail tcentres) (centreDistance-1)
Function finalCentres calculates the lengths of the longest palindromes around the centres that come after the centre of the longest tail palindrome of the array. These palindromes are obtained by using the mirror property of palindromes.
finalCentres 0 tcentres centres = centres finalCentres (n+1) tcentres centres = finalCentres n (tail tcentres) (min (head tcentres) n:centres)
At each step in this algorithm, either the longest tail palindrome is extended, and the current position in the array is moved, or the list of lengths of longest palindromes is extended. Since both the array, and the list of lengths of longest palindromes are linear in the length of the array, this is a linear-time algorithm.
Control.Exception
Type classes question
ghc-6.10.0.20081005 binary refers to (non existent?) libedit.so.0
DAMP 2009 CFP reminder & revised full paper deadline
Hejlsberg and Steele: Concurrency and Language Design
A nice viedo interview of Anders Hejlsberg and Guy Steele on Concurrency and Language Design at the JAOO conference. Nothing too technically deep, but the interview does manage to crystalize some of the high level issues that face language designers.
Speaking of interviews, the Lisp50 Conference at OOPLSA 2008 will have what should prove interesting - Alan Kay interviewing John McCarthy.