Hi Folks;
Waaaay back in 2007 I wrote about coming up with a new development model for dealing with todays multicore systems and larger than life problems. The stories are here, here, here and here. As of here I outlined a problem that would fit our needs in developing such a system. It was a problem that should tax a single-processor system to the point where it would be relatively inefficient to resolve yet had a tangible result. The first step was to create an non-efficient yet runnable program to illustrate the problem and act as a baseline. That is what this article represents. You can read the original article or just take in the outline below:
1. There is nothing totally original in nature or created by man.
2. Anything that *is* considered unique could be decomposed until the uniqueness was gone and essentially all elements of the unique creation could be found in another ‘unique’ creation.
3. In short, any supposedly unique creation was simply the re-assembling of other non-unique elements.
To prove this, I would take two supposedly unique items, lets call them target and source. The source would be some bit of digital work that is unique yet not copyrighted, hence in the public domain. In this case, I am using a JPG of my house:
Not much to look at but will serve as our source. The target is a file that is supposedly completely different than the source. In this case, I am using a picture of my Epi Riviera guitar:
The idea then is that even thought the two look nothing like each other, decompose the target picture into smaller and smaller byte strings and find those strings in the source. Once found, the offset and length of the byte string would be recorded to what I am calling a recipe file. The proof of success is then to take the source and the recipe file and completely (and faithfully) reconstruct the target file from scratch.
The application provided here, what I am calling Bitmimic does exactly that. It does a brute-force decomposition of the target file and finds the byte strings that match in the source file, recording the results in the recipe file. Note that it is intentional that no attempts at optimization of either the recipe file or the searching. The project linked at the end of the file will contain all of the source along with the two JPGs for testing. Feel free to test this with your own digital media. The JPG test will execute relatively quickly (10m for the deconstruction/recipe generation and seconds for the reconstruction); this was done to give you a feel that the system works. However to give the larger concurrent project something real to chew on, the Epi picture will take the place of the source and the target will be a 30 minute video, something like 190M in size. On a relatively recent machine it took 2 days to deconstruct an episode of a 30 minute show and something like 10m to reconstruct it.
About the software:
Bitmimic is written in basic C++ and constructed with CMake which should allow it to build under Windows, Mac and Linux. I do use getopt to grab the command line args but that should be the only platform-specific bit to the system. The source and the target and memmapped and so are treated like memory blocks. As mentioned, there has intentionally been no effort put into optmimization of any kind. Since this problem was designed to serve as a test subject for a larger system for problem-solving I needed a worst-case baseline to use as a foundation. The idea is to use the larger problem-solving mechanism to split the work up between processes and even machines on a network to resolve the recipe file in record time.
To make the system just:
cmake .
and then
make
To use the program to break up the Epi picture using the house picture you would:
jeff@jeffhpdev:~/dev/bitmimic$ ./bitmimic -a -r epirecipe.dat -d ./house.JPG -s ./epismall1.jpg
Parameters were:
-a = analyze (as opposed to build)
-r = name of recipe file to generate
-d = data file file to use
-s = source file to analyze
Two other command line args I could have used were:
-l = Open this text file and read each file listed therein. The system would spot check each data file listed and select the one with the longest byte-strings from the source. This is good if you want to check a list of JPGs (in this case) to the find ones that have the longest naturally occurring matches. The result is a smaller recipe file.
-D = root directory of files listed in the file with the -l option. For example, all my jpg candidates were in a directory called home/jeff/data; the library file referenced with the -l option was simply the output of ls *.jpg ../library.dat. The whole line would then be:
./bitmimic -a -d <path to library file> -l -D /home/jeff/data/ -r recipe.dat -s epismall1.jpg.
This would iterate through all files listed in the file library.dat, pick 5 spots to check in the source file and see how long the discovered byte-strings were in each file. The file with the largest score would be used as the actual data file.
To then reconstruct the original file you would:
jeff@jeffhpdev:~/dev/bitmimic$ ./bitmimic -b -d ./house.JPG -t epinew.jpg -r epirecipe.dat
-b means to build -t from recipe in -r using -d as a data source.
And there you have it. A few notes about this:
1. In the example the target file (epismall1.jpg) is 54K. The data file of the house is 1.5M.
2. In this exact scenario the analysis bit took just over 8 minutes to process.
3. The generated recipe file is 212K.
Challenges to the system:
There are two primary things that the larger system shall attempt to resolve:
1. Recipe file size. Right now it is just a text of offsets and byte ranges. LOTS could be done to mitigate this, many having nothing to do with parallel processing such as making it a binary file, smarter recording of repeating byte-patterns and so on. However one of the things that could seriously help all of this and still be a result of massive CPU power is finding longer byte strings. Most are in the 2-3 byte range which means more entries and a longer recipe file. By having multiple processes searching on different data files for the longer byte strings could help with this. This may however be at odds with the second thing the larger system will try to resolve:
2. Faster execution. If smaller recipe files were not a concern, having process/machine A break up the search and spread it amongst machines/processes B-n it would obviously accomplish the overall search many times faster.
All of this will require some specialized IPC along with some serious thought which I am applying to the problem right now and will be the subject of future articles. As a side note however, I have just moved to a serious HOT climate and am finding it hard to do more than a few hours per day since our AC is borked. It will happen, just on my own schedule. No one is paying me for this so it gets done when it gets done.
All of that said, it is interesting being able to recreate any piece of data from any other piece of data when you have none of the original data available (in it’s original form that is). I have used that same JPG of my guitar to generate everything from other JPGs to hour-long videos and everything inbetween. However as a computationally intensive problem this is perfect for the larger system which I hope to present soon as it develops/the weather cools/I get the air fixed in our home. It’s going to be a great ride though.
The code here is simple but I encourage people to experiment with it; it will be much much more when it “grows up”.
The code for the sample is here.
Me to you,
JeffC




Related Articles
No user responded in this post
Leave A Reply