Historical story

Computer puzzles manuscripts back together

The Cairo Geniza – a collection of Jewish manuscripts – provides a unique glimpse into history between 950 and 1250 AD. Unfortunately, the leaves are scattered in museums and libraries all over the world. Researchers are now trying to bring the fragments back together using a computer.

Discovered around 1800, they are now scattered all over the world: manuscript fragments from the geniza (storage room) of a synagogue in Cairo, Egypt. Because documents in a geniza are normally burned over time, these surviving manuscripts are very special. The Cairo Geniza (as the collection is called) provides a unique glimpse into history between 950 and 1250 AD.

Unfortunately, it is not easy for scientists to study the documents, because they are stored in different libraries. The largest collection of fragments (about 193,000 of the 280,000 pieces) is in Cambridge (England), but there are also large collections in New York (US) and Manchester (England). Fortunately, more and more fragments are being digitized. One problem remains, however: which fragments belong together and make up a manuscript?

With the computer

Researchers from Tel Aviv University (Israel) and the Friedberg Genizah Project have developed a system that can identify so-called joins: groups of fragments that come from the same document. Using image processing techniques, the system analyzes a collection of scanned pages and, on that basis, assesses for each pair whether two fragments belong together.

What makes the analysis difficult, among other things, is that the scans were not made with automatic analysis in mind. The background is not always the same, the fragments are not necessarily straight, sometimes a ruler is placed in the picture, and so on. The photo must therefore be edited before measurements can be taken. You can see that in the left image above: the system first selects the fragment in the photo, straightens it and turns it into a black-and-white image (so that the computer can work with it quickly).
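The researchers' own code is not shown here, but a minimal sketch of such a preprocessing step, written in Python with OpenCV, could look like the example below. The file name, the Otsu thresholding and the "largest contour" trick are assumptions made purely for illustration; the straightening itself is discussed further on.

```python
import cv2

def preprocess(photo_path):
    """Roughly isolate a fragment from a scan and binarize it (sketch only)."""
    img = cv2.imread(photo_path, cv2.IMREAD_GRAYSCALE)

    # Separate the fragment from the background with an automatic (Otsu) threshold.
    _, mask = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Take the largest connected region as the fragment and crop to it
    # (OpenCV 4 returns contours and hierarchy).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    fragment = img[y:y + h, x:x + w]

    # Binarize the crop so the computer only has to deal with black and white.
    _, bw = cv2.threshold(fragment, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return bw
```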

Where are the straight lines?

One of the steps in the analysis is determining the orientation of the lines: is the text straight or slightly skewed, and by how much? For this, the system uses the Hough transform of the image, a commonly used technique for finding the straight lines in an image.

To create the Hough transform, it is first determined for each pixel on which straight lines it could lie (see illustration below).

The possible lines can be described with the formula x·cos(t) + y·sin(t) = R, where R is the length of the normal from the origin to the line in question, and t is the angle between that normal and the x-axis. Based on this, you can make a list of R/t combinations for each pixel in the image, where each combination represents a particular line on which the point may lie. If you plot that list (t on the x-axis and R on the y-axis), you get, for each pixel, a series of points that you can connect. This plot, with a curve for each pixel of the image, is called the Hough transform.

The Hough transform maps the straight lines in the photo. A white spot in the plot indicates that many pixels fit a certain R/t combination. In other words, those pixels lie on the same line. And since there are a lot of pixels, it is probably a line that is also clearly visible in the photo.
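As a sketch of how such a Hough transform can be computed (not the researchers' actual implementation): for every text pixel and every candidate angle t, calculate R and count how often each R/t combination occurs. In Python with NumPy, assuming the text pixels are the non-zero ones, that counting could look like this:

```python
import numpy as np

def hough_transform(bw, n_angles=180):
    """Hough accumulator of a binary image: rows are R values, columns are angles t."""
    ys, xs = np.nonzero(bw)                               # coordinates of the text pixels
    thetas = np.deg2rad(np.linspace(-90, 90, n_angles, endpoint=False))
    diag = int(np.ceil(np.hypot(*bw.shape)))              # largest possible |R|
    accumulator = np.zeros((2 * diag + 1, n_angles), dtype=np.int32)

    for i, t in enumerate(thetas):
        # R = x*cos(t) + y*sin(t) for every pixel, shifted so indices stay positive.
        rs = np.round(xs * np.cos(t) + ys * np.sin(t)).astype(int) + diag
        accumulator[:, i] = np.bincount(rs, minlength=2 * diag + 1)
    return accumulator, thetas
```

A bright spot in this accumulator then corresponds to an R/t combination shared by many pixels, exactly as described above.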

Reading it right

The photos of the Cairo Geniza do not contain real straight lines, but the pixels of the letters in a line of text do all lie roughly on one line. You can see this in the Hough transform (see below): if you look closely, you will see ten separate lines at -90° and +90°; they correspond to the ten lines of text that run horizontally on the sheet.

The computer can calculate where those clear lines are, because that is at the t where the variance is highest. This is how the system determines how the lines of text lie on the paper: if, for example, the variance is highest at t = 45°, then the text is rotated at an angle of 45°.
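Building on the hough_transform sketch above, picking the orientation with the highest variance is then only a few lines; again an illustration, not the researchers' code:

```python
import numpy as np

def text_angle(accumulator, thetas):
    """Angle (in degrees) at which the R values vary most, i.e. where
    the text pixels pile up on a few distinct lines."""
    best = int(np.argmax(accumulator.var(axis=0)))   # variance per angle column
    return float(np.rad2deg(thetas[best]))
```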

From text to numbers

The orientation of the text matters, because the system makes a projection profile of the text. For this, the pixels are summed per row and per column, horizontally and vertically (see image below). If you create this profile without taking the rotation of the text into account, the result will not be correct.
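A projection profile itself is little more than summing the rows and columns of the rotation-corrected binary image. A minimal sketch in Python with NumPy, assuming text pixels are non-zero, including a rough line count as a preview of the measurements described next:

```python
import numpy as np

def projection_profiles(bw):
    """Horizontal profile (ink per row) and vertical profile (ink per column)."""
    horizontal = (bw > 0).sum(axis=1)
    vertical = (bw > 0).sum(axis=0)
    return horizontal, vertical

def count_lines(horizontal, threshold=0):
    """Rough number of text lines: runs of rows with more ink than the threshold."""
    ink = horizontal > threshold
    # a new line starts wherever an ink row follows a non-ink row
    return int(ink[0]) + int(np.count_nonzero(ink[1:] & ~ink[:-1]))
```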

Based on the profile, the system measures a number of characteristics of the text, such as the number of lines, the line spacing and the height of a line. These are the "physical measurements" in the diagram at the beginning of this article. For handwriting analysis, the system also detects the keypoints of the image: points in the fragment that stand out. It uses the SIFT technique for this (see box).
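SIFT keypoints can be computed with standard libraries; a minimal example with OpenCV, where the file name is only a placeholder:

```python
import cv2

img = cv2.imread("fragment.png", cv2.IMREAD_GRAYSCALE)

# SIFT detector as shipped with recent OpenCV releases.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

# Each keypoint comes with a 128-number descriptor of its surroundings.
print(len(keypoints), "keypoints found")
```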

The physical measurements and the keypoints are really nothing but numbers. The manuscript fragment is thus translated into a row of values, called the feature vector. A computer can handle this much more easily than a picture.

Teaching

Now we go back to the original goal: determining whether two fragments belong to the same document. To do this, the system looks at the feature vectors of the two pieces. The more similar they are, the more likely it is that the texts come from one document. In that case they will have approximately the same font size, line spacing and/or keypoints. But how do you know how similar two feature vectors are, or rather, how does the computer know? That is, in fact, a matter of learning.

In the system there is a classifier, a (mathematical) program that can determine, for an input object such as a feature vector, to which group it belongs. That is, given a manuscript fragment, the classifier determines to which document it belongs. To do this, the program must know how to assess an object: when does something belong to group A (document A) and when does it not? The classifier learns that from a training set, a collection of fragments of which you know which ones belong together. With that information, the classifier learns what distinguishes one group from another. For example, in the figure below you can see that, based on the size of the petal, you can tell what type of iris you are dealing with.
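The iris example from the figure is also the classic introductory exercise for classifiers. A minimal sketch with scikit-learn (unrelated to the researchers' system) shows the idea of a training set and a classifier:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Feature vectors (four measurements per flower) plus the known group labels.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# The training set teaches the classifier what distinguishes the groups ...
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# ... after which it can place new, unseen feature vectors into a group.
print("accuracy on unseen flowers:", clf.score(X_test, y_test))
```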

New pairs

The researchers made a training set for the Cairo Geniza with well-known joins: pairs of fragments that definitely belong together. This taught the classifier to assess when there is a join. When the researchers then fed it new pairs of fragments, the classifier could say whether or not they were a join.
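How the classifier combines the two feature vectors of a pair is not described here, but one common approach is to train it on the difference between the two vectors together with a join/no-join label. A toy sketch with made-up data, purely to illustrate that idea:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def pair_features(f1, f2):
    """Describe a pair of fragments by the absolute difference of their feature vectors."""
    return np.abs(f1 - f2)

# Stand-in data: random feature vectors for 200 "fragments". In reality the
# training pairs would come from the known joins in the Cairo Geniza.
fragments = rng.normal(size=(200, 5))
joins     = [pair_features(f, f + rng.normal(scale=0.05, size=5)) for f in fragments[:100]]
non_joins = [pair_features(fragments[i], fragments[i + 100]) for i in range(100)]

X = np.vstack(joins + non_joins)
y = np.array([1] * len(joins) + [0] * len(non_joins))     # 1 = join, 0 = no join

clf = RandomForestClassifier(random_state=0).fit(X, y)

# A new pair of fragments can now be judged: join or no join?
print(bool(clf.predict([pair_features(fragments[0], fragments[1])])[0]))
```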

The results were mixed. In a test on the collection of a single institute, the system was correct in eighty percent of cases. However, a test was also done with fragments from different collections, the situation for which the system is especially useful (so that researchers do not have to travel back and forth). Here the system came up with nine thousand possible joins, the top two thousand of which were manually inspected. Only twenty-four percent of the detected joins turned out to be correct.

Despite the somewhat disappointing results, the study still delivered about a thousand new joins. That is quite a lot compared to the few thousand that experts have found so far. However, the system cannot yet function without manual checks; the recognition score is too low for that. But it is a nice addition and a step in the right direction.