Schrattenkalk in Kairo
14/03/08

This post is part of a series. Please also check out the other posts:
Part 1: What is the Web 2.0
Part 2: The challenge
Part 3: Inverse footnotes
Part 4: The exculpation of Wikis

Let us assume for now that availability is not a problem, an issue I have not addressed yet. Let us also assume that some of the ideas sketched in the previous posts have come true. That means that we have an enormous amount of information available. As a result we might actually look back fondly on the good old days, when we did not have that much information. OK, to be fair: if the availability and accessibility of knowledge, as well as the possibility that knowledge is compiled and made available, are no longer problems, this would save us an enormous amount of time, time we could invest in reading and evaluating more information. However, I assume that the amount of information we would gain would outstrip the additional time by far. Therefore relevance becomes a central issue. Solving the relevance issue might be the most difficult of all, and I can only present some very vague ideas on where the possibilities could lie. I think we have to approach the question of relevance from two sides. On one side is the question of what information you can put into a system to get a better evaluation of relevance. On the other side is the question of which systems for evaluating relevance could be available. The two questions are, as I will show later, interlinked in a certain way. I will look at the second question first, because this is where my Web 2.0 analogy might work to a certain degree. The first question implies a rather user / client side software solution, which I will therefore look at later.

If a majority of texts in the humanities were scanned and made available to a relevance evaluator, this could enable a number of solutions, most centrally what could be dubbed the googlification of the humanities. Lawrence Lessig says that Google Book Search googlified books, because it enables the user to search within books as in web pages. More importantly, in the humanities a simple relevance rating could be based on the same idea as the Google search, or the citation index in the natural sciences for that matter. The more a web page is linked to, or the more a book is referenced, the more relevant it is. Now Google goes a step further and uses a formula in which this is one of many ingredients. The precise formula is secret in the case of Google, but for the humanities the formula would have to look a little different anyway. First, when looking for a number of terms, the search system should consider those hits more relevant which have one or more of the search terms in the title or the relevant chapter title. Secondly, references from books which also include my search terms should be valued higher than other references. Thirdly, new books should be valued higher, because they naturally have fewer references to them, as books are not as flexible as the web. And so on. The precise formula might look different in various subjects and fields. Different individuals might prefer different sites. However, the search terms themselves might look completely different, as I will argue in the second part of this post.
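To make the three ingredients a little more concrete, here is a minimal sketch in Python. It is not Google's formula and not a finished proposal. The `Book` record, the weights and the ten-year recency window are all assumptions chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Book:
    title: str
    year: int
    text_terms: set                               # terms found in the book (hypothetical index)
    cited_by: list = field(default_factory=list)  # keys of the books that reference this one


def relevance(key, books, query, current_year=2008,
              w_title=2.0, w_topical_ref=1.0, w_other_ref=0.25, w_recency=0.1):
    """Toy relevance score; every weight here is an arbitrary assumption."""
    book = books[key]
    query = {term.lower() for term in query}

    # 1. Search terms that appear in the title count extra.
    title_hits = len(query & set(book.title.lower().split()))

    # 2. A reference from a book that itself contains a search term
    #    is worth more than a reference from an unrelated book.
    ref_score = 0.0
    for citing_key in book.cited_by:
        if query & books[citing_key].text_terms:
            ref_score += w_topical_ref
        else:
            ref_score += w_other_ref

    # 3. New books get a bonus that fades out over ten years,
    #    because they have had less time to collect references.
    age = max(current_year - book.year, 0)
    recency = w_recency * max(10 - age, 0)

    return w_title * title_hits + ref_score + recency


books = {
    "a": Book("Economy of Liberia", 2005, {"liberia", "economy"}, cited_by=["b", "c"]),
    "b": Book("West African Trade", 1990, {"liberia", "trade"}),
    "c": Book("Medieval Poetry", 1980, {"poetry"}),
}
ranking = sorted(books, key=lambda k: relevance(k, books, ["Liberia", "economy"]), reverse=True)
```

In this toy data set book "a" comes first: both search terms are in its title, one of its two references comes from a book that also mentions Liberia, and it is recent. How the weights should really be set would differ between subjects and fields.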

Nevertheless, this is for the moment only a dream. The more or less complete digitisation of literature in the humanities is not even in sight. Therefore other systems might become more relevant. Such a system might be based on user generated tagging and rating. I take these two things together because I believe they only make sense in the humanities when combined in one system. Ratings first became relevant with Amazon, which uses them to suggest other books liked by people who also liked the books you rated highly. Because of this user generated data, Amazon is sometimes considered a very early Web 2.0 company. The second concept was made more widely known by del.icio.us, a service which is practically unknown outside of the Web 2.0 community. They based their idea on the observation that bookmarking has practically failed, as it is impossible to keep a reasonable structure for bookmarks. Therefore you do not put a bookmark into a tree-like structure but connect it to a number of terms of your choice. Based on this, del.icio.us can show you sites similar to a specific site you give it. Amazon tried tagging for books, but it basically failed.

My idea is that you can tag a number of books and articles as belonging to one project of yours. Then you can rate these as part of the project. The ratings are not global, but dependent on that project. Furthermore, you can give texts general tags and descriptions, which are included in the overall picture. This information is saved in a central database. This system can to a certain degree replace a repository of digitised texts, as it not only gathers written information on texts, such as comments and tags, but also puts them into the context of different projects. Combined with those texts which are actually digitised, or were made available digitally in the first place, this could supply enough information for a basic rating of relevance as described above. The formula would look a little different, though. It would take ratings into account, weighted according to the percentage of books in a given project which include at least one of the search terms. Gathering all this information would be a win in any case, as it would make relevance judgements more sound even once all books are finally digitised.
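The weighting by project could be sketched roughly as follows. This is only one possible reading of the idea. The data layout (`projects` as a mapping from project name to ratings, `text_terms` as a mapping from text to its tags) and the choice to simply add up the weighted ratings are assumptions for the sake of the example.

```python
from collections import defaultdict


def project_weight(project_texts, text_terms, query):
    """Share of a project's texts that carry at least one of the search terms."""
    if not project_texts:
        return 0.0
    matching = sum(1 for text_id in project_texts
                   if query & text_terms.get(text_id, set()))
    return matching / len(project_texts)


def weighted_ratings(projects, text_terms, query):
    """Add up each text's project ratings, each weighted by how well
    the rating project as a whole matches the search terms."""
    query = set(query)
    scores = defaultdict(float)
    for ratings in projects.values():
        weight = project_weight(list(ratings), text_terms, query)
        if weight == 0:
            continue  # a project unrelated to the query says nothing here
        for text_id, rating in ratings.items():
            scores[text_id] += weight * rating
    return dict(scores)


# Hypothetical data: tags per text and ratings per project.
text_terms = {
    "t1": {"liberia", "economy"},
    "t2": {"liberia"},
    "t3": {"poetry"},
}
projects = {
    "liberia_thesis": {"t1": 5, "t2": 4},  # every text matches: weight 1.0
    "mixed_reading": {"t1": 1, "t3": 5},   # half of the texts match: weight 0.5
    "poetry_seminar": {"t3": 4},           # nothing matches: ignored
}
scores = weighted_ratings(projects, text_terms, ["liberia"])
```

A rating from a project that is entirely about the search terms counts fully, a rating from a mixed project counts half, and a project with no connection to the query is left out.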

But I think the main issue is not the way the results are actually weighted or what information is available to be searched, but rather the information we provide for the search. If we are looking for a website, using one or two search terms is natural. But the idea of using only one search term in the humanities is basically a result of the paper indices we all used to have in our libraries. However, when two different people look for “Liberia” and “economy”, there will most probably be two different most relevant search results. To actually find these results, or, to be more correct, to sort the results accordingly, we need to put more information into our search. In the humanities we have a big advantage over normal search. We know, theoretically, a lot more precisely what a person is looking for, and that additional information is, again theoretically, machine readable. We can simply reuse the same system of tagging and rating. When working on a project we can include texts in it and rate them according to their perceived relevance. When we finally start working on a written output, the system can actually take over part of this rating, as we will quote or reference a more relevant text more often than a less relevant one. Through this we can extend our search terms by giving them a certain context. As a result, the relevance ratings for one text will be different for two people, or even for the same person in two different contexts, that is, projects. The interesting thing here is that with all the time you put into your text and your references, with every text you quote, the system learns more about your project and can give you better and more detailed recommendations for other texts.
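How a project could act as context for a search might be sketched like this. Again, this is only an illustration under assumptions. The profile built from tags, the way rating and citation count are simply added, and all names and data are made up for the example.

```python
from collections import Counter


def project_profile(project, text_tags):
    """Build a tag profile from the texts in a project.

    project maps text_id -> (rating, times_cited); a text counts more
    the higher it is rated and the more often it is quoted in the draft.
    """
    profile = Counter()
    for text_id, (rating, times_cited) in project.items():
        weight = rating + times_cited  # assumption: both signals simply add up
        for tag in text_tags.get(text_id, ()):
            profile[tag] += weight
    return profile


def rank(candidates, query, profile, text_tags):
    """Keep the texts that match the query, then sort them by how well
    their remaining tags fit the project profile."""
    query = set(query)
    scored = []
    for text_id in candidates:
        tags = set(text_tags.get(text_id, ()))
        if not query & tags:
            continue
        context = sum(profile[tag] for tag in tags - query)
        scored.append((context, text_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [text_id for _, text_id in scored]


# Hypothetical tags and two hypothetical projects.
text_tags = {
    "rubber_history": {"liberia", "economy", "rubber", "firestone"},
    "shipping_registry": {"liberia", "economy", "shipping", "flags"},
    "own_1": {"rubber", "plantations"},
    "own_2": {"shipping", "maritime law"},
}
plantation_project = {"own_1": (5, 3)}  # rated 5, quoted three times
shipping_project = {"own_2": (4, 1)}    # rated 4, quoted once

query = ["liberia", "economy"]
candidates = ["rubber_history", "shipping_registry"]
for_plantations = rank(candidates, query, project_profile(plantation_project, text_tags), text_tags)
for_shipping = rank(candidates, query, project_profile(shipping_project, text_tags), text_tags)
```

The same two search terms give a different order for the two projects: the text on rubber comes first for the plantation project, the text on the shipping registry comes first for the shipping project. Every additional rated or quoted text changes the profile and with it the order.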

A system like this should be based on a program you can also use to manage your literature and footnotes, such as an extended version of Delicious Library or EndNote. That way the system would be self-contained. Users would enter their data out of their own interest in organising their projects. Thanks to that, they would also get better search results and relevance ratings. The system also has the advantage that it includes texts that are not digitised. Therefore there is no effect like the one online, where “a website which is not on Google does not exist”. But of course these relevance ratings are still only an aid. They will never replace the judgement of the researcher. This would, so to speak, be a step towards a Computer Assisted Humanities.

However, my major problem with these proposals is their complexity. I do not really see any simple, lightweight solution which could take over certain tasks, at least in an intermediary situation. That of course means that the risk of actually reaching the point of having too many texts and too little help with their relevance is quite big. If you have any good ideas, please do not hesitate to propose them in the comments.
