MaS is about computer security, malware and spam issues in general.


2008/10/08


I was reading a New York Times article titled "Agency’s ’04 Rule Let Banks Pile Up New Debt". It is a pretty damning article on the SEC, describing a quiet decision to allow investment banks to take on more debt than previously allowed, under the assumption that the banks could manage their risk better with their newfangled computer models. This allowed Bear Stearns (R.I.P.) to raise its leverage ratio to 33:1, which seems extraordinarily high. Anyway, while reading it I stumbled over this paragraph:
A lone dissenter — a software consultant and expert on risk management — weighed in from Indiana with a two-page letter to warn the commission that the move was a grave mistake. He never heard back from Washington.
The software consultant was Leonard D. Bole, of Valparaiso, Ind., and he was expressing doubts that computer models could protect companies, given that they had failed to do so in the collapse of a hedge fund in 1998 and the market plunge in 1987. While I have my doubts that any computer model can calculate risk well enough, and increasing the allowed leverage ratios certainly seems just plain daft, I think the current credit crisis now comes down to trust. Or the lack of it.
So, if it is a trust problem, how would a computer scientist approach it? First of all, I need to point out that trust is really a human issue, so there is a limit to how much computers can help, just as I doubt we can model risk. However, one of the problems is that a certain share of mortgages carry too much risk, and banks don't know their own exact exposure, let alone that of their competitors. The result is that no one trusts anyone else, and the capital market has suffered a form of seizure or heart attack.
A couple of years ago I was leading a project exploring Data Centric Security, and as part of my research I looked into provenance. We never had time to weave it into the model properly, but we identified it as an important aspect that would eventually need to be included. But wait. What is provenance?
Take paper. Paper documents have great provenance. You fill out a form, hand it in. It gets handled, gets coffee stains over it, stapled to other documents, stamped, filed, refiled, etc. By examining a paper document you get a feeling for where that document has been and what it went through. That is provenance. 
Unfortunately, electronic documents don't have provenance out of the box. Luckily, there has been some research into how provenance can be added. The project I was exposed to at IBM Research was the EU Provenance Project, part of the European Commission's Sixth Framework Programme, bless their cotton socks. Their proposed architecture, if I remember correctly, was to place hooks in document processing which record document use (the CRUD operations: create, read, update, delete). Though I'm not sure that is the way I would have done it, it would certainly work unless someone cheated or didn't implement the hooks, though I assume that would be uncovered the next time the provenance recording system saw the document.
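To make that concrete, here is a rough sketch of what such a hook might record. This is my own toy illustration, not the EU Provenance Project's actual architecture or API; the class and file names are invented, and a real system would need a tamper-evident store rather than a plain log file.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.util.Date;

// Hypothetical sketch: a document-processing hook appends one provenance
// record per CRUD operation to an append-only log.
public class ProvenanceLogger {

    public enum Operation { CREATE, READ, UPDATE, DELETE }

    private final String logPath;

    public ProvenanceLogger(String logPath) {
        this.logPath = logPath;
    }

    // Called by the document-processing system whenever a document is touched.
    public void record(String documentId, Operation op, String actor) throws IOException {
        String entry = String.format("%s\t%s\t%s\t%s%n", new Date(), documentId, op, actor);
        try (FileWriter out = new FileWriter(logPath, true)) {  // append mode
            out.write(entry);
        }
    }

    public static void main(String[] args) throws IOException {
        ProvenanceLogger log = new ProvenanceLogger("provenance.log");
        log.record("mortgage-4711", Operation.CREATE, "originating-bank");
        log.record("mortgage-4711", Operation.UPDATE, "repackaging-desk");
    }
}
```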
How would provenance help in the credit crisis? If we just isolate the problem of sub-prime mortgages (and my brother, who knows much more about the financial industry, assures me that there is a whole pile of other problems), it does look like a provenance problem to me. From my perspective as an outsider, what seemed to be happening was that these sub-prime mortgages were being sold, repackaged with other debt, sold again, and so on. In the end, the last one in the chain didn't know what he or she was actually getting. The lack of provenance of these aggregate debt packages meant it wasn't possible to calculate the risk sufficiently well (a dubious exercise in itself, but made even more difficult in this case).
Remember that every financial instrument is really just a document of sorts that we attach a value to. The document has no intrinsic value. Take currency: a dollar bill has no real value. You can't eat it. It doesn't produce a lot of energy when burned. However, we place a certain amount of trust in it because the intricate design and the type of paper tell us that it comes from a trusted source: in this case the US Treasury. The provenance of the bill lets us accept that there is little risk its extrinsic value is anything other than one US dollar.
When aggregating debt from multiple sources, you need to collect the provenance of all the included debt documents. This allows you to better estimate the risk associated with the aggregate debt and also to find inconsistencies that I really, really hope don't exist, like circular provenance (which would be similar to a Ponzi scheme). It would also allow the banks to identify the bad parts of the debt and calculate their exposure, which is something they don't seem able to do at the moment. If they could, they would probably find that the bad debt they own is not as bad as it could be, and there would be less uncertainty. Amongst other things, it is the uncertainty about exposure to bad debt that has resulted in the credit crisis.
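For what it's worth, spotting circular provenance is not hard once the provenance is actually recorded. Here is a toy sketch, assuming each package simply lists the packages it was assembled from; it is a plain depth-first search, nothing specific to any real system, and the instrument names are invented.

```java
import java.util.*;

// Hypothetical sketch: detect "circular provenance" in aggregated debt packages.
// Each package maps to the packages it claims to be assembled from; a cycle in
// this graph means a package directly or indirectly contains itself.
public class ProvenanceCycleCheck {

    // Depth-first search with a recursion stack to find a back edge (a cycle).
    static boolean hasCycle(String node, Map<String, List<String>> sources,
                            Set<String> visiting, Set<String> done) {
        if (done.contains(node)) return false;
        if (!visiting.add(node)) return true;      // already on the stack: cycle
        for (String src : sources.getOrDefault(node, Collections.emptyList())) {
            if (hasCycle(src, sources, visiting, done)) return true;
        }
        visiting.remove(node);
        done.add(node);
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> sources = new HashMap<>();
        sources.put("CDO-A", Arrays.asList("MBS-1", "CDO-B"));
        sources.put("CDO-B", Arrays.asList("MBS-2", "CDO-A"));   // circular!

        boolean cycle = hasCycle("CDO-A", sources, new HashSet<>(), new HashSet<>());
        System.out.println("Circular provenance detected: " + cycle);
    }
}
```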
While not all the problems that banks are facing can be solved by computer scientists or mathematicians, and you can argue that we have been instrumental in getting us into this mess, provenance standards for financial documents would go a long way to alleviating the problems we have at the moment.

2008/06/03

Semantic Web Meetup June 1, 2008


What would you do on a day of perfect weather in New York City? Attend an all-day code camp on Semantic Web programming in Brooklyn, of course! OK, I guess I would have preferred it to be a rainy day if I had to be inside, but it was still worth it. I learned a lot about Semantic Web programming and, more importantly, realized that the technology is closer to being a reality than it was before. This is a report on what I learned.
The event was organized by Marco Neumann and hosted by Breck Baldwin of Alias-i. After bagels, which Marco had brought along, and introductions, we got a brief run-down of some of the concepts and technologies. This was followed by quick descriptions of the projects we were to tackle at the meetup. After a rather late lunch we chose our projects and had a few hours to complete them. In theory we were supposed to use the Extreme Programming paradigm, but that devolved a bit into group programming interspersed with discussion.
I don't really want to go into the projects in detail. I was interested in two of them: the Natural Language Processing project headed by Breck, our host, and a spatial reasoning project headed by Marco. The actual projects were not that important, though; the programming aspects were. I was at a disadvantage, as it turned out that Java is king when it comes to Semantic Web programming and I've been doing my programming in Ruby and Erlang for over a year. Semantic Web support for Ruby is not great, and it is practically nonexistent in Erlang.
In Java, the way to go was the Jena library. Jena started at HP but has since become a SourceForge project. It now offers support for RDF, RDFS, OWL and SPARQL. It also supports reading and writing RDF in RDF/XML, N3, N-Triple and, I believe, Turtle. There was some discussion of the strengths and weaknesses of these formats. The rough consensus was that N3 and N-Triple are more human readable, but RDF/XML is more expressive, at least from a syntactic standpoint. It wasn't clear to me whether there is any semantic difference. In the NLP project, Jena was used to emit RDF, initially in N3 format, though that was quickly changed to RDF/XML. Once that was done for a subset of the data, a SPARQL query was hacked together (again using Jena) that used that file. All in all, it didn't require much real code, though given that it was Java there was all sorts of fluff surrounding it.
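The meetup code isn't mine to reproduce, but a toy along the same lines looks roughly like this, using the Jena 2.x API of the time (the com.hp.hpl.jena packages; later Apache releases moved to org.apache.jena) and made-up URIs.

```java
import com.hp.hpl.jena.rdf.model.*;
import com.hp.hpl.jena.query.*;

// Minimal Jena sketch: build a tiny RDF model, serialize it, then query it.
public class JenaSketch {
    public static void main(String[] args) {
        String ns = "http://example.org/meetup#";        // made-up namespace

        // Emit a couple of triples.
        Model model = ModelFactory.createDefaultModel();
        Property mentions = model.createProperty(ns, "mentions");
        model.createResource(ns + "article1")
             .addProperty(mentions, model.createResource(ns + "BearStearns"));

        // Serialize as "RDF/XML"; "N3", "N-TRIPLE" and "TURTLE" also work here.
        model.write(System.out, "RDF/XML");

        // SPARQL over the same model.
        String queryString =
            "PREFIX ex: <" + ns + "> " +
            "SELECT ?doc WHERE { ?doc ex:mentions ex:BearStearns }";
        Query query = QueryFactory.create(queryString);
        QueryExecution qexec = QueryExecutionFactory.create(query, model);
        try {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                System.out.println(results.nextSolution().get("doc"));
            }
        } finally {
            qexec.close();
        }
    }
}
```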
On a side note, one of the participants showed us some of his Groovy code, and I must say that Groovy might get me back into Java again. It's like a less wordy version of Java, or perhaps a Java that has been put on a diet by the Ruby camp. When Groovy is mentioned, I guess you have to mention Scala as well. Both seem to be taking Java beyond the confines of the actual language by leveraging the Java virtual machine and all the libraries that are available as JARs.
Apart from the programming, there were a few other things I picked up. In the past I had been using Protégé. However, apparently this is no longer the way to go. A tool called TopBraid Composer, which is built on the Eclipse platform and Jena, has usurped Protégé from its throne. Apparently it is free for non-commercial use, though that is unclear from the website, as it does say that you need to purchase a license after 30 days.
One of the other projects looked at transforming a relational database into an RDF database using D2RQ. There is a paper at the W3C that describes this idea. From what I gather, this is nearly equal to trying to derive semantics from database schemata, which is not something that can really be mechanized. There are also all sorts of performance issues that would have to be addressed if a production database were to be stored as an RDF database, but perhaps it is too early to discuss those, as we first need to understand why we need this at all. If it means that we can elevate the data in a database to the level of information, it might be worth it, though. Since there seem to be all sorts of expressivity issues when comparing traditional databases to RDF stores, perhaps the right thing would be to develop new applications based on RDF first and only then try to transform existing databases.
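To illustrate what such a mapping boils down to conceptually, here is a toy sketch in plain Jena rather than D2RQ's own API, assuming a JDBC connection to a hypothetical accounts(id, name, balance) table: one RDF resource per row, one triple per column value. D2RQ does essentially this, but driven by a declarative mapping file instead of hand-written code.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

// Conceptual sketch only (this is not the D2RQ API): walk a relational table
// and mint one RDF resource per row and one triple per column.
public class TableToRdf {

    static final String NS = "http://example.org/db#";   // invented namespace

    // Assumes a table "accounts(id, name, balance)" reachable over the given connection.
    public static Model accountsToRdf(Connection conn) throws SQLException {
        Model model = ModelFactory.createDefaultModel();
        Property name = model.createProperty(NS, "name");
        Property balance = model.createProperty(NS, "balance");

        Statement st = conn.createStatement();
        ResultSet rows = st.executeQuery("SELECT id, name, balance FROM accounts");
        while (rows.next()) {
            Resource account = model.createResource(NS + "account/" + rows.getInt("id"));
            account.addProperty(name, rows.getString("name"));
            account.addProperty(balance, rows.getString("balance"));
        }
        return model;
    }
}
```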
Another subject that came up was the difference between ABox and TBox reasoning. ABox reasoning is based on assertions about individuals (i.e., the rows of data, to use a database table analogy), whereas TBox reasoning is based on concepts (i.e., the schema of a database table).
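In Jena terms the split looks something like this; again a toy sketch with invented URIs, using Jena's ontology API.

```java
import com.hp.hpl.jena.ontology.*;
import com.hp.hpl.jena.rdf.model.ModelFactory;

// Sketch of the ABox/TBox split using Jena's ontology API (URIs are invented).
public class AboxTbox {
    public static void main(String[] args) {
        String ns = "http://example.org/finance#";
        OntModel m = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_RDFS_INF);

        // TBox: terminology -- the "schema" level.
        OntClass bank = m.createClass(ns + "Bank");
        OntClass investmentBank = m.createClass(ns + "InvestmentBank");
        investmentBank.addSuperClass(bank);              // InvestmentBank is a subclass of Bank

        // ABox: assertions about individuals -- the "row" level.
        Individual bearStearns = investmentBank.createIndividual(ns + "BearStearns");

        // With RDFS inference enabled, the reasoner derives the ABox fact
        // that BearStearns is also a Bank, so this prints true.
        System.out.println(bearStearns.hasOntClass(bank));
    }
}
```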
So, what does all this have to do with security? There are two aspects of this. 
  1. The security of the Semantic Web metadata
  2. Using Semantic Web technology to secure our data
The first aspect is certainly not a trivial one. Metadata has already caused embarrassment to many people, including Tony Blair, because people don't realize that there is more data in a typical document than (literally) meets the eye. In computer forensics, this is what we live for. However, as more webpages get semantic data attached to them, more data may be transmitted than is shown to the user, and now it can be read automatically. Privacy advocates will be all over this problem, but corporations will have to pay attention, too.
However, what I am more interested in is the use of this metadata and the technologies of the Semantic Web to define and enforce security. At IBM, this is called Data-Centric Security and, as far as I can tell, they are working on database security using taxonomies for classification. What the NLP project showed me is that, to some degree, we could also build a content-based security system at some point. Alias-i and OpenCalais might be the key.
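To sketch what I mean by content-based security: the security label would be derived from what an entity extractor finds in the document, not from where the document happens to be stored. The following is purely hypothetical; the EntityExtractor interface merely stands in for a real toolkit such as LingPipe or OpenCalais, and none of the names reflect their actual APIs.

```java
import java.util.*;

// Hypothetical sketch of content-centric security: the label of a document is
// derived from the entities found in its content. EntityExtractor stands in
// for a real NLP toolkit; nothing here is an actual vendor API.
public class ContentCentricGate {

    interface EntityExtractor {
        Set<String> entityTypes(String text);   // e.g. {"PERSON", "ACCOUNT_NUMBER"}
    }

    // Map detected entity types to clearance levels (higher = more sensitive).
    static final Map<String, Integer> SENSITIVITY = new HashMap<>();
    static {
        SENSITIVITY.put("ACCOUNT_NUMBER", 3);
        SENSITIVITY.put("PERSON", 2);
        SENSITIVITY.put("ORGANIZATION", 1);
    }

    // The document's label is the highest sensitivity among detected entities.
    static int classify(String document, EntityExtractor extractor) {
        int level = 0;
        for (String type : extractor.entityTypes(document)) {
            level = Math.max(level, SENSITIVITY.getOrDefault(type, 0));
        }
        return level;
    }

    static boolean mayRead(int userClearance, String document, EntityExtractor ex) {
        return userClearance >= classify(document, ex);
    }

    public static void main(String[] args) {
        // Toy extractor: pretend any long digit sequence is an account number.
        EntityExtractor toy = text -> text.matches(".*\\d{6,}.*")
                ? Collections.singleton("ACCOUNT_NUMBER")
                : Collections.<String>emptySet();

        System.out.println(mayRead(1, "Quarterly outlook memo", toy));         // true
        System.out.println(mayRead(1, "Transfer from account 12345678", toy)); // false
    }
}
```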
What the code camp showed me is that the technology has reached the point where it is usable. While security is almost never a business case in itself, there will be other, more motivating reasons for corporations to use semantic metadata, and that will enable ideas such as DCS.