This document describes the key technical concepts in the Karmasphere system.
Clients make reputation queries to Karmasphere about one or more identities (e.g. IP addresses, domain names, URLs). Along with identity information, queries also include a context parameter called a feedset. The feedset contains multiple feeds and rules used by the system to provide a query response.
The response contains a reputation score that can be used to make a decision.
Opinions about identities are provided by publishers as feeds. Karmasphere continually updates those feeds and puts their information into a database. Our special replication technology distributes that data to multiple query servers. Those query servers answer queries from clients over the network.
From the score returned, you can then make a decision: if your respected sources (collapsed into a feedset) say that this source is bad, you probably want to drop the mail.
Of course, this can also go the other way. If the verdict is “Fifty-one sources think that mail from source.com is worth reading,” you can let this message bypass your usual first level (or more) of filtering.
Now that we've established how the system works, let's go into some more detail about specific concepts.
An identity is something that allows you, the receiver of a communication, to determine who is attempting to contact you.
Identities include things like IP addresses, domain names, or URLs. They can also include SIP addresses, SMS or IM names, etc. Karmasphere can support any identity type. Initially, it focuses on, and is populated mainly with, the following:
A query is issued by a Karmasphere client in response to the receipt of something containing an identity -- in our examples, an email.
In general, a query exchange consists of a question and an answer. The question contains:
The answer, or response, contains:
Once the client has received the score generated by the feedset or combination of feedsets you have queried, you can use that score to make a decision about the email (or other communication) you have received. We generally recommend that you drop very low-scoring mail, accept high-scoring mail with minimal further processing, and send mid-scoring mail through your current screening process.
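As an illustration only, the exchange can be pictured as two small records. The sketch below uses just the fields this document mentions (identities and a feedset in the question, a score in the answer). The class names, field names, identity type labels, feedset name, and the stand-in server are all hypothetical; this is not the Karmasphere wire format or client API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Query:
    # Each identity is a (type, value) pair; the type labels are illustrative.
    identities: List[Tuple[str, str]]
    # Name of the feedset that gives the query its context (hypothetical name).
    feedset: str


@dataclass(frozen=True)
class Response:
    # Numeric verdict for the whole query.
    score: int


def fake_query_server(query: Query) -> Response:
    """Stand-in for a query server, backed by made-up data."""
    known = {("domain", "source.com"): 510}
    for identity in query.identities:
        if identity in known:
            return Response(score=known[identity])
    # No feed had an opinion about any identity in the query.
    return Response(score=0)


question = Query(
    identities=[("ip4", "192.0.2.1"), ("domain", "source.com")],
    feedset="example.feedset",
)
answer = fake_query_server(question)
print(answer.score)
```

A real client would send the question over the network to a query server rather than call a local function.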
A feed is a set of opinions from a publisher. It is the functional equivalent of an rbldnsd zone file: a long list of IP addresses, domain names, or other identifying information.
One of the distinguishing advantages of the Karmasphere architecture is that we can query many different feeds in parallel, very quickly, through the mechanism of a feedset. Because the system supports large numbers of feeds and feedsets, each user can customize which feeds they'd like to consult, and in what way.
Data providers think in terms of feeds, because they contribute data one feed at a time. If you're more interested in querying data than contributing it, you'll probably think in terms of feedsets.
A large and growing number of data sources exist and are published through Karmasphere. Combined in particular ways, they provide high-quality, high-confidence scores. A particular combination of feeds is called a feedset. A feedset gathers multiple feeds together and uses rules to determine how the responses from those feeds are to be interpreted. We have defined a small number of starting-point feedsets, which Karmasphere maintains. Since we are always on the lookout for new and interesting data sources, the exact composition of feeds in those feedsets may evolve over time.
Advanced users can, of course, define their own feedsets. Many different types of feeds can coexist within a single feedset. Do not be surprised to see a feedset containing both an IP blacklist and a domain name whitelist.
Query responses include a numeric verdict, or score, which falls into the range -1000 to 1000. Zero means that none of the constituent feeds had an opinion about any of the identities in the query. Scores from +300 to +1000 typically mean that the identity is trusted by the opinion sources you have chosen to query; scores from -300 to -1000 mean the opposite. If you see a score between -300 and +300, this generally means either that your opinion sources had conflicting opinions, or that the sources that did have opinions were not weighted very strongly.
While we generally recommend that you drop low scores, process high ones lightly, and use secondary filtering on intermediate scores, users are welcome to make their own decisions about what to do with different scores and where to set the limits for dropping, accepting, and filtering.
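The score bands and the recommendation above can be sketched as a small decision function. This is a minimal sketch, not part of any Karmasphere client: the function name and parameters are hypothetical, the default limits simply reuse the ±300 bands, and treating a score of exactly ±300 as inside the outer bands is an assumption.

```python
def decide(score: int, drop_at: int = -300, accept_at: int = 300) -> str:
    """Map a score in [-1000, 1000] to 'drop', 'accept', or 'filter'."""
    if not -1000 <= score <= 1000:
        raise ValueError(f"score out of range: {score}")
    if score <= drop_at:
        return "drop"      # sources you trust consider the identity bad
    if score >= accept_at:
        return "accept"    # sources you trust consider the identity good
    return "filter"        # no opinion, conflicting, or weakly weighted


for s in (-800, 0, 650):
    print(s, decide(s))
```

Passing different `drop_at` and `accept_at` values is how a user would move the limits, for example `decide(-100, drop_at=-50)` to drop more aggressively.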
Most feeds are, conceptually, a static file that contains facts or opinions held by the publisher and that is updated regularly. These feeds are stored, ready to answer queries from clients or command-line or web-based tools.
Sometimes, though, a feed needs to be dynamically evaluated. For example, “Does this domain have SPF records set up?” These sorts of questions cannot be answered by a table lookup; they involve live queries. We thus introduce the idea of a virtual feed: virtual feeds run code for each query and perform some computation to return a result. SPF is an example of a virtual feed.
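The difference between the two kinds of feed can be sketched as follows. This is an illustration under stated assumptions: the class names are hypothetical, opinions are modelled as +1 or -1 with `None` for "no opinion", and the SPF-style check is a stub with made-up data rather than a live DNS query.

```python
from typing import Callable, Dict, Optional

Opinion = Optional[int]  # +1 good, -1 bad, None when the feed has no opinion


class StaticFeed:
    """Answers by table lookup against data loaded ahead of time."""

    def __init__(self, entries: Dict[str, int]) -> None:
        self._entries = dict(entries)

    def lookup(self, identity: str) -> Opinion:
        return self._entries.get(identity)


class VirtualFeed:
    """Answers by running code for every query."""

    def __init__(self, compute: Callable[[str], Opinion]) -> None:
        self._compute = compute

    def lookup(self, identity: str) -> Opinion:
        return self._compute(identity)


def spf_check_stub(domain: str) -> Opinion:
    # Placeholder for a live check; a real virtual feed would query DNS here.
    domains_with_spf = {"example.com"}  # made-up data
    return 1 if domain in domains_with_spf else None


blacklist = StaticFeed({"192.0.2.1": -1})
spf = VirtualFeed(spf_check_stub)

print(blacklist.lookup("192.0.2.1"))
print(spf.lookup("example.com"))
print(spf.lookup("example.org"))
```

Because both classes expose the same `lookup` method, code that consults a feed does not need to know which kind it is talking to.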
If a feedset containing 42 feeds offers 41 “no” opinions and one “yes” opinion, how do we merge those into a score?
There are, obviously, several possible ways to collapse multiple opinions into a single score. Instead of hardcoding just one way, we use a combiner. A combiner is an algorithm that turns opinions into scores.
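To see why the choice of combiner matters, here is a sketch of two toy combiners applied to the 41-to-1 example above. Neither is claimed to be a combiner that Karmasphere ships. Both function names are hypothetical, and encoding "yes" as +1 and "no" as -1 is an assumption made for the illustration.

```python
from typing import List


def average_combiner(opinions: List[int]) -> int:
    """Scale the mean opinion into the range -1000..1000."""
    if not opinions:
        return 0
    return round(1000 * sum(opinions) / len(opinions))


def any_yes_combiner(opinions: List[int]) -> int:
    """A single 'yes' outweighs any number of 'no' opinions."""
    if not opinions:
        return 0
    return 1000 if any(o > 0 for o in opinions) else -1000


opinions = [-1] * 41 + [1]  # 41 "no" opinions and one "yes"

print(average_combiner(opinions))  # -952
print(any_yes_combiner(opinions))  # 1000
```

The same 42 opinions produce a strongly negative score under one combiner and the maximum positive score under the other, which is why the algorithm is pluggable rather than hardcoded.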
When queried, each feed can return a positive or negative opinion. That opinion is an input to a rule in the FeedsetCombiner. Each rule defines an action to take based on the input; for example, you can say things like:
"Good" and "bad" in this context are generally indicated by the numeric value of the score involved. High scores are considered to be "good," and low ones "bad."
Rules are generally evaluated in the order in which they are encountered, that is, top to bottom in the list. This means that feedsets should contain their most trusted feeds at the top (because those are more likely to produce "stop immediately" kinds of results) and less trusted feeds, several of which might need to hit at once to create an actionable verdict, at the bottom. This allows the system to produce good results even faster if your more trusted feeds have opinions about the identity you are querying.
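Top-to-bottom evaluation with an early stop might look like the sketch below. It is a simplified illustration, not the FeedsetCombiner itself: the `Rule` fields, the feed names, the weights, and the choice to return only the stopping rule's contribution are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rule:
    feed: str           # name of the feed whose opinion this rule reads
    weight: int         # size of the contribution when the feed has an opinion
    stop: bool = False  # if True, a hit ends evaluation immediately


def clamp(score: int) -> int:
    """Keep the score inside the range -1000..1000."""
    return max(-1000, min(1000, score))


def combine(rules: List[Rule], opinions: Dict[str, int]) -> int:
    """Evaluate rules in list order; opinions map feed name to +1 or -1."""
    total = 0
    for rule in rules:
        opinion = opinions.get(rule.feed, 0)
        if opinion == 0:
            continue  # this feed had no opinion
        contribution = rule.weight if opinion > 0 else -rule.weight
        if rule.stop:
            return clamp(contribution)  # trusted feed: stop immediately
        total += contribution
    return clamp(total)


rules = [
    Rule("trusted.whitelist", 1000, stop=True),  # most trusted feed first
    Rule("small.blacklist.a", 200),
    Rule("small.blacklist.b", 200),
]

print(combine(rules, {"trusted.whitelist": 1, "small.blacklist.a": -1}))   # 1000
print(combine(rules, {"small.blacklist.a": -1, "small.blacklist.b": -1}))  # -400
```

In the first call the trusted feed settles the verdict before the lower rules are consulted; in the second, two weaker feeds have to agree before the score crosses into an actionable band.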
A client is any component that issues queries to, and receives responses from, a Karmasphere query server. Karmasphere provides a number of plugins that turn popular products into clients. If you're interested in participating, you can join our open source project or work with us to help develop new clients or other tools.
Third-party developers may turn their product into a Karmasphere client. We offer client libraries in several languages to make this easy.