Farmer README ============= * Purpose ------- Farmer is an algorithm for discovering frequent queries in multirelational databases. It provides an efficient upgrade of the traditional Apriori frequent item set algorithm to multiple relations. * More information ---------------- More information about Farmer can be found in the following publications: 1. Nijssen, S., Kok, J.N.: Faster Association Rules for Multiple Relations. Proceedings of the International Joint Conference on Artifical Intelligence, 2001, pages 891-897. 2. Nijssen, S., Kok, J.N.: Efficient Frequent Query Discovery in Farmer. Proceedings of the European conference on Principles of Knowledge Discovery and Data Mining, 2003, pages 350-362. 3. Nijssen, S., Kok, J.N.: Proper Refinement of Datalog Clauses using Primary Keys. Proceedings of the Belgium-Netherlands Artificial Intelligence Conference, 2003, pages 227-234. Furthermore information is available from the website http://www.liacs.nl/home/snijssen/farmer The current implementation of Farmer implements all the features mentioned in [2] and [3]. In comparison with the algorithm detailed in [2], Farmer now contains a much better evaluation algorithm which focuses on conflict elamination. In many situations the current implementation of Farmer is two to three times more efficient. * License ------- Farmer is distributed under the GPL. If you use Farmer in a scientific publication, we would appreciate it if you add a reference to publication [2]. of the above list. * Bugs ---- Farmer has been tested on the Carcinogenesis database; its results are consistent with those in other publications. Furthermore, several experiments on hand-made small databases were performed. On larger examples, results were checked by evaluation queries using the GNU Prolog compiler. Still, please report any bug that you encounter. Known bugs (or missing features) are: - Farmer does not deal correctly with the following situation: predicate(p(T,C,C)). mode(p(+,-,+)). The third argument will never "use" the variable introduced by the second argument. There is no mode which can introduce an atom with a new variable that is used in the atom itself. - Farmer does not always deal correctly with the following situation: predicate(e(G,N,N,L)). mode(e(+1,+2,+3,#)). b. mode(e(1,3,2,#)). e. The first mode must be followed by an application of the second, but the atom introduced by the second can be equal to the atom introduced by the first atom. * Authors ------- Siegfried Nijssen (snijssen@liacs.nl) is the main author of Farmer. * Installing ---------- To compile Farmer on any UNIX system, just type 'make' in the directory in which this file resides. * Quick Start ----------- Type ./farmer carci to run Farmer on the PTE Carcinogenesis dataset. * Input of Farmer --------------- Farmer should always be run with one parameter; this parameter indicates the input file of the algorithm. Two kinds of input files are allowed: - a file with the following layout: FILELIST filename1 f filename2 f ... filename3 f If the file starts with the special word FILELIST, Farmer will read the files with names filename1, filename2, ..., filenamen, in that order. FILELIST files can be used in stead of concatenating all files filename1, ..., filenamen in one large file, and inputting that file to Farmer. - a file with, in this order, the following sections: * parameter and bias specification * database facts This file may not start with the keyword FILELIST. The bias is specified using facts of built-in predicates. As soon as a fact is encountered which is not of a built-in predicate, the data specification section starts. Built-in predicates may no longer be used in the data specification section. The data specification section may only contain facts for the predicates defined in the bias. The support for intensional background knowledge is very limited; only "externmode"s can be used in the bias to define intensional background knowledge. The next section lists all built-in predicates. * Built-in predicates ------------------- - "predicate(p(T1,...,Tn))" "predicate" introduces a predicate "p" of which the argument types are, in order, "T1", ..., "Tn". For each predicate, "predicate" may only be used once. - "predicatemax(p,n)" "predicatemax" defines that a predicate "p" may not occur more than "n" number of times in a single query. Farmer will only consider queries with at most "n" atoms of predicate "p". For each predicate, "predicatemax" may only be used once. - "pkey(p(K1,...,Kn))" "pkey" introduces a primary key for predicate "p". All those arguments of "p" for which "Ki" is not equal to "_" (underscore) are considered to be part of the primary key that is defined by the "pkey" statement. For each predicate, "pkey" may be used multiple times to define multiple primary keys. Farmer will not consider queries in which atoms of the same predicate have the same terms for all arguments in one of the primary keys of that predicate. - "nooi(T)" By default, all types are evaluated under Object Identity. "nooi" defines that type "T" should not be evaluated under Object Identity. In some cases, proper refinement is not possible if a type is not evaluated under Object Identity (see [3]). Farmer will issue an error when proper refinement is not possible. - "forbid(T,C)" By default, Farmer considers all constants during the search. "forbid" defines that no queries should be considered which contain constant "C" of type "T". - "showstyle(S)" "showstyle" defines how the queries are outputted by Farmer. "S" should be either "queries", "trie" or "modetrie". In the "queries" style, Farmer outputs each frequent query completely, prefixed by its support and length. In the "trie" and "modetrie" styles each query is represented in a tree as a path from the root. Each node in the tree is suffixed by the support of the corresponding query. In the "modetrie" style mode parameters are filled in in the tree. - "showconstantnames(B)" "showconstantnames(B)" defines whether constants should be represented in the "modetrie" style as a number (in which case "B" must be "false") or should be translated back to the corresponding constant in the database (in which case "B" must be "true"). "showconstantnames(false)" is more efficient, but less instructive. - "minsup(T)" "minsup" defines the minimum support threshold as an absolute integer number "T". - "key(k(-))" "key" defines that "k" is the head atom of every query. In the current implementation, "k" must have one argument. Let "T" be the type of this single argument (as must be defined using a "predicate" fact). Then, in the counting procedure, each constant of type "T" is considered. The number of constants for which the body of a query can be proved, is its support. The argument of "k" in the "key" statement must be "-", possibly followed by an integer. - "mode(p(C1,...,CN))" "mode" defines in what way an atom for predicate "p" may be added to a query during the refinement step of the search algorithm. "Ci" must be either: 1. "+", in which case at argument position "i" the new atom must use a variable that already occurs in the query. 2. "=", in which case at argument position "i" the new atom must use a variable that already occurs in the query, and does not occur at another argument of the new atom. 3. "-", in which case the new atom must have a new variable at argument position "i"; a new variable is a variable that does not already occur in the query. 4. "#", in which case the new atom must contain a constant at argument position "i". 5. An integer, in case the mode occurs in a "b."..."e." block. Mode arguments "+", "-" and "#" may be followed by an integer. - "externmode(f,p(C1,...,CN))" usually, atoms are evaluated using facts in the database. An alternative is to use a function programmed in C++ and compiled into Farmer. An "externmode" defines that an atom generated by this mode should be evaluated using C++ function "f". Currently, in "externmode"s "Ci" may not be "-". Farmer provides two built-in C++ functions: "true" and "is". "true" is a function that always returns "true" and can be used for atoms that should always evaluate to "true". "is" is a function that only returns true if its first argument equals its second argument, and can be used to avoid the addition of facts like "is(a,a)". It should be clear that "externmode"s should be used with very much caution. - "b." ... "e." A "mode" or "key" declaration "M" may be followed by a "b." fact on the next line. The "b." fact denotes the start of a set of modes; one mode in this set is applied during query construction immediately after mode "M" has been applied. The modes are never applied in any other situation. "M" is said to be the parent of the modes within the block. "b." ... "e." blocks may be nested. A mode within a nested "b."..."e." block can therefore have several ancestors. Within a "b."..."e." block integers (see list item 5. in the list of mode parameters)) may be used as mode arguments. Such an integer marker refers to the argument of an ancestor mode (to this purpose, in an ancestor mode an argument can be suffixed with an integer to mark that mode argument). When a mode in a "b." ... "e." block is applied to introduce an atom, for arguments marked with an integer the term is used that occurs in the atom introduced by the ancestor mode. Within a "b."..."e." block no two modes may occur that can possibly generate the same atom. So, the bias mechanism differs in many details from other bias mechanisms such as used by systems like Warmr, Tilde, Progol and Aleph. In Farmer, no maximum numbers are allowed in modes. Numbers in modes can yield a situation in which one query cannot be constructed, but a second query that is equivalent to the first query is constructable. One cannot easily determine which queries out of a set of queries cannot be removed to guarantee completeness of the search. This has been solved using maximum numbers for predicates and using mode blocks. * Bias examples ------------- - Connected, edge labeled trees predicate(e(T,N,N,L)). // T = Tree, T = Node in tree, L = edge label predicate(k(T)). // we are going to count trees pkey(e(T,N,N,_)). // at most one label on each edge key(k(-1)). b. mode(e(1,-,-,#)). // an edge which is not connected to any other // edge may only occur once, and is therefore // added immediately after the head of the // query. e. mode(e(+,+,-,#)). - Edge labeled forests predicate(e(T,N,N,L)). predicate(k(T)). pkey(e(T,N,N,_)). key(k(-)). mode(e(+,-,-,#)). mode(e(+,+,-,#)). - Connected, edge labeled trees with possibly wildcards on edges predicate(e(T,N,N,L)). predicate(isL(L,L)). // is used to compare a label variable with a // a label constant predicate(k(T)). pkey(e(T,N,N,_)). pkey(isL(L,_)). // general property of the "is" function pkey(isL(_,L)). // nooi(L). // in order to be true wildcards, // variables of type L should match any label, // even if a label does occur in the query. key(k(-1)). b. mode(e(1,-,-,-)). e. mode(e(+,+,-,-)). externmode(is,isL(+,#)). // equality check between a label variable and // a constant. No isL facts may occur in the // database. isL is evaluated using the built-in // is function. - Connected, undirected, edge labeled graphs predicate(e(G,N,N,L)). predicate(k(G)). pkey(e(G,N,N,_)). key(k(-1)). b. mode(e(1,-2,-3,#4)). b. mode(e(1,3,2,4)). // to capture the symmetry of edges // in undirected graphs, an edge must // be encoded using two atoms, for each // direction one. // The second atom must always be the reverse // of the first. // Using this mode, after k(G),e(G,N1,N2,a) // as next atom e(G,N2,N1,a) will be introduced. // More efficent is even to define // externmode(true,e(1,3,2,4)). e. e. mode(e(+1,+2,-3,#4)). // to introduce an edge to a new node b. mode(e(1,3,2,4)). e. mode(e(+1,+2,+3,#4)). // to link two existing nodes b. mode(e(1,3,2,4)). e. * Optimizing the bias ------------------- Farmer performs better on some biases: - modes in which the order of arguments is "+" of key type, "#", "+" of other types, "-" are evaluated more efficiently than other mode orders. For example: predicate(e(G,L,N,N)). mode(e(+,#,+,-)). is more efficient than: predicate(e(G,N,N,L)). mode(e(+,-,+,#)). - modes which are known to introduce atoms that are (almost) always true, should be as deep in a nested "b."..."e." tree as possible. - "externmode(true,e(...))" is more efficient than evaluating "e" using the database. These guidelines are illustrated in the Carcinogenesis example.