A Quick Guide

To use Koko, there are three simple steps that you need to know about:

Installing Koko
Running Koko
Writing queries using Koko

Here, we discuss each of these steps briefly.

Installation

Koko is registered on PyPI and can be installed with the following command:

pip install pykoko

Note: The current version of Koko is only compatible with Python 3.

Running Koko

To run Koko, we first need to have a basic query to execute. We will discuss the syntax and semantics of Koko queries in the next section, but for now, let’s focus on a simple task.

Which characters from Les Miserables introduce themselves in the book?

We can obtain a copy of the book from http://www.gutenberg.org/files/135/135-0.txt. The next step would be to write a simple query in a query file (let’s call it miserables.koko) as follows.
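If you prefer to fetch the book from Python, a minimal sketch using only the standard library could look like this. The helper names local_name and fetch_book are hypothetical and not part of Koko; the point is only that the downloaded file ends up with the same name ("135-0.txt") that the query refers to.

```python
import posixpath
import urllib.request
from urllib.parse import urlparse

BOOK_URL = "http://www.gutenberg.org/files/135/135-0.txt"


def local_name(url):
    """Return the last path component of the URL, e.g. '135-0.txt'."""
    return posixpath.basename(urlparse(url).path)


def fetch_book(url=BOOK_URL):
    """Download the book into the current directory and return its file name."""
    name = local_name(url)
    urllib.request.urlretrieve(url, name)  # requires network access
    return name


# The query below reads "135-0.txt", so the local file should carry that name.
print(local_name(BOOK_URL))
# fetch_book()  # uncomment to actually download the file
```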

extract "Ents" x from "135-0.txt" if
        ("My name is" x {0.1})
with threshold 0.0

The query above states that we are interested in entities (or "Ents") that immediately follow the words My name is in the specified text file. Every name that matches this pattern receives a score of 0.1. The threshold allows us to exclude entities with a low score, but in this case we are interested in extracting all the names matching our specified pattern.
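To build intuition for what this query asks for, here is a rough plain-Python approximation. It is not how Koko works internally: a run of capitalized words stands in for an entity, and the sketch assumes that repeated matches add up, which is one reading of the 0.200000 scores in the results for this query. The names NAME, PATTERN and score_names are made up for this illustration.

```python
import re
from collections import Counter

# Rough stand-in for an entity: one or more capitalized words in a row.
NAME = r"[A-ZÀ-ÖØ-Þ][\w'-]*(?:\s+[A-ZÀ-ÖØ-Þ][\w'-]*)*"
PATTERN = re.compile(r"My name is\s+(" + NAME + ")")


def score_names(text, score=0.1, threshold=0.0):
    """Score each name following 'My name is' (assumption: matches add up)."""
    counts = Counter(m.group(1) for m in PATTERN.finditer(text))
    scores = {name: round(n * score, 6) for name, n in counts.items()}
    # Entities scoring below the threshold are dropped.
    return {name: s for name, s in scores.items() if s >= threshold}


sample = (
    "My name is Jean Valjean. He paused. "
    '"My name is Marius," said one. "My name is Marius," said he again.'
)
print(score_names(sample))
```

With a threshold of 0.0 every matched name is kept; raising it, for example to 0.15, would keep only names matched more than once under the additive assumption.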

Now we can run Koko in a Python script as follows:

import koko
koko.run('miserables.koko')

and the results would be as follows:

Results:

Entity name                                        Entity score
===============================================================
Thénardier                                         0.200000
Marius                                             0.200000
Marius Pontmercy                                   0.200000
Bienvenu Myriel                                    0.100000
Jean Valjean                                       0.100000
Félix Tholomyès                                    0.100000
Madame Thénardier                                  0.100000
Champmathieu                                       0.100000
Pontmercy                                          0.100000
Gribier                                            0.100000
Lesgle                                             0.100000
Éponine                                            0.100000
Cosette                                            0.100000
Euphrasie                                          0.100000
Eight                                              0.100000

Writing Koko Queries

Here, we discuss the expressive power of Koko queries and how they can be crafted. We start by describing the generic structure of a Koko query.

The following shows the structure of a Koko query, which consists of four main parts.

extract "<type>" x from "<document>" if      # Part 1: specifying the task.
       (<pattern_1> {<score_1>}) or          # Part 2: specifying the patterns
       (<pattern_2> {<score_2>}) or          #         of interest.
       ...
       (<pattern_n> {<score_n>})
with threshold <threshold_value>             # Part 3: specifying the threshold.
excluding                                    # Part 4: specifying patterns for
       (<excluding_pattern_1>) or            #         which the matching entities
       (<excluding_pattern_2>) or            #         should be ignored.
       ...
       (<excluding_pattern_m>)
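The template above can also be filled in programmatically. The following is a minimal sketch, not part of Koko: build_query is a hypothetical helper that only formats text according to the four parts of the template, and it assumes the excluding part may be omitted (as in the miserables.koko query).

```python
def build_query(span_type, document, patterns, threshold=0.0, excluding=()):
    """Assemble a query string following the four-part template.

    patterns:  list of (pattern_text, score) pairs, joined with `or`.
    excluding: list of pattern texts without scores, joined with `or`.
    """
    # Part 1: the task.
    lines = [f'extract "{span_type}" x from "{document}" if']
    # Part 2: the scored patterns.
    scored = [f"        ({text} {{{score}}})" for text, score in patterns]
    lines.append(" or\n".join(scored))
    # Part 3: the threshold.
    lines.append(f"with threshold {threshold}")
    # Part 4: the excluding patterns (assumed optional).
    if excluding:
        lines.append("excluding")
        lines.append(" or\n".join(f"        ({text})" for text in excluding))
    return "\n".join(lines) + "\n"


query = build_query(
    "Ents",
    "135-0.txt",
    patterns=[('"My name is" x', 0.1)],
    excluding=[r'str(x) matches "^\S*$"'],
)
print(query)
```

The resulting string can be written to a query file and passed to koko.run in the same way as miserables.koko.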

Part 1 - Specifying the task: This is the first line of the query, which sets two parameters: (1) <type>, which specifies what kind of text spans are considered for extraction (for instance, we can ask Koko to extract 2-grams, noun phrases, or entities), and (2) <document>, which specifies the document that the query will be executed on.
Part 2 - Specifying the patterns: This part lists the patterns that describe the desired entities. Each pattern is accompanied by a score that determines how important the pattern is. More precisely, an entity receives the specified score if it matches the pattern. Perhaps the simplest pattern specifies which words follow or precede an entity (similar to the pattern in the miserables.koko query). We will provide a detailed list of possible patterns in the next section, but here are a few self-explanatory examples:

("My name is " x {0.1})
(x "is the number one country in" {0.5})
(str(x) contains "Cafe" {0.2})

Part 3 - Specifying the threshold: Based on the specified patterns, each entity will receive a total score which represents how well it matches the patterns. The threshold can be used to prune the entities that have received a low score to obtain high-quality results. Any entity with a score smaller than the threshold will be eliminated from the results.
Part 4 - Specifying the excluding patterns: The patterns specified in this section are used by Koko to exclude entities from the results. Note that these patterns are not accompanied by a score, because any entity that matches one of them (even once) is eliminated from the results. For example, we can write a pattern to exclude entities that don’t contain any spaces, which essentially removes single words from the final results. Here is an updated example of our miserables.koko query that uses an excluding pattern:

extract "Ents" x from "135-0.txt" if
        ("My name is" x {0.1})
with threshold 0.0
excluding
        (str(x) matches "^\S*$")    # matches if there are no whitespaces

which results in:

Results:

Entity name                                        Entity score
===============================================================
Marius Pontmercy                                   0.200000
Bienvenu Myriel                                    0.100000
Jean Valjean                                       0.100000
Félix Tholomyès                                    0.100000
Madame Thénardier                                  0.100000
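The effect of the excluding pattern can be reproduced in plain Python by applying the same regular expression to the entity names from the first result table. This is only an illustration of the filtering step, under the assumption that Koko's matches behaves like a regular-expression search; apply_exclusion is a hypothetical helper, not a Koko function.

```python
import re

# Entity names and scores copied from the first result table.
first_results = [
    ("Thénardier", 0.2), ("Marius", 0.2), ("Marius Pontmercy", 0.2),
    ("Bienvenu Myriel", 0.1), ("Jean Valjean", 0.1), ("Félix Tholomyès", 0.1),
    ("Madame Thénardier", 0.1), ("Champmathieu", 0.1), ("Pontmercy", 0.1),
    ("Gribier", 0.1), ("Lesgle", 0.1), ("Éponine", 0.1), ("Cosette", 0.1),
    ("Euphrasie", 0.1), ("Eight", 0.1),
]

NO_WHITESPACE = re.compile(r"^\S*$")  # same regular expression as in the query


def apply_exclusion(results, pattern=NO_WHITESPACE):
    """Drop every entity whose name matches the excluding pattern."""
    return [(name, score) for name, score in results if not pattern.search(name)]


for name, score in apply_exclusion(first_results):
    print(f"{name:<50} {score:f}")
```

Only the five names that contain a space survive the filter, which agrees with the result table above.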

Patterns Supported by Koko

Koko supports a wide range of patterns which can be combined to make powerful queries. Here, we describe these patterns and show a few examples of each which, as before, we run on the text of Les Miserables. Note that not all patterns are binary, and thus only a subset can be used as excluding patterns. For each pattern, we also state whether it can be used in the excluding section.

Entity name token containment

This pattern allows us to check if an entity contains a sequence of specified tokens. Note that the tokens should appear in the same exact form and order to be considered a match.

# Which characters have the title "Count"?
extract "Ents" x from "135-0.txt" if
        (str(x) contains "Count" {0.1})
with threshold 0.0

Results:

Entity name                                        Entity score
===============================================================
Count                                              0.300000
Count Lynch                                        0.100000
Count Anglès                                       0.100000
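The matching rule described above (same form, same order) can be sketched in plain Python as follows. This is an approximation for illustration only: it assumes whitespace tokenization and that the specified tokens must be adjacent, neither of which is confirmed here as Koko's exact behavior, and contains_tokens is a hypothetical name.

```python
def contains_tokens(entity, phrase):
    """True if the tokens of `phrase` occur in `entity`, adjacent and in order."""
    entity_tokens = entity.split()
    phrase_tokens = phrase.split()
    n = len(phrase_tokens)
    if n == 0:
        return False
    # Slide a window of n tokens over the entity and compare exactly.
    return any(
        entity_tokens[i:i + n] == phrase_tokens
        for i in range(len(entity_tokens) - n + 1)
    )


for candidate in ["Count", "Count Lynch", "Countess Lynch", "count Lynch"]:
    print(candidate, contains_tokens(candidate, "Count"))
```

Under these assumptions, "Countess Lynch" and "count Lynch" do not match, because the comparison is made on whole tokens and is case-sensitive.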