Improving legacy Om code (II): Using effects and coeffects to isolate effectful code from pure code

Introduction.

In the previous post, we applied the humble object pattern idea to avoid having to write end-to-end tests for the interesting logic of a hard to test legacy Om view, and managed to write cheaper unit tests instead. Then, we saw how those unit tests were far from ideal because they were highly coupled to implementation details, and how these problems were caused by a lack of separation of concerns in the code design.

In this post we’ll show a solution to those design problems using effects and coeffects that will make the interesting logic pure and, as such, really easy to test and reason about.

Refactoring to isolate side-effects and side-causes using effects and coeffects.

We refactored the code to isolate side-effects and side-causes from pure logic. This way, not only did testing the logic get much easier (the logic would now live in pure functions), but the tests also became less coupled to implementation details. To achieve this we introduced the concepts of coeffects and effects.

The basic idea of the new design was:

  1. Extracting all the needed data from globals (using coeffects for getting application state, getting component state, getting DOM state, etc).
  2. Using pure functions to compute the description of the side effects to be performed (returning effects for updating application state, sending messages, etc) given what was extracted in the previous step (the coeffects).
  3. Performing the side effects described by the effects returned by the called pure functions.

After this refactoring, the main difference in the code of horizon.controls.widgets.tree.hierarchy was that the event handler functions were moved back into it, and that they now used the process-all! and extract-all! functions to perform the side-effects described by effects and to extract the values of the side-causes tracked by coeffects, respectively. The event handler functions are shown in the next snippet (to see the whole code click here):

Now all the logic in the companion namespace consisted of pure functions, with neither asynchronous nor mutating code:

Thus, its tests became much simpler:

Notice how the pure functions receive a map of coeffects already containing all the values they need from the “world”, and how they return a map with descriptions of the effects. This makes testing much easier than before and removes the need to use test doubles.
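
To make the shape concrete, here is a made-up example in the same style (the coeffect and effect keys and the handle-node-selection function are illustrative, not the actual horizon code):

(require '[clojure.test :refer [deftest is]])

;; Illustrative only: a pure function from a map of coeffects
;; to a map of effect descriptions.
(defn handle-node-selection
  [{:keys [selected-node expanded-nodes]}]
  (if (contains? expanded-nodes selected-node)
    {:update-app-state {:path  [:tree :expanded-nodes]
                        :value (disj expanded-nodes selected-node)}}
    {:update-app-state {:path  [:tree :expanded-nodes]
                        :value (conj expanded-nodes selected-node)}
     :send-message     {:type :node-expanded :node selected-node}}))

;; Its test is plain data in, plain data out: no test doubles, no async machinery.
(deftest handle-node-selection-test
  (is (= {:update-app-state {:path  [:tree :expanded-nodes]
                             :value #{}}}
         (handle-node-selection {:selected-node  :node-1
                                 :expanded-nodes #{:node-1}}))))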

Notice also how the test code is now around 100 lines shorter. The main reason is that the new tests know much less about how the production code is implemented than the previous ones did. This made it possible to remove some tests that, in the previous version of the code, exercised branches that seemed reachable when we were testing implementation details, but that are actually unreachable when the behaviour is considered as a whole.

Now let’s see the code that is extracting the values tracked by the coeffects:

which uses several implementations of the Coeffect protocol:
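
A sketch of what such a protocol and the extract-all! loop can look like (the concrete coeffect types and the key-based indexing are assumptions made for this example, not the real application code):

(defonce app-state (atom {:tree {:expanded-nodes #{:root}}}))

(defprotocol Coeffect
  (extract [this] "Extracts from the world the value this coeffect tracks."))

(defrecord AppStateCoeffect [key path]
  Coeffect
  (extract [_] (get-in @app-state path)))

(defrecord ConstantCoeffect [key value] ;; stands in for DOM or component-state reads
  Coeffect
  (extract [_] value))

(defn extract-all!
  "Builds the coeffects map handed to the pure logic: coeffect key -> extracted value."
  [coeffects]
  (into {} (map (juxt :key extract)) coeffects))

(extract-all! [(->AppStateCoeffect :expanded-nodes [:tree :expanded-nodes])
               (->ConstantCoeffect :selected-node :root)])
;; => {:expanded-nodes #{:root}, :selected-node :root}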

All the coeffects were created using factories in order to localize the “shape” of each type of coeffect in only one place. This indirection proved very useful when we decided to refactor the code that extracts the value of each coeffect, replacing its initial implementation as a conditional with the current one based on polymorphism through a protocol.

These are the coeffects factories:

Now there was only one place where we needed to test side causes (using test doubles for some of them). These are the tests for extracting the coeffects values:

Very similar code processes the side-effects described by the effects:

which uses different effects implementing the Effect protocol:
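
Sketched in the same style as the coeffects above (again, the concrete effect types here are made up for illustration):

(defonce app-state (atom {:tree {:expanded-nodes #{}}}))

(defprotocol Effect
  (process! [this] "Performs the side effect this value describes."))

(defrecord UpdateAppStateEffect [path value]
  Effect
  (process! [_] (swap! app-state assoc-in path value)))

(defrecord LogEffect [message] ;; stands in for sending messages and the like
  Effect
  (process! [_] (println message)))

(defn process-all!
  "Performs every side effect described in the effects map returned by the pure logic."
  [effects]
  (doseq [[_ effect] effects]
    (process! effect)))

(process-all! {:update-app-state (->UpdateAppStateEffect [:tree :expanded-nodes] #{:root})
               :log              (->LogEffect "node expanded")})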

These effects are created with the following factories:

Finally, these are the tests for processing the effects:

Summary.

We have seen how, by using the concepts of effects and coeffects, we were able to refactor our code into a new design that isolates the effectful code from the pure code. This made testing our most interesting logic really easy because it came to consist of only pure functions.

The basic idea of the new design was:

  1. Extracting all the needed data from globals (using coeffects for getting application state, getting component state, getting DOM state, etc).
  2. Computing in pure functions the description of the side effects to be performed (returning effects for updating application state, sending messages, etc) given what was extracted in the previous step (the coeffects).
  3. Performing the side effects described by the effects returned by the called pure functions.

Since we did this refactoring, we have decided to go deeper into this way of designing code, and we’re implementing a full effects & coeffects system inspired by re-frame.

Acknowledgements.

Many thanks to Francesc Guillén, Daniel Ojeda, André Stylianos Ramos, Ricard Osorio, Ángel Rojo, Antonio de la Torre, Fran Reyes, Miguel Ángel Viera and Manuel Tordesillas for giving me great feedback to improve this post and for all the interesting conversations.

Permalink

Improving legacy Om code (I): Adding a test harness

Introduction.

I’m working at GreenPowerMonitor as part of a team developing a challenging SPA to monitor and manage renewable energy portfolios using ClojureScript. It’s a two-year-old Om application which contains a lot of legacy code. When I say legacy, I’m using Michael Feathers’ definition of legacy code as code without tests. This definition views legacy code from the perspective of code being difficult to evolve because of a lack of automated regression tests.

The legacy (untested) Om code.

Recently I had to face one of these legacy parts when I had to fix some bugs in the user interface that presents all the devices of a given energy facility in a hierarchy tree (devices might be composed of other devices). This is the original legacy view code:

This code contains not only the layout of several components but also the logic to both conditionally render some parts of them and to respond to user interactions. This interesting logic is full of asynchronous and effectful code that is reading and updating the state of the components, extracting information from the DOM itself and reading and updating the global application state. All this makes this code very hard to test.

Humble Object pattern.

It’s very difficult to make component tests for non-component code like the one in this namespace, which makes writing end-to-end tests look like the only option.

However, following the idea of the humble object pattern, we might reduce the untested code to just the layout of the view. The humble object pattern can be used when code is too closely coupled to its environment to be testable. To apply it, the interesting logic is extracted into a separate, easy-to-test component that is decoupled from its environment.

In this case we extracted the interesting logic to a separate namespace, where we thoroughly tested it. With this we avoided writing the slower and more fragile end-to-end tests.

We wrote the tests using the test-doubles library (I’ve talked about it in a recent post) and some home-made tools that help testing asynchronous code based on core.async.

This is the logic we extracted:

and these are the tests we wrote for it:

See here how the view looks after this extraction. Using the humble object pattern, we managed to test the most important bits of logic with fast unit tests instead of end-to-end tests.

The real problem was the design.

We could have left the code as it was (in fact we did for a while) but its tests were highly coupled to implementation details and hard to write because its design was far from ideal.

Even though, by applying the humble object pattern idea, we had separated the important logic from the view, which allowed us to focus on writing tests with more ROI and to avoid end-to-end tests, the extracted logic still contained many concerns. It was not only deciding how to interact with the user and what to render, but also mutating and reading state, getting data from global variables and from the DOM, and making asynchronous calls. Its effectful parts were not isolated from its pure parts.

This lack of separation of concerns made the code hard to test and hard to reason about, forcing us to use heavy tools: the test-doubles library and our async-test-tools assertion functions to be able to test the code.

Summary.

First, we applied the humble object pattern idea so that we could write unit tests for the interesting logic of a hard to test legacy Om view, instead of having to write more expensive end-to-end tests.

Then, we saw how those unit tests were far from ideal because they were highly coupled to implementation details, and how these problems were caused by a lack of separation of concerns in the code design.

Next.

In the next post we’ll solve the lack of separation of concerns by using effects and coeffects to isolate the logic that decides how to interact with the user from all the effectful code. This new design will make the interesting logic pure and, as such, really easy to test and reason about.

Permalink

neo4j-clj: a new Neo4j library for Clojure

On designing a ‘simple’ interface to the Neo4j graph database

While creating a platform where humans and AI collaborate to detect and mitigate cybersecurity threats at CYPP, we chose to use Clojure and Neo4j as part of our tech stack. To do so, we created a new driver library (around the Java Neo4j driver), following the clojuresque way of making simple things easy. And we chose to share it, to co-develop it under the Gorillalabs organization. Follow along to understand our motivation, get to know our design decisions, and see examples. If you choose a similar tech stack, this should give you a head start.


Who we are

Gorillalabs is a developer-centric organization (not a Company) dedicated to Open Source Software development, mainly in Clojure.

I (@Chris_Betz on Twitter, @chrisbetz on Github) created Gorillalabs to host Sparkling, a Clojure library for Apache Spark. Coworkers joined in, and now Gorillalabs brings together people and code from different employers to create a neutral collaboration platform. I work at CYPP, simplifying cybersecurity for mid-sized companies.

Most of Gorillalabs projects stem from the urge to use the best tools available for a job and make them work in our environment. That’s the fundamental idea and the start of our organization. And for our project at CYPP, using Clojure and Neo4j was the best fit.

Why Clojure?

I started using Common Lisp in the ’90s, moved to Java development for a living, and switched to using Clojure in production in 2011 as a good synthesis of the two worlds. And, while constantly switching back and forth between designing and developing software and managing software development, I specialized in delivering research-heavy projects.

For many of those projects, Clojure has two nice properties: First, it comes with a set of immutable data structures (reducing errors a lot, making it easier to evolve the domain model). And second, with the combination of ClojureScript and Clojure, you can truly use one language in backend and frontend code. Although you need to understand different concepts on both ends, with your tooling staying the same, it is easier to develop vertical (or feature) slices instead of horizontal layers. Check out my EuroClojure 2017 talk on that, if you’re interested.


Graphs are everywhere — so make use of them

For threat hunting, i.e. the process of detecting cybersecurity threats in an organisation, graphs are a natural data modelling tool. The most obvious graph is the one where computers are connected through TCP/IP connections. You can find malicious behaviour if one of your computers shows unwanted connections. (Examples are over-simplified here.)

But that’s just the 30,000-foot view. In fact, connections are between processes running on computers. And you see malicious behaviour if a process binds to an unusual port.

Processes are running with a certain set of privileges defined by the “user” running the process. Again, it’s suspicious if a user who should be unprivileged started a process listening for an inbound connection.

You get the point: Graphs are everywhere, and they help us cope with threats in a networked world.

Throughout our quest for the best solution, we experimented with other databases and query languages, but we settled on Neo4j and Cypher. First, it’s a production-quality database solution, and second, it has a query language you can really use. We used TinkerPop/Gremlin before, but found it not easy to use for simple things, and really hard for complex queries.

Why we created a new driver

There’s already a Neo4j driver for Clojure. There’s even an example project on the Neo4j website. What on earth were we thinking creating our own Neo4j driver?

Neo4j introduced Bolt with Neo4j 3.x as the new protocol to interact with Neo4j. It made immediate sense; however, Neocons did not pick it up, at least not at the pace we needed. Instead, it seemed as if the project had lost traction, having had only very few contributions for a long time. So we needed to decide whether or not to fork Neocons to move it to Neo4j 3.x.

However, with Bolt and the new Neo4j Java driver, that fork would have amounted to a second, parallel implementation of the driver. That was the point where we decided to go all the way and build a new driver: neo4j-clj was born.

Design choices and code examples

Creating a new driver gave us the opportunity to fit it exactly to our needs and desires. We made choices you might like or disagree with, but you should know why we made them.

If you want to follow the examples below, you need to have a Neo4j instance up and running.

Then, you just need to know one namespace alias for neo4j-clj.core and one connection to your test database (also named db):

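Assuming a local test instance and placeholder credentials, that can look roughly like this (the namespace, URL, user and password are placeholders; the var and the namespace alias are both called db, which is fine because they don't clash):

(ns example.core
  (:require [neo4j-clj.core :as db]))

;; the connection to the test database; URL and credentials are placeholders
(def db
  (db/connect "bolt://localhost:7687" "neo4j" "password"))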

Using “raw” Cypher

The most obvious thing is our choice to keep “raw” Cypher queries as strings, but to be able to use them as Clojure functions. The idea is actually not new and not our own, but borrowed from yesql. Doing so, you do not bend one language (Cypher) into another (Clojure), but keep each language for the problems it is designed for. And, as a bonus, you can easily copy code over from one tool (code editor) to another (Neo4j browser), or use IDE plugins to query a database with the Cypher queries from your code.

So, to create a function wrapping a Cypher query, you just wrap that Cypher string in a defquery macro like this:

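A minimal sketch of such a definition, using the db alias from above (the Cypher itself is illustrative):

;; wraps a raw Cypher string into a Clojure function
(db/defquery get-all-hosts
  "MATCH (h:Host) RETURN h AS host")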

And, you can easily copy the string into your Neo4j browser or any other tool to check the query, profile it, whatever you feel necessary.

With this, you can easily run the query like this:

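Using the connection defined above, running it is just a matter of opening a session and calling the generated function:

(with-open [session (db/get-session db)]
  (get-all-hosts session))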

and, depending on the data in your test database, you will end up with a sequence of maps representing your hosts. The result is shaped something like this:

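The keys in each result map come from the names in the RETURN clause, so with the query above the shape is roughly this (the values here are made up):

({:host {:name "web-01", :ip "10.0.0.5"}}
 {:host {:name "db-01", :ip "10.0.0.6"}})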

This style makes it clearer that you should not be constructing queries on the fly, but using a defined set of queries in your codebase. If you need a new query, define one specifically for that purpose. Think about which indices you need, how this query performs best, reads best, you name it.

However, this decision has some drawbacks. There’s no compiler support and no IDE checking, as Cypher queries are not recognized as such; they are just strings. Then again, there’s not much Cypher support in IDEs anyhow. That’s different from yesql, where you usually get SQL linting in the appropriate files.

Each query function returns a list, even if it’s empty. There’s no convenience function for creating queries that return a single object (for something like host-by-id). If you know there's only one result, pick it using first.

Relying on the Java driver, but working with Clojure data structures

We just make use of the Java driver, so basically, neo4j-clj is only a thin wrapper. However, we wanted to be able to live in the Clojure world as much as possible. To us, that meant we need to interact with Neo4j using Clojure data structures. You saw that in the first example, where a query function returns a list of maps.

However, you can also parameterize your queries using maps:

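A sketch of a parameterized query and its call, again with illustrative names, continuing from the connection above:

(db/defquery get-host-by-id
  "MATCH (h:Host) WHERE h.id = $host.id RETURN h AS host")

(with-open [session (db/get-session db)]
  (get-host-by-id session {:host {:id "3d5e9a"}}))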

This example is more complex than necessary just to make a point clear: You can destructure Clojure maps {:host {:id "..."}} by navigating them in Cypher $host.id.

The nice thing is, you can easily test these queries in the Neo4j browser if you set the parameters correctly.


Joplin integration built-in

We’re fans of having seeding and migration code for the database in version control. Thus, we use Joplin, and we suggest you do, too. That’s why we built Joplin support right into neo4j-clj.

With Joplin, you can write migrations and seed functions to populate your database. This isn’t as important in Neo4j as it is in relational databases, but it’s necessary, e.g. for index or constraint generation.

First, Joplin migrates your database if it isn’t at the latest stage; path-to-joplin-neo4j-migrators points to a folder of migration files, which are applied in alphabetical order.


Each migration file has (at least) the two functions up and down to perform the actual migration.


You can also seed your database from a seed function.


You then run the seeding with a config identical to the one used for migration.


The seed function also shows a style we got used to: we prefix all the functions created by defquery with db> and we use the ! suffix to mark functions with side effects. That way, you see when code leaves your platform and what you can expect to happen.

Tested all the way

Being big fans of testing, we wanted the tests for our driver to be as easy and as fast as possible. You should be able to combine that with a REPL-first approach, where you can experiment in the REPL. Luckily, you can run Neo4j in embedded mode, so we did not need to rely on an existing Neo4j installation or a running Docker image of Neo4j. Instead, all our tests run isolated in embedded Neo4j instances. We just needed to make sure not to use the Neo4j embedded API but the Bolt protocol. Easy: my colleague Max Lorenz just bound the embedded Neo4j instance to an open port and connected the driver to it, just as you would do in production.

Using a with-temp-db-fixture, we just create a new session against that embedded database and test the neo4j-clj functions in a round trip without external requirements. Voilà.

Use it, fork it, blog it

neo4j-clj is ready to be used. We do. We’d love to hear from you (@gorillalabs_de or @chris_betz). Share your experiences with neo4j-clj.

There are still some rough edges: maybe you need more configuration options, or support for some other property types, especially the new Date/Time and Geolocation types. We’ll add stuff over time. If you need something specific, please open an issue on GitHub, or add it yourself and create a pull request against the ‘develop’ branch.

We welcome contributions, so feel free to hack right away!


neo4j-clj: a new Neo4j library for Clojure was originally published in neo4j on Medium, where people are continuing the conversation by highlighting and responding to this story.

Permalink

Learning Ring And Building Echo

When you come to Clojure and want to build a web app, you'll discover Ring almost immediately. Even if you use another library like Compojure, you'll likely find Ring in use.

What is Ring?

As stated in the Ring repo, it is a library that abstracts the details of HTTP into a simple API. It does this by turning HTTP requests into Clojure maps which can be inspected and modified by a handler that returns an HTTP response. The handlers are Clojure functions that you create. You are also responsible for creating the response. Ring connects your handler with the underlying web server and is responsible for taking requests and calling your handler with the request map.

If you've had experience with Java Servlets you'll notice a pattern here, but you'll quickly see how much simpler things are with Ring.

Requests

Requests typically come from web browsers and can have a number of fields. Requests also have different methods (GET, POST, etc.), a URI with an optional query string, and a message body. Ring takes all of this information and converts it into a Clojure map.

Here is an example of a request map generated by Ring for a request to http://localhost:3000.

{:ssl-client-cert nil,
 :protocol "HTTP/1.1",
 :remote-addr "0:0:0:0:0:0:0:1",
 :headers
 {"cache-control" "max-age=0",
  "accept"
  "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
  "upgrade-insecure-requests" "1",
  "connection" "keep-alive",
  "user-agent"
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
  "host" "localhost:3000",
  "accept-encoding" "gzip, deflate, br",
  "accept-language" "en-US,en;q=0.9"},
 :server-port 3000,
 :content-length nil,
 :content-type nil,
 :character-encoding nil,
 :uri "/",
 :server-name "localhost",
 :query-string nil,
 :body "",
 :scheme :http,
 :request-method :get}
 

How did I get this map so I could show it here? From Echo, which is the application I'll describe next.

Echo

When learning about Ring you'll hear all about the request map, and you'll learn over time what fields it typically contains, but sometimes you'll want to see it. Also, you may want to see what a script or form is posting to another system. In such a situation it can be very handy to point the script or form at a debugging site which shows what they are sending. This is the purpose of Echo.

To state it clearly: Echo is to be a web application that echoes back everything that is sent to it. It should take its request map and format it in such a way that it can be returned to the caller.

Steps

  1. Start a new Clojure project
$ lein new echo
  2. Add Ring dependencies

Add ring-core and ring-jetty-adapter to your project.clj file. Also, add a :main entry pointing at echo.core so you can run your application.

  :dependencies [[org.clojure/clojure "1.8.0"]
                 [ring/ring-core "1.6.3"]
                 [ring/ring-jetty-adapter "1.6.3"]]
  :main echo.core
  3. Create a handler and connect it to your Ring adapter

Inside of core.clj, add a handler function and connect it to your adapter. Also add :gen-class and create a -main function so you can run your application.

Here is the complete core.clj file. Notice that you are requiring ring.adapter.jetty. This is the adapter for the Jetty web server; it passes requests to your handler.

Here the handler will return a minimal response with the words "Hello from Echo".

(ns echo.core
  (:require [ring.adapter.jetty :as jetty])
  (:gen-class))


(defn handler [request]
  {:status 200
   :headers {"Content-Type" "text/plain"}
   :body "Hello from Echo"})


(defn -main []
  (jetty/run-jetty handler {:port 3000}))

At this point you can test your app. From the root of your project enter the following to run it.

$ lein run

Then open a browser to http://localhost:3000/. You should see the following as a response.

Hello from Echo
  4. Modify the handler to return the full request

Next we'll modify the handler to return everything in the request. But there are a couple of things to figure out to make this work. First, you can't just send the request back or you'll get an error, since the body of the request needs to be read first.

To see what I mean modify your handler so it simply returns the request. The snippet looks like:

:body request

As a next step you might want to pprint the request. You can try this by adding [clojure.pprint :as pprint] to your require clause and then calling pprint on the request, thinking its output will go into the body, with the following snippet.

:body (pprint/pprint request)

Try that and watch the terminal where you entered lein run. You'll see the pprint output there, because pprint prints to standard output and returns nil.

Now that would be great if it was passed back to the browser. How? By capturing the output of pprint to a string and then passing that string to the browser through the :body field.

(defn handler [request]
  (let [s (with-out-str (pprint/pprint request))]
    {:status 200
     :headers {"Content-Type" "text/plain"}
     :body   s
     }))

At this point go through and test with a new request from a browser. Here I see the following:

{:ssl-client-cert nil,
 :protocol "HTTP/1.1",
 :remote-addr "0:0:0:0:0:0:0:1",
 :headers
 {"cache-control" "max-age=0",
  "accept"
  "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
  "upgrade-insecure-requests" "1",
  "connection" "keep-alive",
  "user-agent"
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
  "host" "localhost:3000",
  "accept-encoding" "gzip, deflate, br",
  "accept-language" "en-US,en;q=0.9"},
 :server-port 3000,
 :content-length nil,
 :content-type nil,
 :character-encoding nil,
 :uri "/",
 :server-name "localhost",
 :query-string nil,
 :body
 #object[org.eclipse.jetty.server.HttpInputOverHTTP 0x494e22ea "HttpInputOverHTTP@494e22ea"],
 :scheme :http,
 :request-method :get}

There is one thing here that isn't a problem yet but will be when you try Echo with a POST from a form. It's the :body field. See how it is an HttpInputOverHTTP object. This is something you want to read before sending, so it shows up in the response. To do this, see this final version of the handler.

(defn handler [request]
  (let [s (with-out-str (pprint/pprint (conj request {:body (slurp (:body request))})))]
    {:status 200
     :headers {"Content-Type" "text/plain"}
     :body   s}))

Notice how the :body of the request is read with the slurp function, and how the value of the :body field in the request is then replaced using conj before the whole map is passed to pprint.

With a final test you should see something similar to the following:

{:ssl-client-cert nil,
 :protocol "HTTP/1.1",
 :remote-addr "0:0:0:0:0:0:0:1",
 :headers
 {"cache-control" "max-age=0",
  "accept"
  "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
  "upgrade-insecure-requests" "1",
  "connection" "keep-alive",
  "user-agent"
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
  "host" "localhost:3000",
  "accept-encoding" "gzip, deflate, br",
  "accept-language" "en-US,en;q=0.9"},
 :server-port 3000,
 :content-length nil,
 :content-type nil,
 :character-encoding nil,
 :uri "/",
 :server-name "localhost",
 :query-string nil,
 :body "",
 :scheme :http,
 :request-method :get}
 

Last test. Let's try posting something to our Echo.

$ curl --data 'firstname=Bob&lastname=Smith' http://localhost:3000/add/name?v=1

Here I see the following returned.

{:ssl-client-cert nil,
 :protocol "HTTP/1.1",
 :remote-addr "0:0:0:0:0:0:0:1",
 :headers
 {"user-agent" "curl/7.54.0",
  "host" "localhost:3000",
  "accept" "*/*",
  "content-length" "28",
  "content-type" "application/x-www-form-urlencoded"},
 :server-port 3000,
 :content-length 28,
 :content-type "application/x-www-form-urlencoded",
 :character-encoding nil,
 :uri "/add/name",
 :server-name "localhost",
 :query-string "v=1",
 :body "firstname=Bob&lastname=Smith",
 :scheme :http,
 :request-method :post}
 

Notice the body as well as the uri and query-string values. These are all pulled out of the request by Ring, and as such you can build handlers that look for values in these fields and respond accordingly, as sketched below.
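
For instance, a handler that dispatches on the request method and URI might look like this (a sketch to build on, not part of the Echo code):

(defn routing-handler [request]
  (case [(:request-method request) (:uri request)]
    [:get "/ping"]
    {:status 200
     :headers {"Content-Type" "text/plain"}
     :body "pong"}

    [:post "/add/name"]
    {:status 200
     :headers {"Content-Type" "text/plain"}
     :body (str "you sent: " (slurp (:body request)))}

    ;; default: anything else is a 404
    {:status 404
     :headers {"Content-Type" "text/plain"}
     :body "not found"}))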

Summation

At this point you've created a very basic Ring application. Hopefully, it's one that you can use in the future to help debug other applications. Also, now that you see what fields are in requests you can build up from here.

Source code for the example Echo application is available at https://github.com/bradlucas/echo/tree/release/1.0.0

Permalink

Glad you liked it!

Glad you liked it! :)
I’d be more than happy to have you translate it, do send me a link afterwards (I’m also learning Japanese in my free time xD).

I’m actually curious about the Clojure ecosystem in Japan too, are there any companies using it? What do you think?

Permalink

Shelving; Building a Datalog for fun! and profit?

This post is my speaker notes from my May 3 SF Clojure talk (video) on building Shelving, a toy Datalog implementation.


Databases?

  • 1960s; the first data stores. These early systems were very tightly coupled to the physical representation of data on tape. Difficult to use, develop, query and evolve.
    • CODASYL, a set of COBOL patterns for building data stores; essentially using doubly linked lists
    • IBM's IMS which had a notion of hierarchical (nested) records and transactions
  • 1969; E. F. Codd presents the "relational data model"

Relational data

Consider a bunch of data

[{:type ::developer
  :name {:given "Reid"
         :family "McKenzie"
         :middle "douglas"
         :signature "Reid Douglas McKenzie"}
  :nicknames ["arrdem" "wayde"]}
 {:type ::developer
  :name {:given "Edsger"
         :family "Dijkstra"}
  :nicknames ["ewd"]}]

In a traditional (pre-relational) data model, you could imagine laying out a C-style struct in memory, where the name structure is mashed into the developer structure at known byte offsets from the start of the record. Or perhaps the developer structure references a name by its tape offset and has a length tagged array of nicknames trailing behind it.

The core insight of the relational data model is that we can define "joins" between data structures. But we need to take a couple steps here first.

Remember that maps are sequences of keys and values. So to take one of the examples above,

{:type ::developer
 :name {:given "Edsger"
        :family "Dijkstra"}
 :nicknames ["ewd"]}

;; <=> under maps are k/v sequences

[[:type ::developer]
 [:name [[:given "Edsger"]
          [:family "Dijkstra"]]]
 [:nicknames ["ewd"]]]

;; <=> under kv -> relational tuple decomp.

[[_0 :type ::developer]
 [_0 :name _1]
 [_0 :nickname "ewd"]
 [_1 :given "Edsger"]
 [_1 :family "Dijkstra"]]

We can also project maps to tagged tuples and back if we have some agreement on the order of the fields.

{:type ::demo1
 :foo 1
 :bar 2}

;; <=>

[::demo1 1 2] ;; under {0 :type 1 :foo 2 :bar}

Finally, having projected maps (records) to tuples, we can display many tuples as a table where columns are tuple entries and rows are whole tuples. I mention this only for completeness, as rows and columns are common terms of use and I want to be complete here.

foo bar
1 2
3 4

Okay so we've got some data isomorphisms. What of it?

Well the relational algebra is defined in terms of ordered, untagged tuples.

Traditionally data stores didn't include their field identifiers in the storage implementation as an obvious space optimization.

That's it. That's the relational data model - projecting flat structures to relatable tuple units.

Operating with Tuples

The relational algebra defines a couple of operations on tuples, or to be more precise, on sets of tuples. There are your obvious set theoretic operators - union, intersection and difference - and there are three more.

cartesian product

let R, S be tuple sets ∀r∈R,∀s∈S, r+s ∈ RxS

Ex. {(1,) (2,)} x {(3,) (4,)} => {(1, 3,) (1, 4,) (2, 3,) (2, 4,)}

projection (select keys)

Projection is better known as select-keys. It's an operator for selecting some subset of the fields of every tuple in a tuple space. For instance if we have R defined as

A B C
a b c
d a f
c b d

π₍a,b₎(R) would be the space of tuples from R excluding the C column -

A B
a b
d a
c b

selection

Where projection selects elements from tuples, selection selects tuples from sets of tuples. I dislike the naming here, but I'm going with the original.

To recycle the example R from above,

A B C
a b c
d a f
c b d

σ₍B=b₎(R) - select where B=b over R would be

A B C
a b c
c b d

Joins

Finally given the above operators, we can define the most famous one(s), join and semijoin.

join (R⋈S)

The (natural) join of two tuple sets is the subset of the set RxS where any fields COMMON to both r∈R and s∈S are "equal".

Consider some tables, R

A B C
a b c
d e f

and S,

A D E
a 1 3
d 2 3

We then have R⋈S to be

A B C D E
a b c 1 3
d e f 2 3

semijoin

This is a slightly restricted form of join - you can think of it as the join on some particular column. The (natural) join operation joins on all overlapping columns; if R and S had several overlapping columns, it would join by all of them. But we may want to have several relations between two tables - and consequently leave open the possibility of several different joins.

In general, when talking about joins for the rest of this presentation, I'll be talking about natural joins over tables designed with only one overlapping field, so that the natural join and the semijoin collapse.
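
As an aside, Clojure ships these operators in clojure.set, which treats a set of maps as a relation; here are the R and S tables from the join example above:

(require '[clojure.set :as set])

;; R and S from the join example, as sets of maps
(def R #{{:a "a", :b "b", :c "c"}
         {:a "d", :b "e", :c "f"}})

(def S #{{:a "a", :d 1, :e 3}
         {:a "d", :d 2, :e 3}})

(set/project R [:a :b]) ;; projection, π
;; => #{{:a "a", :b "b"} {:a "d", :b "e"}}

(set/select #(= (:b %) "b") R) ;; selection, σ where B=b
;; => #{{:a "a", :b "b", :c "c"}}

(set/join R S) ;; natural join on the shared :a column
;; => #{{:a "a", :b "b", :c "c", :d 1, :e 3}
;;      {:a "d", :b "e", :c "f", :d 2, :e 3}}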

Enter Datalog

Codd's relational calculus, as we've gone through it, is a formulation of how to view data and data storage in terms of timeless, placeless algebraic operations. Like the Lambda Calculus, or "Maxwell's Laws of Software" as Kay has described the original Lisp formulation, it provides a convenient generic substrate for building up operational abstractions. It's the basis for entire families of query systems which have come since.

Which is precisely what makes Datalog interesting! Datalog is almost a direct implementation of the relational calculus, along with some insights from logic programming. Unfortunately, this also means that it's difficult to give a precise definition of what Datalog is. Like Lisp, it's simple enough that there are decades worth of implementations, re-implementations, experimental features and papers.

Traditionally, Datalog and Prolog share a fair bit of notation so we'll start there.

In traditional Datalog, as in Prolog, "facts" are declared with a notation like this. This particular code is in Soufflé, a Datalog dialect which happened to have an Emacs mode. This is the example I'll be trying to focus on going forwards.

State("Alaska")
State("Arizona")
State("Arkansas")

City("Juneau", "Alaska")
City("Phoenix", "Arizona")
City("Little Rock", "Arkansas")

Population("Juneau", 2018, 32756)
Population("Phoenix", 2018, 1.615e6)
Population("Little Rock", 2018, 198541)

Capital("Juneau")
Capital("Phoenix")
Capital("Little Rock")

Each one of these lines defines a tuple in the datalog "database". The notation is recognizable from Prolog, and is mostly agreed upon.

Datalog also has rules, also recognizable from logic programming. Rules describe sets of tuples in terms of either other rules or sets of tuples. For instance

CapitalOf(?city, ?state) :- State(?state), City(?city, ?state), Capital(?city).

This is a rule which defines the CapitalOf relation in terms of the State, City and Capital tuple sets. The CapitalOf rule can itself be directly evaluated to produce a set of "solutions" as we'd expect.

?city and ?state are logic variables, the ? prefix convention being taken from Datomic.

That's really all there is to "common" datalog. Rules with set intersection/join semantics.

Extensions

Because Datalog is so minimal (which makes it attractive to implement), it's not particularly useful on its own. Like Scheme, it can be a bit of a hair shirt. Most Datalog implementations have several extensions to the fundamental tuple and rule system.

Recursive rules!

Support for recursive rules is one very interesting extension. Given recursive rules, we could use a recursive Datalog to model network connectivity graphs (1)

Reachable(?s, ?d) :- Link(?s, ?d).
Reachable(?s, ?d) :- Link(?s, ?z), Reachable(?z, ?d).

This rule defines reachability in terms of either there existing a link between two points in a graph, or there existing a link between the source point and some intermediate Z which is recursively reachable to the destination point.

The trouble is that implementing recursive rules efficiently is difficult although possible. Lots of fun research material here!

Negation!

You'll notice that basic Datalog doesn't support negation of any kind, unless "positively" stated in the form of some kind of "not" rule.

TwoHopLink(?s, ?d) :- Link(?s, ?z), Link(?z, ?d), ! Link(?s, ?d).

It's quite common for databases to make the closed world assumption - that is all possible relevant data exists within the database. This sort of makes sense if you think of your tuple database as a subset of the tuples in the world. All it takes is one counter-example to invalidate your query response if suddenly a negated tuple becomes visible.

Incremental queries / differentiability!

Datalog is set-oriented! It doesn't have a concept of deletion, or any aggregation operators such as ordering which would require realizing an entire result set. This means that it's possible to "differentiate" a Datalog query and evaluate it over a stream of incoming tuples, because no possible new tuple (without negation at least) will invalidate the previous result(s).

This creates the possibility of using Datalog to do things like describe application views over incremental update streams.

Eventual consistency / distributed data storage!

Sets form a monoid under merge - no information can ever be lost. This creates the possibility of building distributed data storage and query answering systems which are naturally consistent and don't have the locking / transaction ordering problems of traditional place oriented data stores.

The Yak

Okay. So I went and built a Datalog.

Why? Because I wanted to store documentation, and other data.

95 Theses

Who's ready for my resident malcontent bit?

Grimoire

Grimoire has a custom backing data store - lib-grimoire - which provides a pretty good model for talking about Clojure and ClojureScript's code structure and documentation.

https://github.com/clojure-grimoire/lib-grimoire#things

lib-grimoire was originally designed to abstract over concrete storage implementations, making it possible to build tools which generate or consume Grimoire data stores. And that purpose it has served admirably for me. Unfortunately, looking at my experience onboarding contributors, it's clearly been a stumbling block, and the current Grimoire codebase doesn't respect the storage layer abstraction; there are lots of places where Grimoire makes assumptions about how the backing store is structured, because I've only ever had one.

Grenada

https://github.com/clj-grenada/grenada-spec

In 2015 I helped mentor Richard Moehn on his Grenada project. The idea with the project was to take a broad view of the Clojure ecosystem and try to develop a "documentation as data" convention which could be used to pack documentation, examples and other content separately from source code - and particularly to enable 3rd-party documenters like myself to create packages for artifacts we don't control (core, contrib libraries). The data format Richard came up with never caught on, I think because the scope of the project was just the data format, not developing a suite of tools to consume it.

What was interesting about Grenada is that it tried to talk about schemas, and provide a general framework for talking about the annotations provided in a single piece of metadata rather than relying on a hard-coded schema the way Grimoire did.

cljdoc

https://github.com/martinklepsch/cljdoc

In talking to Martin about cljdoc and some other next-generation tools, the concept of docs as data has resurfaced. Core's documentation remains utterly atrocious, and a consistent gripe in the community survey year after year.

Documentation for core gets a higher hit rate than documentation for any other single library, so documenting core and some parts of contrib is a good way to get traction and add value for a new tool or suite thereof.

Prior art

You can bolt persistence à la carte onto most of the above with transit or just use edn, but then your serialization isn't incremental at all.

Building things is fun!

Design goals

  • Must lend itself to some sort of "merge" of many stores
    • Point reads
    • Keyspace scans
  • Must have a self-descriptive schema which is sane under merges / overlays
  • Must be built atop a meaningful storage abstraction
  • Design for embedding inside applications first, no server

Building a Datalog

Storage models!

Okay, let's settle on an example that we can get right and refine some.

Take a step back - Datalog is really all about sets, and relating a set of sets of tuples to itself. What's the simplest possible implementation of a set that can work? An append-only write log!

[[:state "Alaska"]
 [:state "Arizona"]
 [:state "Arkansas"]
 [:city "Juneau" "Alaska"]
 [:city "Phoenix" "Arizona"]
 ...]

Scans are easy - you just iterate the entire thing.

Writes are easy - you just append to one end of the entire thing.

Upserts don't exist, because we have set semantics so either you insert a straight duplicate which doesn't violate set semantics or you add a new element.

Reads are a bit of a mess, because you have to do a whole scan, but that's tolerable. Correct is more important than performant for a first pass!
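
A toy version of that idea, just to make the shape concrete (this is not Shelving's actual implementation):

;; The whole "database" is an append-only log of tuples in an atom.
(def write-log (atom []))

(defn put-tuple!
  "Appends a tuple; set semantics means duplicates can simply be skipped."
  [tuple]
  (swap! write-log (fn [log] (if (some #{tuple} log) log (conj log tuple)))))

(defn scan
  "Reads are full scans over the log."
  [pred]
  (filter pred @write-log))

(put-tuple! [:state "Alaska"])
(put-tuple! [:city "Juneau" "Alaska"])
(put-tuple! [:city "Juneau" "Alaska"]) ;; duplicate, ignored

(scan #(= :city (first %)))
;; => ([:city "Juneau" "Alaska"])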

Schemas!

So this sort of "sequence of tuples" thing is how core.logic.pldb works. It maintains a map of sequences of tuples, keyed by the tuple "name" so that scans can at least be restricted to single tuple "spaces".

Anyone here think that truely unstructured data is a good thing?

Yeah I didn't think so.

Years ago I did a project - spitfire - based on pldb. It was a sketch of a game engine which would load data files for the Warmachine tabletop game's pieces and provide a rules quick reference and, ultimately I hoped, a full simulation to play against.

As with most tabletop war games, play proceeds by executing a clock, and repeatedly consulting tables of properties describing each model. Which we recognize as database query.

Spitfire used pldb to try and solve the data query problem, and I found that it was quite awkward to write to in large part because it was really easy to mess up the tuples you put into pldb. There was no schema system to save you if you messed up your column count somewhere. I built one, but its ergonomics weren't great.

Since then, we got clojure.spec(.alpha), which enables us to talk about the shape of and requirements on data structures. Spec is designed for talking about data in a forwards-compatible way that enables evolution, unlike traditional type systems, which intentionally introduce brittleness.

While this may or may not be an appropriate trade-off for application development, it's a pretty great trade-off for persisted data and schemas on persisted, iterated data!

https://github.com/arrdem/shelving#schemas

(s/def :demo/name string?)
(s/def :demo.state/type #{:demo/state})
(s/def :demo/state
  (s/keys :req-un [:demo/name
                   :demo.state/type]))

(defn ->state [name]
  {:type :demo/state, :name name})

(s/def :demo.city/state string?)
(s/def :demo.city/type #{:demo/city})
(s/def :demo/city
  (s/keys :req-un [:demo.city/type
                   :demo/name
                   :demo.city/state]))

(defn ->city [state name]
  {:type :demo/city, :name name, :state state})

(s/def :demo/name string?)
(s/def :demo.capital/type #{:demo/capital})
(s/def :demo/capital
  (s/keys :req-un [:demo.capital/type
                   :demo/name]))

(defn ->capital [name]
  {:type :demo/capital, :name name})

(def *schema
  (-> sh/empty-schema
      (sh/value-spec :demo/state)
      (sh/value-spec :demo/city)
      (sh/value-spec :demo/capital)
      (sh/automatic-rels true))) ;; lazy demo

Writing!

#'shelving.core/put!

  • Recursively walk spec structure
    • depth first
    • spec s/conform equivalent
  • Generate content hashes for every tuple
  • Recursively insert every tuple (skipping dupes)
  • Insert the topmost parent record with either a content hash ID or a generated ID, depending on record/value semantics.
  • Create schema entries in the db if automatic schemas are on and the target schema/spec doesn't exist in the db.

Okay so let's throw some data in -

(def *conn
  (sh/open
   (->MapShelf *schema "/tmp/demo.edn"
               :load false
               :flush-after-write false)))
;; => #'*conn

(let [% *conn]
  (doseq [c [(->city "Alaska" "Juneau")
             (->city "Arizona" "Phoenix")
             (->city "Arkansas" "Little Rock")]]
    (sh/put-spec % :demo/city c))

  (doseq [c [(->capital "Juneau")
             (->capital "Phoenix")
             (->capital "Little Rock")]]
    (sh/put-spec % :demo/capital c))

  (doseq [s [(->state "Alaska")
             (->state "Arizona")
             (->state "Arkansas")]]
    (sh/put-spec % :demo/state s))

  nil)
;; => nil

Schema migrations!

Can be supported automatically, if we're just adding more stuff!

  • Let the user compute the proposed new schema
  • Check compatibility
  • Insert into the backing store if there are no problems

Query parsing!

Shelving does the same thing as most of the other Clojure datalogs and rips off Datomic's datalog DSL.

(sh/q *conn
  '[:find ?state
    :in ?city
    :where [?_0 [:demo/city :demo/name] ?city]
           [?_0 [:demo/city :demo.city/state] ?state]
           [?_1 [:demo/capital :demo/name] ?city]])

This is defined to have the same "meaning" (query evaluation) as

(sh/q *conn
      '{:find  [?state]
        :in    [?city]
        :where [[?_0 [:demo/city :demo/name] ?city]
                [?_0 [:demo/city :demo.city/state] ?state]
                [?_1 [:demo/capital :demo/name] ?city]]})

How can we achieve this? Let alone test it reasonably?

Spec to the rescue once again! See src/test/clj/shelving/parsertest.clj for conform/unform "normal form" round-trip testing!

Spec's normal form can also be used as the "parser" for the query compiler!

Query planning!

Traditional SQL query planning is based around optimizing disk I/O, typically by trying to do windowed scans or range order scans which respect the I/O characteristics of spinning disks.

This is below the abstractive level of Shelving!

Keys are (abstractly) unsorted, and all we have to program against is a write log anyway! For a purely naive implementation we really can't do anything interesting, we're stuck in an O(lvars) scan bottom.

Let's say we added indices - maps from IDs of values of a spec to IDs of values of other specs they relate to. Suddenly query planning becomes interesting. We still have to do scans of relations, but we can restrict ourselves to subscans based on relates-to information.

  • Take all lvars
  • Infer spec information from annotations & rels
  • Topsort lvars
  • Emit state-map -> [state-map] transducers & filters

TODO:
  • Planning using spec cardinality information
  • Simultaneous scans (rank-sort vs topsort)
  • Blocked scans for cache locality

Goodies

API management!

Documentation generation!

Covered previously on the blog - I wrote a custom Markdown generator and updater to help me keep my docstrings as the documentation source of truth, and to update the Markdown files in the repo by inserting appropriate content from docstrings when they change.

More fun still to be had

What makes Datalog really interesting are the many extensions which have been proposed - among them support for recursive rules.

Negation!

  • Really easy to bolt onto the parser, or enable as a query language flag
  • Doesn't invalidate any of the current stream/filter semantics
  • Closed world assumption, which most databases happily make

Recursive rules!

More backends!

Transactions!

  • Local write logs as views of the post-transaction state
  • transact! writes an entire local write log all-or-nothing
  • server? optimistic locking? consensus / consistency issues

Ergonomics!

The query DSL wound up super verbose unless you really leverage the inferencer :c

Actually replacing Grimoire…

  • Should have just used a SQL ;) but this has been educational

Permalink

Software Developer (Flowerpilot) (m/f) at LambdaWerk GmbH (Full-time)

Join us as a software developer to grow a new product in the agricultural analytics field. Our client is a startup in the United States that is working on a plant identification system based on Ion Mobility Spectroscopy (IMS). We're looking for a lead programmer to work on scientific measurement workflow support systems, mobile applications and embedded system integration to pave the path from prototype to product.

What you'll do:

Develop lab workflow support and demonstrator applications

Evaluate technology and implement prototypes

Integrate systems with existing deployment infrastructure

Support data science development for measurement classification

What we expect from you:

A desire to program

Professional experience with functional programming

Be comfortable with multiple programming languages

Experience with Clojure and the JVM is a plus

Willingness to embrace XML

Knowledge of web technology and internet protocols

Experience with GIT, Linux shell

Ability to communicate efficiently with your colleagues in English (written and spoken)

What we offer:

A nice office in Berlin

International project setting in a dynamic market

A small, focused and experienced team

Lots of interesting technology to learn and use

Training seminars and conference visits

A competitive salary

About LambdaWerk

We are a software development shop specializing in the implementation of data processing systems for healthcare and insurance companies. We are owned by a major US player in the dental healthcare space and our systems play a crucial role in their day-to-day operations. This is a permanent, full-time position in Berlin, Germany, requiring on-site presence; we are not presently offering visa sponsorship. We are very interested in increasing the diversity of our team, so please do not hesitate to apply, especially if you're not a white male!

Get information on how to apply for this position.

Permalink

Lisp Developer, 3E, Brussels, Belgium

See: http://3eeu.talentfinder.be/en/vacature/30101/lisp-developer

You join a team of developers, scientists, engineers and business developers that develop, operate and commercialize SynaptiQ worldwide.

You work in a Linux-based Java, Clojure and Common Lisp environment. Your focus is on the development, maintenance, design and unit testing of SynaptiQ’s real-time aggregation and alerting engine that processes time-series and events. This data engine is Common Lisp based.

The objective is to own the entire lifecycle of the platform, that is from the architecture and development of new features to the deployment and operation of the platform in production environment. The position is open to candidates with no knowledge of LISP if they have a good affinity and experience in functional languages.

Permalink

A Beginner's journey inside the Clojure/ClojureScript Web Dev Ecosystem 1

I want to share with you my progress while I'm working on a web app. This project is mainly for (my) learning purposes, and by sharing it I'm aiming to get your help, advice or criticism. Maybe you are trying to build something similar, or maybe you are just curious about creating something with the mentioned technologies and want to watch a (somewhat) beginner trying his best to find his way through.

I'm still learning and experimenting and trying to get comfortable in the Clojure ecosystem, which is really awesome but has a significant lack of documentation, and you can tell what that means for a beginner.
So, this will be a series of articles in which I will talk about problems I face and good things I discover or build.

The project is about a news social network, “O2SN”, which will have the following features:

  • Getting news by location.
  • News will be ranked using various criteria : author’s honesty, objectivity … all those will be calculated automatically.
  • Seeing (nearly) real-time changes in news rankings (highly ranked news gets higher)
  • Everyone can post a story (which should be true and objective, otherwise the writer’s reputation will be affected, and his future news won’t rank well)
  • In case you want to post a story but a similar one already exists, you can claim it; the two stories (or more) will be merged (their ranks…), and people can see all the versions of the (newly merged) story
  • People can mark a story as truth or lie (which will impact its ranking)
  • When a person marks a story as truth or lie, its properties change according to the person’s reputation and the geographical distance between him and the story’s or event’s occurrence location.
  • And more ...

I created this project using the Luminus template, but I added some other libraries as well, so this is my current toolbox:

 :dependencies [[org.clojure/clojure "1.9.0"]
                 [org.clojure/clojurescript "1.10.238" :scope "provided"]
                 [org.clojure/tools.cli "0.3.6"]
                 [org.clojure/tools.logging "0.4.0"]
                 [org.clojure/data.json "0.2.6"]
                 [org.clojure/tools.trace "0.7.9"]
                 [org.clojure/tools.namespace "0.2.11"]
                 [org.clojure/test.check "0.10.0-alpha2"]
                 [buddy "2.0.0"]
                 [ch.qos.logback/logback-classic "1.2.3"]
                 [cider/cider-nrepl "0.15.1"]
                 [clj-oauth "1.5.5"]
                 [clj-time "0.14.3"]
                 [cljs-ajax "0.7.3"]
                 [compojure "1.6.0"]
                 [cprop "0.1.11"]
                 [funcool/struct "1.2.0"]
                 [luminus-aleph "0.1.5"]
                 [luminus-nrepl "0.1.4"]
                 [luminus/ring-ttl-session "0.3.2"]
                 [markdown-clj "1.0.2"]
                 [metosin/compojure-api "1.1.12"]
                 [metosin/muuntaja "0.5.0"]
                 [metosin/ring-http-response "0.9.0"]
                 [mount "0.1.12"]
                 [org.webjars.bower/tether "1.4.3"]
                 [cljsjs/semantic-ui-react "0.79.1-0"]
                 [cljsjs/react-transition-group "2.3.0-0"]
                 [cljsjs/react-motion "0.5.0-0"]
                 [re-frame "0.10.5"]
                 [reagent "0.7.0"]
                 [ring-webjars "0.2.0"]
                 [ring/ring-core "1.6.3"]
                 [ring/ring-defaults "0.3.1"]
                 [secretary "1.2.3"]
                 [selmer "1.11.7"]
                 [com.arangodb/arangodb-java-driver "4.3.4"]
                 [day8.re-frame/http-fx "0.1.6"]
                 [com.draines/postal "2.0.2"]]

For now, this is the only working part ^ ^

signup form without errors

signup form with errors

It's just the beginning, and I hope I can end up with something that works (maybe a few months from now). Your help will be highly appreciated after two months or so, once I finish the backbone of the project and clean it up.

Finally, this is the project repository on GitHub: O2SN

Permalink

Copyright © 2009, Planet Clojure. No rights reserved.
Planet Clojure is maintained by Baishampayan Ghose.
Clojure and the Clojure logo are Copyright © 2008-2009, Rich Hickey.
Theme by Brajeshwar.