Toys for Coders

Anyone here remember Lincoln Logs or Tinker Toys? I recall many fun hours building things with Erector sets (metal girders, plates and parts held together with small screws and nuts). We had a merged set of 3 kits which was stored in a steel carrying case. It had an AC powered motor with a 2-speed gearbox and a high speed shaft output. My brother and I built an Erector elevator that went up and down next to our bunk bed. Of course you all know Lego blocks.

What do these have to do with dataflow? Well, those toys all have a UAPI (universal API, more on that below) which allows their various parts to be interconnected. There are many other toys like these and they all need a UAPI connection design. But just like the shell tools and dataflow modules, there are good and bad ways to connect toy components. You build a wall with similar parts connected in a way that will make a wall. It seems obvious to say, but the point is important: just because you can connect anything in any which way doesn't mean the different configurations are all equally valuable. If you want an elevator, you have to use the given components with a plan to make it. The plan is the application architecture, and then you implement it by connecting the parts together as designed.

Dataflow: What is it? Why should you care?

This article is an introduction to dataflow software architecture. Dataflow is a way of organizing software components (called nodes here) so that they don't call each other directly but instead send data to each other. A dataflow system has several advantages over call stack software, including built-in networking and distribution, dynamic configuration, high isolation, and easier testing, debugging, tracing and maintenance.

A dataflow system has several primary features. First, all nodes communicate with each other by sending data, not via direct calls. Second, nodes send and receive data via a universal API (see below). Finally, the configuration of dataflow nodes is done externally from the nodes themselves. Nodes are only configured with an address (or more than one) to which they send data. A node just receives and/or sends data, but it has no knowledge of or direct access to other nodes in the application.

This article will take you on a long strange trip through fairy tales, Ghostbusters, nightmare scenarios, toys, analogies, meta-dynamic changes, diabolical dichotomies, neurons, heaven and hell, pipes and ancient (by computer standards) history.

This vs. That

You will find many sections in this article titled with This vs. That. The reason is to emphasize the choices you have to make when designing and coding. Perl has a motto: there is more than one way to do it. This doesn't mean all the ways are equally good. You could write a Turing machine in BASIC and have it emulate Linux running on a SPARC, but it would be a somewhat poor choice.

Even something as simple as looping over an array has choices. In many languages the standard way is to loop over the index values and access the array element via the index variable. In Perl, the better choice is a foreach loop directly over the array, where the loop variable is aliased to each array element in turn. This is a simple choice, but the number of times I have seen C style loops over Perl arrays is astounding. There was no conscious choice being made there, just a default to some way the coder knows will work for looping over an array. Too many coders are just happy to get something that seems to work and neglect making any choices.
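To make that concrete, here is a small sketch (my own illustration, not from any particular codebase) of the two looping styles side by side:

    use strict;
    use warnings;

    my @words = qw( data flow node address );

    # C style: loop over the indices and subscript the array each time
    for ( my $i = 0 ; $i < @words ; $i++ ) {
        print "index $i: $words[$i]\n";
    }

    # Perl style: foreach aliases $word to each element in turn,
    # so you can even modify the elements through it
    foreach my $word ( @words ) {
        $word = uc $word;
        print "$word\n";
    }

The foreach version says what it means (visit every element) and skips the index bookkeeping entirely.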
Think of it this way: code is the record of your logical decisions and choices when analyzing and solving a problem. The goal is to make a conscious choice from the coding options available to you. Learn the variations in the language you use and choose the best one for the given situation. That is why there are while/for (C style) and foreach loops in Perl, as well as statement modifiers. Each has its uses and best practice locations. If you don't know all the looping styles, you are limiting your coding view.

There is Nothing New Under the Sun (workstation)

Dataflow isn't a new concept. I am just repackaging it in a Perlish context. This is true for so many things in software. What you may think is a new concept in coding has likely been done before, but you didn't know it. Lisp was created in 1957 and is the grandparent of so many language concepts. PL/I was called an early Perl by Mark Jason Dominus, and I agree with that conceit. It has varying length strings, dynamic allocation of memory for structures, built-in record I/O (dbm anyone?) and more. There is a reason people study history in every field: you can learn quite a bit from it, including avoiding mistakes. Larry Wall stole/borrowed/learned from many languages and tools when he first created Perl. That historical knowledge led him to choosing the set of features that let Perl be UNIX in a language. To paraphrase Santayana, those who don't study the history of code are condemned to recode it (badly).

External Configuration

In most code, calls to outside modules are made directly. Changing the call stack involves changing the actual code, and that can also entail changing unit, integration and system testing. In dataflow, modules never directly call each other but instead are configured with the addresses of where they will send data. This configuration is external to the actual code in dataflow modules, and it effectively decouples the modules from one another.

Universal API (UAPI)

Though the term UAPI may be new to you, the concept is well known to you, though you may not know from where. The classic set of shell tools which use stdio and pipes to communicate is a UAPI. Any program which can read and/or write using the stdio handles can communicate with any other program. And there are hundreds (more likely hundreds of thousands in total all over) of these stdio tools. Beyond the agreement to use stdio, many of these tools have requirements on their input and output. You can't just connect any two of them together and expect a useful result. Some, like grep and similar filters, can go almost anywhere. Others are more specialized and handle or generate specific formats. Most of the tools work on plain text, but there is no requirement for that, as some (e.g. image munging) can handle binary data.

But stdio tools and pipes have several major drawbacks. The I/O is unidirectional, a pipeline is synchronous, you can't fork the data stream (tee only writes to a file, which doesn't really count), and it isn't easily extended off the CPU (yes, ssh can do some of that). This is one of the major analogies here: dataflow has a UAPI which is very similar to that of the shell tools. Any node can send data to any other node. And, as with the shell tools, not every combination of dataflow nodes will make sense, so it isn't universal in that dimension. But the drawbacks of the shell tools are gone: you get data forking, asynchrony and distributed workloads.

Glossary

I will be using the term node to mean a dataflow node. It may be an object but it doesn't have to be (implementation dependent).
A dataflow node is some code loaded from a module and given a unique address. It will also be given the addresses of where it sends data and any initialization arguments it needs. Those come from the configuration, which has this information for all the dataflow nodes in the system.

A configuration is a set of data that can be used to create a dataflow system. It has a list of the nodes to load, their own addresses, the addresses they will send to, and initialization arguments. It is important to note that this is just some data structure that can be edited, stored and loaded. This matters later when we get to dynamic configuration.

Data is anything you send from one node to another. It can be a stream of text or media, an individual request, status information or anything else. The key point is that this is data, not objects, code or anything some node might not like. As mentioned elsewhere, you can't just send data to any other node - both sides must be in general agreement on the format of the data.

A stream is a series of data packets going from one node to another. A stream is set up with just two nodes, one of them having the address of the other. Streams are inherently unidirectional, but a bidirectional setup is easy to do by just having both nodes know the other's address. Or data can be sent from one node to another and data can be sent back to the sender (just like replying in email). As with shell tools, any node could send data to any other, but they should be in agreement regarding the format of that data. A data stream could be just plain text (as with most shell tools), but it could also be a command or a response (as in HTTP or similar protocols). Also, the data is in packets, so they could be some data structure and not just text. You are not limited to any particular format as long as the sender and receiver agree on what is being sent. Also, data is never really a true stream. At the TCP level streams are packetized, and in dataflow they will also need to be packetized. A single call to send data has to know its size. A stream is just a sequence of data packets, each carrying some chunk of the total data stream.

Hardware Dataflow vs. Software Dataflow

The first influence was Arvind (yes, that is his full name), a post-doc whom I met at MIT when I was an undergraduate. Arvind focused on hardware dataflow and parallel languages. Hardware dataflow is essentially asynchronous logic, such as the 7400 series of basic integrated circuits that I was studying at the time. Today most ICs are synchronous, which means all the data between sections is clocked into registers so the next section will know it has clean data signals on its input ports. Asynchronous logic (i.e. dataflow) doesn't have those registers, which makes it faster and lower in power usage (no clock lines all over connected to all those registers) but harder to design. I read some of Arvind's work in this area and it has influenced much of my work since then. I have passed data structures around (a simple form of dataflow) in an RTOS in PDP-11 assembler (still my favorite CPU architecture!), a crawler in C (for Northern Light), a crawler in Perl and other projects. I don't recall consciously using the dataflow ideas, but I am sure they were somewhere in my brain.

Module Strength vs. Module Coupling

Around the time I met Arvind, I was also reading several books by Glenford J. Myers. Myers has written several quality books on software and hardware, and I recommend you find used copies and read them.
One of those books, "Software Engineering Through Composite Design", covers his concepts of module strength and module coupling. Module strength is about measuring the quality of a module's API. Read-only arguments and a write-only return value (like the common trig functions) make for a very strong module. Weaker modules use read/write or pass-by-reference arguments. The weakest modules use shared memory. Remember, Myers was writing at a time when Fortran was still very big and COMMON data (nasty global spaces) was (ahem) common. Module strength is all about insulating a module from being misused or abused by outside code.

In the other dimension, module coupling looks at how modules interact with each other. A simple call passing arguments couples modules relatively cleanly; shared memory, once again, is the worst way to couple modules. I look at dataflow as an extension of those gradings. Dataflow modules don't even use call stacks to connect to each other (internal code in a dataflow node does use a call stack). The nodes don't even know what module will be getting the data they send out; they only know the address from the external configuration. I would say that is the loosest possible coupling! And dataflow module strength is very high: the modules are very isolated from each other and have a simple read-only argument coming in (the data) and usually only one write-only argument going out (outbound data). A dataflow module could send out more than one piece of data, but they are all write-only from the perspective of the sender.

The third influence came when I was consulting with Akamai in the period just before their IPO (and I didn't get any stock! :( ). What I found there was a bunch of kids (as in 25 and under) reinventing all sorts of networking tools, all different and poorly done in Perl. At some point I proposed an idea to management to redo most of those tools to use a common communications framework, which would have been a major win for them. Unfortunately, my ideas didn't get developed and we soon parted ways. So after that I set to thinking about how such a general purpose communication framework would look. I took the idea of strong and isolated modules from the composite design concept and went one step further. The modules would communicate not via direct calls but by sending data to each other. This made it possible to create the application from a configuration file which loaded modules, created objects and assigned them addresses. The communications framework then took over and allowed data to flow between the objects. I was already very skilled with event loops, so my prototype (more on this below) used an event loop as its core and used message passing to implement dataflow.

Interestingly, Myers' book Advanced Computer Architectures (3rd ed.) has a chapter on hardware dataflow, but he didn't bring that idea back into his software work. I suggest you google for his books and I hope they reward you as they did me. They are somewhat out of date but still have major concepts worth knowing.

Architecture vs. Coding

Object, aspect, waterfall, agile, spastic, fireplace, pear (sic) coding, stand-on-your-head meetings, procedural, functional, top down, bottom up, inside out, etc. Dataflow is a language independent architecture (see more below on that). It is all about the interconnections of modules.

Design vs. Implementation

Asynchronous vs. Synchronous Logic

Event Loops vs. Threads

A major design issue with complex systems is how to multitask.
The two major techniques are event loops and (kernel or language) threads. I have a big bias for event loops over threads, as I have done several major projects with event loops. In fact, my first serious production project was an RTOS written in PDP-11 assembler that was effectively an event loop as well as having dataflow features. But a better comparison is to explore the strengths and weaknesses of each of them.

Threads have one major win - blocking. They can block on an I/O or system call and let another thread (or process) continue. Threads also have a shared memory space, which is both good and bad - data needs to be locked (with a semaphore or similar) before it can be modified. Managing threads is also tricky: spawning them, semaphores, communication and merging them when they are done. Also, threads bring nothing to the table regarding sharing data outside of their box. Communication between local threads is very different from communication with remote threads.

Event loops in general don't have those problems, but their big one is blocking. A blocking operation will stop the whole event loop from continuing, and that is very bad. Event loops do have shared memory, but they don't need any locks or semaphores, as only one logical thread is running at a time. Also, sending data between components in an event loop is similar to sending data to another system.

Wouldn't it be nice to have the flexibility of event loops doing the main work while using threads for blocking operations? Well, it can be and has been done. One project I had was writing a major web crawler for Northern Light (back in the days of AltaVista and way before Google). I created an event loop system in a single process that could fetch 2000 web pages in parallel. It worked very well, but when it was required to do reverse DNS lookups I ran into a snag. Those lookups were blocking, as the code called a library which in turn made calls to DNS. I did come up with a neat (IMNSHO) solution that let kernel threads (this was in C) do those blocking calls while staying integrated with the event loop. I will leave it to the reader to figure out the solution; my design is covered at the end of this article. Note that there is a CPAN module (not mine) which does this as well, but I have a better design in mind that would be much more flexible and easier to extend with new services.

Overhead vs. Flexibility

I am sure some of you are already screaming about the overhead of having all data be passed through a UAPI. Of course direct calls are faster, but dataflow is more flexible. The next step is to see which of those two characteristics is more important to the project. I will choose flexibility, as you can almost always speed things up with faster code or more CPU, but you usually can't retrofit flexibility into an existing system. If you want to look at the cost issue, let's go back to the days of timesharing on mainframes. Each user had all their CPU and other resources accounted for, and you had to be granted some form of budget to pay for it. You effectively rented CPU time because it was so expensive to own a computer. Thousands of users may have shared one system with batch and interactive programs. We think things have changed, but other than the cost, we still pay for CPU time. Clouds, co-location racks and internal server farms all have a cost. The real change is in the cost of the humans writing the code for those boxes.

Hardware Dataflow

To Be An Object vs. Not To Be An Object

That should be the question.
I first learned about objects in a class at MIT taught by Professor Barbara Liskov, one of the pioneers of OO software. This was before any languages had proper OO features, and we used PL/I as it was available and had the ability to declare and allocate data structures. This was the key feature: the ability to collect related data into one structure. The other major feature was to collect the code that knew how to handle that structure into one module. You need to understand how much of a revelation this was. Code was ruled by individual variables. Structures were usually declared, but any code could mess with them. Code was king and data was just the peasants to be ordered around. With OO, data became king and code was now the supporting court.

I took those lessons to heart, as many of my projects have been data driven. Data structures were created and passed around via pointers, and the code was much simpler as it usually only modified that structure. It dramatically reduced the use of globals as the data became localized. In my first commercial project, I wrote an RTOS in PDP-11 assembler (still my favorite CPU architecture). It used those basic OO principles by allocating control and data structures and passing them from input drivers to processing code and then to output queues. Yes, you can do OO in assembler if you organize it that way. Since then many OO features have come along, but they are mostly bells and whistles and syntactic sugar. Methods, polymorphism, inheritance and others can be emulated if you don't have them. But passing around data can't be done if all you have are globals and common space.

This brings us to the data passed around in dataflow. It is data, and not objects, for a major reason. Since a node will send out data but not know what node will get it, it can't send an object. The receiving node may not have the class code for that object available. But if it is just plain data (likely a buffer or a structure), the receiving node doesn't have to have any special code to handle and process that data. On the other hand, nodes will likely be objects themselves. This allows for polymorphism, where different modules can be used to implement a node but keep the same API regarding the type of data they expect to receive.

So let's get back to the title of this section. A case I saw was where a main object had several subsections of data inside it. The subsections were generally isolated from each other and they were not reused, so they didn't need to be made into objects - but they were. More of the code was spent declaring object data and methods and such than on the actual code doing the work. And since there were times when a subsection needed some data from another subsection, it had to climb up the object stack and then call down into a sibling object, which created very fugly code. My design would have been to leave those subsections as hashes under the main object, and then access would have been easy and clean. Data doesn't always need to be protected from internal access by methods and accessors.

My point here is that too often, making something into an object is the default decision. In some cases that choice may be detrimental to the design. The most common case is with singleton objects. They can usually be implemented with procedural code and lexically isolated data just as easily as making them into a class or a single instance object. There usually isn't any benefit to making it into an object, but everyone does it.
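As a hedged illustration (a made-up 'counter' service, not any real module), here is that kind of singleton behavior done procedurally with lexically isolated data:

    # Counter.pm - a procedural "singleton" with no class in sight
    package Counter;
    use strict;
    use warnings;

    # the state lives in a file-scoped lexical, so nothing outside
    # this module can touch it except through these subs
    my %counts;

    sub bump  { my ($name) = @_; return ++$counts{$name} }
    sub count { my ($name) = @_; return $counts{$name} // 0 }

    1;

A class with new(), accessors and a stashed instance would do exactly the same job with more ceremony; the lexical hash already gives you the isolation a singleton object is supposed to provide.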
My CPAN module File::Slurp is purely procedural, but other versions of it are objects with slurp methods. I feel simpler is better here.

You Can Cross the Streams!

Dataflow is not Ghostbusters, so you can and should cross the streams. Here are two scenarios.

You have an existing call stack in an application and you want to trace and replay data at a certain point in the stack. This will involve rewriting the code there so it can copy all of its data to some output location, read that data back and make calls down the stack, and also just pass down calls from above. There are existing trace and replay systems, but they usually work on a wire where you can more easily interpose this code. There are HTTP proxies that can do this for browsers, and similar things elsewhere, but few do it inside a call stack.

The other scenario is where you have data in a call stack and you want to also send it elsewhere - maybe copying it or just redirecting it, possibly to multiple destinations. This is doable, but you may need to handle options to control when and where this redirection happens. The code will have to know all about the call stack and the APIs of the destinations.

Those are the beginnings of nightmare scenarios - they can be done with some good coding. Now, what about doing those in multiple places and with more destinations? You have a full blown nightmare now. And to relieve your dark dreams, I will tell you that dataflow makes those problems go away with little effort. Because there is no traditional call stack between nodes, you can interpose a multiplexing/switching node anywhere you want. This node can receive data, duplicate it one or more times and send it on to different destination nodes. It is actually fairly simple code to write, as it just duplicates data and resends it with different addresses. How data is multiplexed is controlled just by maps inside that switching node, and those can be configured at load time and changed at run time. You can dynamically route data wherever you want. This node can be used for broadcasting a single stream to multiple locations, publish/subscribe designs, and funneling (merging multiple incoming streams into one output stream). Imagine the fun you will have when you can just cross those streams and not be scared of doing it!

Email as Dataflow

Here is another good analogy for you to absorb - dataflow addresses are very similar to email addresses. They are globally (within your dataflow system) unique and have a hierarchical naming space. Email addresses have the two primary parts of name@example.com, but the domain can have subdomains (you don't see that much now, but in earlier mail history it was more common). The subdomain would locate which mail server to use inside the top level domain's network. In dataflow, an address can be used to locate a destination node, but a sub-part of that address could also identify an individual node in a set of related ones (which share a higher level address part).

Old Fashioned Threaded Code

I bet you didn't know that the term 'threaded code' was in use before kernel and language threads became popular. And the concept is still widely in use, just not called threaded code. The old usage was where your main code body is a set of subroutines and the logic flow is just a sequence of calls (the threading) to those routines. This was something easy to code up, and easy to generate the logic flow for.
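A minimal sketch of that old style of threading (the subroutine names here are invented for illustration): the "program" is just a list of code refs that get called in order.

    use strict;
    use warnings;

    sub read_input   { print "reading input\n" }
    sub munge_data   { print "munging data\n" }
    sub write_output { print "writing output\n" }

    # the logic flow is itself data: an ordered list of subs to call
    my @thread = ( \&read_input, \&munge_data, \&write_output );

    $_->() for @thread;

Change the flow and you edit the list, not the subroutines.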
It is still used by many interpreters, but they loop over the op code tree instead of directly executing the main logic. I bring this up because later on this concept is used to handle asynchronous flow control.

High Isolation

The sections on module strength and coupling emphasize module independence. Dataflow modules strive to reach that goal by not having direct calls and not loading modules which aren't needed. This gives code high isolation from other code, which is a big win as it makes for very loose coupling. A given module doesn't know or care about outside code, so it can be designed, maintained, rewritten and debugged in isolation, which makes for a much easier to manage system.

Dependency Heaven

I am sure you know about dependency hell. It is where a module or application loads what seems like half of CPAN and is delicately balanced on that stack of code. One mistake installing down below and the whole stack comes crashing down. The length of the CPAN installation output feels like it is in gigabytes! Not fun. Well, dataflow creates a dependency heaven instead. By eliminating the inheritance and direct calls to modules, you don't need to install the world just to get something installed and running. Dataflow modules are designed to be independent of each other, and you only need to install and load the actual modules you use and not everything that could be used. This should give you a warm and fuzzy feeling you have never gotten before from software.

Interpose at Will

Here is another painful scenario. You have a call stack and you want to insert some more processing (e.g. filtering, formatting) at a certain point. Of course this is doable, but if you use another sub for this you are just making your call stack even bigger and harder to manage. The code you write to insert this new code is custom and not reusable. Your testing code for this subsystem will likely need as much changing as the code itself. And as before, wait until you need to do this in multiple places.

As with the switching/multiplexing node covered elsewhere, you can easily interpose a dataflow node between any others in your network. The surrounding nodes don't need to change any code, and your testing system may need only some minor changes. You can even have two versions, with and without the extra processing, just by having two different configurations. Or you could have the extra code switched in and out by using that switch node. This is one of the essences of dataflow: total control over what data goes where. Nodes do work, but where data flows is controlled by the configuration.

Unit Testing

When you have highly isolated modules, it is much easier to do unit testing. You don't have to worry about other code in the stack affecting the module under test. You don't need to load up the whole stack to test this module. You can do more comprehensive testing because only the code in the module under test is being run, and you could even do a full coverage test (if it is Goldilocks sized! :). Dataflow modules also can be tested without much scaffolding (no mock nodes or complex testing structures). What you do is create a specific module to test this module. The testing module can generate and receive all the data received and sent by the module under test. A simple configuration needs to be created that loads the testing and tested modules with addresses that connect each to the other. The testing module generates data which is sent to the tested module. Data sent back is received and checked for correctness.
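A sketch of what that pairing could look like; the configuration keys, module names and the faked round trip here are invented for illustration, since the article does not pin down a concrete framework:

    use strict;
    use warnings;
    use Test::More tests => 1;

    # hypothetical two-node configuration: the tester and the node under
    # test point their output addresses at each other
    my $config = [
        { module => 'My::Tester',  addr => 'test/driver', send_to => 'app/upcaser' },
        { module => 'My::Upcaser', addr => 'app/upcaser', send_to => 'test/driver' },
    ];

    # a real harness would load $config and start the flow; here we fake
    # the round trip just to show the shape of the check
    my $sent     = { text => 'hello' };
    my $received = { text => uc $sent->{text} };    # what My::Upcaser should send back

    is( $received->{text}, 'HELLO', 'upcaser node returns upper cased text' );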
The classic ok/not ok (TAP) protocol can be used here, so the standard Test::* modules work too.

Integration Testing

Testing a subsystem of a large complex application can be a very difficult process. The big problem is finding a clean boundary which encompasses the subsystem and creating a testing structure that can drive it and check for correctness. As mentioned elsewhere (in Interpose at Will), if you need to change that boundary (adding new features, layers, etc.) it can become a major pain.

With dataflow, you can select any subset of nodes (connected by addresses) to be a subsystem under test. The same concept used in unit testing can be used for a subsystem. All you need to know is what data is being sent to and received from the subsystem, and you can create a testing module for it. Just as with unit testing, there needs to be a configuration that loads both the subsystem modules and the testing module and sets the addresses so the subsystem is only communicating with the testing module. Adding more features or layers to the subsystem is easier. You can even leave the existing testing module and its configuration as is, but clone them and edit the copies to load and test the new subsystem. You can choose any boundary in the full application to be a subsystem under test - that alone will make some of you very happy.

Goldilocks Sizing

Yes, I said a fairy tale would be part of this, and it is a pretty good way to teach this point. A crucial element when designing software is how large some element should be. An element can be a sub, a loop, a module or even a subsystem. The size is related to the level and to other elements at that level. We have all seen monstrously large systems which will engulf half of CPAN. Then there are loops and subs that are pages long, or maybe one line doing too little. Size matters (in code). The real question is how you can tell when something is the right size. One way is to see if the feature set is solid and does what is needed and no more. Is the element generic or custom oriented? You want to keep generic ones smaller as they are more easily shared. Some of this will only come with experience and reading lots of code. And it will require discipline to keep a module the right size once you have gotten it there.

A case history of a major failure will illustrate this principle. I consulted at a firm which had a complex application with many variations and combinations of possible features for each customer. They were writing custom code for each situation. I proposed they analyze all the variations and factor out any common things (of which there were plenty, IMNSHO). Then they could write threaded code or an interpreter to execute the common stuff, as well as a way to call custom code. They said they had tried that already. But the coders kept adding stuff into the generic code to make it custom for a user. They broke the Goldilocks principle, and they still ended up with massive amounts of unshareable custom code, just buried inside large individual operations in this system. The new system was not a win, as it was just calling complete custom code for any situation. They showed no discipline in keeping generic code generic. They had an epidemic of creeping featuritis. The Goldilocks principle is meant to guide you in keeping things the best size for their purpose. But you have to be on guard lest you sleep in the wrong bed or eat the wrong porridge.

Dynamic Configuration

Remember earlier when I said that a configuration is just a data structure? Well, this is where it pays off in gold.
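To keep that concrete, here is one possible shape for such a configuration (the keys, addresses and module names are invented; a real framework would define its own):

    use strict;
    use warnings;

    # a configuration is plain data: it can be dumped, stored, diffed,
    # generated by a program or shipped to another machine
    my $config = {
        nodes => [
            { module => 'App::Reader', addr => 'app/reader', send_to => [ 'app/parser' ],
              args   => { chunk_size => 8192 } },
            { module => 'App::Parser', addr => 'app/parser', send_to => [ 'app/writer' ],
              args   => { format => 'csv' } },
            { module => 'App::Writer', addr => 'app/writer', send_to => [],
              args   => { path => '/tmp/out.dat' } },
        ],
    };

    print scalar @{ $config->{nodes} }, " nodes configured\n";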
You can auto-generate a configuration and start a dataflow application with it. A configuration can be pushed to a remote system (via dataflow, of course!). Or a remote system can boot a simple configuration which then requests a configuration from elsewhere in the system. Your system can dynamically change as its load and needs change. You can even add new configuration to a running dataflow application and have it apply the additions on the fly. It could change things or just add services. Dynamic configuration in dataflow is like a weather control system for your compute cloud. Here is another idea - say your live application has had its configuration changed since it was started up. It is easy to dump and store that existing configuration and use it at some future time when you restart the application.

Plugins are Not Allowed

A plugin is a set of subs/modules which use a common API, where the main code selects one of them to call or use based on some key. The simplest plugin is the classic dispatch table - a hash of keys with code references as values. The module level plugin is commonly done with OO, where the different modules offer the same methods. This is also called polymorphism. So why does dataflow not allow plugins? Well, because in dataflow nearly every node/module is already a plugin. Since they already use the UAPI, you can select from a set of nodes with similar services. Choosing which one to use can be done in a hash or in a multiplexor node (see You Can Cross the Streams! above). So it is more a case of not needing a plugin layer or special wrappers to support plugins.

Perl6 and Dataflow

Language vs. Language

No, I am not going to bring up any language wars. I will state that dataflow is a way to end those wars! We all know how great CPAN is, but sometimes someone using a language other than Perl comes up with a cool module. Hell, it might even be in Perl 6 (which is supposed to one day be able to run Perl 5 code)! Anyhow, there are ways for some languages to be embedded in and called by another. No big whoop there. But just picking up a foreign module and plopping it into your system is not something you do whenever you want to. There may be special wrapper code you need to write to match the calling style, or data conversion to handle, or other considerations. Of course you can just bolt that module onto some messaging backbone like RabbitMQ, but then you require that module to be in another process. That isn't really using multiple languages in one process anymore.

With dataflow this becomes much simpler. Since each language can have its own library to interface with the core dataflow system, the issue of calling and handling data is done once. Any module then written to use dataflow in either language can be used in one system. The only other issue is the runtime, and if both languages can be compiled to the same engine, such as the JVM, then you are in business. You now can mix and match any dataflow module in any supported language. They can be loaded in the same process, run on different boxes or be across the net. This is real language independence. Everybody wins!

Multiple Development Layers

- module API design
- implementing modules
- creating applications

Deconstructing grep

To illustrate how you can take a larger application and break it down into dataflow components, I will deconstruct the well known grep utility. Grep loops over lines and runs a regex against them. It has (too) many options and we will only cover a few of them here.
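For contrast with the decomposition that follows, here is roughly how small a monolithic mini-grep can be in Perl (a sketch of my own, ignoring options, context lines and the rest):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # usage: minigrep.pl PATTERN [FILE ...]  (reads stdin if no files given)
    my $pattern = shift @ARGV or die "usage: $0 pattern [files...]\n";

    while ( my $line = <> ) {
        # $ARGV is the current file name, $. the current line number
        print "$ARGV:$.:$line" if $line =~ /$pattern/;
    }

Small as it is, every option the real grep grows (line numbers, -v, -l, context) gets welded into that one loop, which is exactly what the deconstruction below avoids.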
The dataflow components described here are very small, and you wouldn't break things down this far in a real project, but they are good for illustration purposes.

The first component is a line extractor (I am skipping how the file is read into a buffer or how infinite streaming would work). It has a buffer of a file, and it just pulls out the next line and sends it to the next section. It adds some information to the line, such as the line number and the file name, so a structure is sent and not just the line. This component could easily be reused in any other application that needs line by line processing.

The next component is the regex tester. It is configured with the regex (and it could be changed at run time!). All it does is test the line against the regex and add a boolean to the data structure. It can also add the offsets of the match (if it matched) to that structure. It sends this augmented structure to the next component.

This component is the actual filter itself. It is configured with a boolean that says whether to pass matched or unmatched lines (like the default or the -v option in grep). It can be configured to also do context by storing lines before and after a match and adding those to the structure.

Another component can handle the option to only output file names which had a match. As an optimization it can send data back to the line extractor to stop processing the current file and move on to the next one. This is a simple example of a feedback loop, which is easily created with dataflow. Another issue not addressed is throttling and buffering. This can be handled by another feedback loop or even by letting the system buffers do the throttling. As this is not a real world example, I won't go further here.

Those components are small, very focused and easily written and tested. Some can be reused in other applications, and you can quickly replace them in an existing application. At run time you can even change behavior (if you design your components that way), which is almost impossible with a monolithic utility like grep. When testing the actual grep utility, all the possible combinations of options and features need to be covered. That is a pretty difficult problem to solve, and it shows why breaking something complex down into simpler components is a big win. The key here is that each component is isolated from the others and can be developed and tested independently. If you modify the grep utility you may not realize what ramifications a change has, but when you modify a simple isolated component there are far fewer possible issues to deal with.

Solution to Event Loops and Threads

I am glad you made it to this part, where I reveal my design for integrating event loops and threads. The key is a pipe that is used only inside the single process (pipes are almost always used between processes). On the event loop side of the pipe, you create a request in a data structure and then write a pointer to that structure into the pipe. The other side has a farm of threads which each read a whole pointer from the pipe (you could create a pipe per thread too). One thread will read the pointer, take the data structure and perform the request. It can block all it wants now. When it is done, it puts the results into that structure and writes the pointer back into the pipe from the thread side. On the event loop side, that pipe handle is tracked with a readable event. When it becomes readable, that means a result pointer has been sent back from a thread. It gets read, and a callback is made to the original requester, which then processes the results.
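The original was done in C with raw pointers in the pipe; you can't usefully pass pointers between Perl ithreads, so here is a rough Perl analog of the same idea (assuming a threads-enabled perl), using Thread::Queue for the payload and a pipe purely as the wake-up signal for the event loop. The worker count and the reverse DNS example are my own choices, not from the article.

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;
    use IO::Select;
    use Socket qw( inet_aton AF_INET );

    my $req_q = Thread::Queue->new;    # event loop -> worker threads
    my $res_q = Thread::Queue->new;    # worker threads -> event loop

    # the pipe only carries wake-up bytes so select() can see "result ready"
    pipe( my $wake_r, my $wake_w ) or die "pipe: $!";

    my @workers = map {
        threads->create( sub {
            while ( defined( my $ip = $req_q->dequeue ) ) {
                # blocking is fine here - it only stalls this worker
                my $name = gethostbyaddr( inet_aton( $ip ), AF_INET ) || 'unknown';
                $res_q->enqueue( "$ip $name" );
                syswrite( $wake_w, 'x' );          # poke the event loop
            }
        } );
    } 1 .. 4;

    $req_q->enqueue( '8.8.8.8' );                  # a request from the event loop side

    my $sel = IO::Select->new( $wake_r );
    while ( $sel->can_read ) {                     # stand-in for the real event loop
        sysread( $wake_r, my $junk, 1 );
        while ( defined( my $result = $res_q->dequeue_nb ) ) {
            print "reverse DNS: $result\n";        # the "callback" to the requester
        }
        last;                                      # demo: quit after the first result
    }

    $req_q->enqueue( (undef) x @workers );         # tell the workers to shut down
    $_->join for @workers;

Thread::Queue handles the locking; the pipe exists only because select() style event loops can watch file handles, not queues.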
So now you can have an event loop that hands blocking operations to a thread while it continues running.

A Biblical Parabola vs. A Quadratic Story

In the beginning there was a void called Jersey, wherein lay the labs of Bell. Dwelling in the darkest dungeon there were two intelligent designers. They had a kernel of an idea, which was the idea of a kernel! And they called it UNIQUES because it was the unique kernel that was multiplied over the 9-track highway. And the kernel users saw this was good and rejoiced. The UNIQUES kernel begat init(1), which then begat all the known processes in the world, including a protective shell. And the users gave their commandments to the kernel, and they were executed and obeyed. Some users got tired of saying the same commandments over and over, so they engraved them as scriptures on papyrus cards and disks. Today, many of those commandment scriptures are written on tablets. Eons (weeks) later the users whined to the intelligent designers: we want our process children to be a family and not so separated. The intelligent designers put that whine into a pipe to smoke it, and they found that putting data into a pipe was even better! Thus a simple form of IPC, and one of the most popular UAPIs, was creat'd (sic). The users smoked a lot of data in pipes for generations and really rejoiced. Then demons arose in the kingdom of Berkeley who wanted to control their daemons. One hippie among them wanted some data sent to him and he said "Sock it to me!!". And that is how sockets were born! Larry sed, with a lisp, "'Grep' and 'awk' are the sounds I make when I C data flow." So he munged all that together and made the data munger he called Perl.

In Conclusion

If you are interested in talking about dataflow, please contact me. If you have a dataflow type project, contact me today! If you have funding for a dataflow type project, contact me yesterday! I am available for Perl work, so let me know if you have any openings.

Uri Guttman
uri@perlhunter.com
781-643-7504