Toys for Coders

Anyone here remember Lincoln Logs or Tinker Toys? I recall many fun hours building things with Erector sets (metal girders, plates and parts held together with small screws and nuts). We had a merged set of 3 kits which was stored in a steel carrying case. It had an AC powered motor with a 2-speed gearbox and a high speed shaft output. My brother and I built an Erector elevator that went up and down next to our bunk bed. Of course you all know Lego blocks.

What do these have to do with dataflow? Well, those toys all have a UAPI (universal API, more on that below) which allows their various parts to be interconnected. There are many other toys like these and they all need a UAPI connection design. But just like the shell tools and dataflow modules, there are good and bad ways to connect toy components. You build a wall with similar parts connected in a way that will make a wall. It seems obvious to say, but the point is important: just because you can connect anything in any which way doesn't mean the different configurations are all equally valuable. If you want an elevator, you have to use the given components with a plan to make it. The plan is the application architecture, and then you implement it by connecting the parts together as designed.

Dataflow: What is it? Why should you care?

This article is an introduction to dataflow software architecture. Dataflow is a way of organizing software components (called nodes here) so that they don't call each other directly but instead send data to each other. A dataflow system has several advantages over call stack software, including built-in networking and distribution, dynamic configuration, high isolation, and easier testing, debugging, tracing and maintenance.

A dataflow system has several primary features. First, all nodes communicate with each other by sending data, not via direct calls. Second, nodes send and receive data via a universal API (see below). Finally, the configuration of dataflow nodes is done externally from the nodes themselves. Nodes are only configured with an address (or more than one) to which they send data. A node just receives and/or sends data, but it has no knowledge of or direct access to other nodes in the application.

This article will take you on a long strange trip through fairy tales, Ghostbusters, nightmare scenarios, toys, analogies, meta-dynamic changes, diabolical dichotomies, neurons, heaven and hell, pipes and ancient (by computer standards) history.

This vs. That

You will find many sections in this article titled with This vs. That. The reason is to emphasize the choices you have to make when designing and coding. Perl has a motto: there is more than one way to do it. This doesn't mean all the ways are equally good. You could write a Turing machine in BASIC and have it emulate Linux running on a SPARC, but it would be a somewhat poor choice.

Even something as simple as looping over an array has choices. In many languages the standard way is to loop over the index values and access the array element via the index variable. In Perl, the better choice is a foreach loop directly over the array, where the loop variable is aliased to each array element in turn. This is a simple choice, but the number of times I have seen C style loops over Perl arrays is astounding. There was no conscious choice being made there, just a default to some way the coder knows will work for looping over an array. Too many coders are just happy to get something that seems to work and neglect making any choices.
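To make that concrete, here is a small sketch (my own illustration, not from any particular codebase) of the two looping styles side by side:

    use strict;
    use warnings;

    my @words = qw( data flow node address );

    # C style: loop over the indices and subscript the array each time
    for ( my $i = 0 ; $i < @words ; $i++ ) {
        print "index $i: $words[$i]\n";
    }

    # Perl style: foreach aliases $word to each element in turn,
    # so you can even modify the elements through it
    foreach my $word ( @words ) {
        $word = uc $word;
        print "$word\n";
    }

The foreach version says what it means (visit every element) and skips the index bookkeeping entirely.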
Think of it this way: code is the record of your logical decisions and choices when analyzing and solving a problem. The goal is to make a conscious choice from the coding options available to you. Learn the variations in the language you use and choose the best one for the given situation. That is why there are while/for (C style) and foreach loops in Perl, as well as statement modifiers. Each has its uses and best practice locations. If you don't know all the looping styles, you are limiting your coding view.

There is Nothing New Under the Sun (workstation)

Dataflow isn't a new concept. I am just repackaging it in a Perlish context. This is true for so many things in software. What you may think is a new concept in coding has likely been done before, but you didn't know it. Lisp was created in 1957 and is the grandparent of so many language concepts. PL/I was called an early Perl by Mark Jason Dominus, and I agree with that conceit. It has varying length strings, dynamic allocation of memory for structures, built-in record I/O (dbm anyone?) and more. There is a reason people study history in every field: you can learn quite a bit from it, including avoiding mistakes. Larry Wall stole/borrowed/learned from many languages and tools when he first created Perl. That historical knowledge led him to choosing the set of features that let Perl be UNIX in a language. To paraphrase Santayana, those who don't study the history of code are condemned to recode it (badly).

External Configuration

In most code, calls to outside modules are made directly. Changing the call stack involves changing the actual code, and that can also entail changing unit, integration and system testing. In dataflow, modules never directly call each other but instead are configured with the addresses of where they will send data. This configuration is external to the actual code in dataflow modules, and it effectively decouples the modules from one another.

Universal API (UAPI)

Though the term UAPI may be new to you, the concept is well known to you, though you may not know from where. The classic set of shell tools which use stdio and pipes to communicate is a UAPI. Any program which can read and/or write using the stdio handles can communicate with any other program. And there are hundreds (more likely hundreds of thousands in total all over) of these stdio tools. Beyond the agreement to use stdio, many of these tools have requirements on their input and output. You can't just connect any two of them together and expect a useful result. Some, like grep and similar filters, can go almost anywhere. Others are more specialized and handle or generate specific formats. Most of the tools work on plain text, but there is no requirement for that, as some (e.g. image munging) can handle binary data.

But stdio tools and pipes have several major drawbacks. The I/O is unidirectional, a pipeline is synchronous, you can't fork the data stream (tee only writes to a file, which doesn't really count), and it isn't easily extended off the CPU (yes, ssh can do some of that). This is one of the major analogies here: dataflow has a UAPI which is very similar to that of the shell tools. Any node can send data to any other node. And, as with the shell tools, not every combination of dataflow nodes will make sense, so it isn't universal in that dimension. But the drawbacks of the shell tools are gone: you get data forking, asynchrony and distributed workloads.

Glossary

I will be using the term node to mean a dataflow node. It may be an object but it doesn't have to be (implementation dependent).
A dataflow node is some code loaded from a module and given a unique address. It will also be given the addresses of where it sends data and any initialization arguments it needs. Those come from the configuration, which has this information for all the dataflow nodes in the system.

A configuration is a set of data that can be used to create a dataflow system. It has a list of the nodes to load, their own addresses, the addresses they will send to, and initialization arguments. It is important to note that this is just some data structure that can be edited, stored and loaded. This matters later when we get to dynamic configuration.

Data is anything you send from one node to another. It can be a stream of text or media, an individual request, status information or anything else. The key point is that this is data, not objects, code or anything some node might not like. As mentioned elsewhere, you can't just send data to any other node - both sides must be in general agreement on the format of the data.

A stream is a series of data packets going from one node to another. A stream is set up with just two nodes, one of them having the address of the other. Streams are inherently unidirectional, but a bidirectional setup is easy to do by just having both nodes know the other's address. Or data can be sent from one node to another and data can be sent back to the sender (just like replying in email). As with shell tools, any node could send data to any other, but they should be in agreement regarding the format of that data. A data stream could be just plain text (as with most shell tools), but it could also be a command or a response (as in HTTP or similar protocols). Also, the data is in packets, so they could be some data structure and not just text. You are not limited to any particular format as long as the sender and receiver agree on what is being sent. Also, data is never really a true stream. At the TCP level streams are packetized, and in dataflow they will also need to be packetized. A single call to send data has to know its size. A stream is just a sequence of data packets, each carrying some chunk of the total data stream.

Hardware Dataflow vs. Software Dataflow

The first influence was Arvind (yes, that is his full name), a post-doc whom I met at MIT when I was an undergraduate. Arvind focused on hardware dataflow and parallel languages. Hardware dataflow is essentially asynchronous logic, such as the 7400 series of basic integrated circuits that I was studying at the time. Today most ICs are synchronous, which means all the data between sections is clocked into registers so the next section will know it has clean data signals on its input ports. Asynchronous logic (i.e. dataflow) doesn't have those registers, which makes it faster and lower in power usage (no clock lines all over connected to all those registers) but harder to design. I read some of Arvind's work in this area and it has influenced much of my work since then. I have passed data structures around (a simple form of dataflow) in an RTOS in PDP-11 assembler (still my favorite CPU architecture!), a crawler in C (for Northern Light), a crawler in Perl and other projects. I don't recall consciously using the dataflow ideas, but I am sure they were somewhere in my brain.

Module Strength vs. Module Coupling

Around the time I met Arvind, I was also reading several books by Glenford J. Myers. Myers has written several quality books on software and hardware, and I recommend you find used copies and read them.
One of those books, "Software Engineering Through Composite Design", covers his concepts of module strength and module coupling. Module strength is about measuring the quality of a module's API. Read-only arguments and a write-only return value (like the common trig functions) make for a very strong module. Weaker modules use read/write or pass-by-reference arguments. The weakest modules use shared memory. Remember, Myers was writing at a time when Fortran was still very big and COMMON data (nasty global spaces) was (ahem) common. Module strength is all about insulating a module from being misused or abused by outside code.

In the other dimension, module coupling looks at how modules interact with each other. A simple call passing arguments couples modules relatively cleanly; shared memory, once again, is the worst way to couple modules. I look at dataflow as an extension of those gradings. Dataflow modules don't even use call stacks to connect to each other (internal code in a dataflow node does use a call stack). The nodes don't even know what module will be getting the data they send out; they only know the address from the external configuration. I would say that is the loosest possible coupling! And dataflow module strength is very high: the modules are very isolated from each other and have a simple read-only argument coming in (the data) and usually only one write-only argument going out (outbound data). A dataflow module could send out more than one piece of data, but they are all write-only from the perspective of the sender.

The third influence came when I was consulting with Akamai in the period just before their IPO (and I didn't get any stock! :( ). What I found there was a bunch of kids (as in 25 and under) reinventing all sorts of networking tools, all different and poorly done in Perl. At some point I proposed an idea to management to redo most of those tools to use a common communications framework, which would have been a major win for them. Unfortunately, my ideas didn't get developed and we soon parted ways. So after that I set to thinking about how such a general purpose communication framework would look. I took the idea of strong and isolated modules from the composite design concept and went one step further. The modules would communicate not via direct calls but by sending data to each other. This made it possible to create the application from a configuration file which loaded modules, created objects and assigned them addresses. The communications framework then took over and allowed data to flow between the objects. I was already very skilled with event loops, so my prototype (more on this below) used an event loop as its core and used message passing to implement dataflow.

Interestingly, Myers' book Advanced Computer Architectures (3rd ed.) has a chapter on hardware dataflow, but he didn't bring that idea back into his software work. I suggest you google for his books and I hope they reward you as they did me. They are somewhat out of date but still have major concepts worth knowing.

Architecture vs. Coding

Object, aspect, waterfall, agile, spastic, fireplace, pear (sic) coding, stand-on-your-head meetings, procedural, functional, top down, bottom up, inside out, etc. Dataflow is a language independent architecture (see more below on that). It is all about the interconnections of modules.

Design vs. Implementation

Asynchronous vs. Synchronous Logic

Event Loops vs. Threads

A major design issue with complex systems is how to multitask.
The two major techniques are event loops and (kernel or language) threads. I have a big bias for event loops over threads, as I have done several major projects with event loops. In fact, my first serious production project was an RTOS written in PDP-11 assembler that was effectively an event loop as well as having dataflow features. But a better comparison is to explore the strengths and weaknesses of each of them.

Threads have one major win - blocking. They can block on an I/O or system call and let another thread (or process) continue. Threads also have a shared memory space, which is both good and bad - data needs to be locked (with a semaphore or similar) before it can be modified. Managing threads is also tricky: spawning them, semaphores, communication and merging them when they are done. Also, threads bring nothing to the table regarding sharing data outside of their box. Communication between local threads is very different from communication with remote threads.

Event loops in general don't have those problems, but their big one is blocking. A blocking operation will stop the whole event loop from continuing, and that is very bad. Event loops do have shared memory, but they don't need any locks or semaphores, as only one logical thread is running at a time. Also, sending data between components in an event loop is similar to sending data to another system.

Wouldn't it be nice to have the flexibility of event loops doing the main work while using threads for blocking operations? Well, it can be and has been done. One project I had was writing a major web crawler for Northern Light (back in the days of AltaVista and way before Google). I created an event loop system in a single process that could fetch 2000 web pages in parallel. It worked very well, but when it was required to do reverse DNS lookups I ran into a snag. Those lookups were blocking, as the code called a library which in turn made calls to DNS. I did come up with a neat (IMNSHO) solution that let kernel threads (this was in C) do those blocking calls while staying integrated with the event loop. I will leave it to the reader to figure out the solution; my design is covered at the end of this article. Note that there is a CPAN module (not mine) which does this as well, but I have a better design in mind that would be much more flexible and easier to extend with new services.

Overhead vs. Flexibility

I am sure some of you are already screaming about the overhead of having all data be passed through a UAPI. Of course direct calls are faster, but dataflow is more flexible. The next step is to see which of those two characteristics is more important to the project. I will choose flexibility, as you can almost always speed things up with faster code or more CPU, but you usually can't retrofit flexibility into an existing system. If you want to look at the cost issue, let's go back to the days of timesharing on mainframes. Each user had all their CPU and other resources accounted for, and you had to be granted some form of budget to pay for it. You effectively rented CPU time because it was so expensive to own a computer. Thousands of users may have shared one system with batch and interactive programs. We think things have changed, but other than the cost, we still pay for CPU time. Clouds, co-location racks and internal server farms all have a cost. The real change is in the cost of the humans writing the code for those boxes.

Hardware Dataflow

To Be An Object vs. Not To Be An Object

That should be the question.
I first learned about objects in a class at MIT taught by Professor Barbara Liskov, one of the pioneers of OO software. This was before any languages had proper OO features, and we used PL/I as it was available and had the ability to declare and allocate data structures. This was the key feature: the ability to collect related data into one structure. The other major feature was to collect the code that knew how to handle that structure into one module. You need to understand how much of a revelation this was. Code was ruled by individual variables. Structures were usually declared, but any code could mess with them. Code was king and data was just the peasants to be ordered around. With OO, data became king and code was now the supporting court.

I took those lessons to heart, as many of my projects have been data driven. Data structures were created and passed around via pointers, and the code was much simpler as it usually only modified that structure. It dramatically reduced the use of globals as the data became localized. In my first commercial project, I wrote an RTOS in PDP-11 assembler (still my favorite CPU architecture). It used those basic OO principles by allocating control and data structures and passing them from input drivers to processing code and then to output queues. Yes, you can do OO in assembler if you organize it that way. Since then many OO features have come along, but they are mostly bells and whistles and syntactic sugar. Methods, polymorphism, inheritance and others can be emulated if you don't have them. But passing around data can't be done if all you have are globals and common space.

This brings us to the data passed around in dataflow. It is data, and not objects, for a major reason. Since a node will send out data but not know what node will get it, it can't send an object. The receiving node may not have the class code for that object available. But if it is just plain data (likely a buffer or a structure), the receiving node doesn't have to have any special code to handle and process that data. On the other hand, nodes will likely be objects themselves. This allows for polymorphism, where different modules can be used to implement a node but keep the same API regarding the type of data they expect to receive.

So let's get back to the title of this section. A case I saw was where a main object had several subsections of data inside it. The subsections were generally isolated from each other and they were not reused, so they didn't need to be made into objects - but they were. More of the code was spent declaring object data and methods and such than on the actual code doing the work. And since there were times when a subsection needed some data from another subsection, it had to climb up the object stack and then call down into a sibling object, which created very fugly code. My design would have been to leave those subsections as hashes under the main object, and then access would have been easy and clean. Data doesn't always need to be protected from internal access by methods and accessors.

My point here is that too often, making something into an object is the default decision. In some cases that choice may be detrimental to the design. The most common case is with singleton objects. They can usually be implemented with procedural code and lexically isolated data just as easily as making them into a class or a single instance object. There usually isn't any benefit to making it into an object, but everyone does it.
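As a hedged illustration (a made-up 'counter' service, not any real module), here is that kind of singleton behavior done procedurally with lexically isolated data:

    # Counter.pm - a procedural "singleton" with no class in sight
    package Counter;
    use strict;
    use warnings;

    # the state lives in a file-scoped lexical, so nothing outside
    # this module can touch it except through these subs
    my %counts;

    sub bump  { my ($name) = @_; return ++$counts{$name} }
    sub count { my ($name) = @_; return $counts{$name} // 0 }

    1;

A class with new(), accessors and a stashed instance would do exactly the same job with more ceremony; the lexical hash already gives you the isolation a singleton object is supposed to provide.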
My CPAN module File::Slurp is purely procedural, but other versions of it are objects with slurp methods. I feel simpler is better here.

You Can Cross the Streams!

Dataflow is not Ghostbusters, so you can and should cross the streams. Here are two scenarios.

You have an existing call stack in an application and you want to trace and replay data at a certain point in the stack. This will involve rewriting the code there so it can copy all of its data to some output location, read that data back and make calls down the stack, and also just pass down calls from above. There are existing trace and replay systems, but they usually work on a wire where you can more easily interpose this code. There are HTTP proxies that can do this for browsers, and similar things elsewhere, but few do it inside a call stack.

The other scenario is where you have data in a call stack and you want to also send it elsewhere - maybe copying it or just redirecting it, possibly to multiple destinations. This is doable, but you may need to handle options to control when and where this redirection happens. The code will have to know all about the call stack and the APIs of the destinations.

Those are the beginnings of nightmare scenarios - they can be done with some good coding. Now, what about doing those in multiple places and with more destinations? You have a full blown nightmare now. And to relieve your dark dreams, I will tell you that dataflow makes those problems go away with little effort. Because there is no traditional call stack between nodes, you can interpose a multiplexing/switching node anywhere you want. This node can receive data, duplicate it one or more times and send it on to different destination nodes. It is actually fairly simple code to write, as it just duplicates data and resends it with different addresses. How data is multiplexed is controlled just by maps inside that switching node, and those can be configured at load time and changed at run time. You can dynamically route data wherever you want. This node can be used for broadcasting a single stream to multiple locations, publish/subscribe designs, and funneling (merging multiple incoming streams into one output stream). Imagine the fun you will have when you can just cross those streams and not be scared of doing it!

Email as Dataflow

Here is another good analogy for you to absorb - dataflow addresses are very similar to email addresses. They are globally (within your dataflow system) unique and have a hierarchical naming space. Email addresses have the two primary parts of name@example.com, but the domain can have subdomains (you don't see that much now, but in earlier mail history it was more common). The subdomain would locate which mail server to use inside the top level domain's network. In dataflow, an address can be used to locate a destination node, but a sub-part of that address could also identify an individual node in a set of related ones (which share a higher level address part).

Old Fashioned Threaded Code

I bet you didn't know that the term 'threaded code' was in use before kernel and language threads became popular. And the concept is still widely in use, just not called threaded code. The old usage was where your main code body is a set of subroutines and the logic flow is just a sequence of calls (the threading) to those routines. This was something easy to code up, and easy to generate the logic flow for.
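A minimal sketch of that old style of threading (the subroutine names here are invented for illustration): the "program" is just a list of code refs that get called in order.

    use strict;
    use warnings;

    sub read_input   { print "reading input\n" }
    sub munge_data   { print "munging data\n" }
    sub write_output { print "writing output\n" }

    # the logic flow is itself data: an ordered list of subs to call
    my @thread = ( \&read_input, \&munge_data, \&write_output );

    $_->() for @thread;

Change the flow and you edit the list, not the subroutines.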
It is still used by many interpreters, but they loop over the op code tree instead of directly executing the main logic. I bring this up because later on this concept is used to handle asynchronous flow control.

High Isolation

The sections on module strength and coupling emphasize module independence. Dataflow modules strive to reach that goal by not having direct calls and not loading modules which aren't needed. This gives code high isolation from other code, which is a big win as it makes for very loose coupling. A given module doesn't know or care about outside code, so it can be designed, maintained, rewritten and debugged in isolation, which makes for a much easier to manage system.

Dependency Heaven

I am sure you know about dependency hell. It is where a module or application loads what seems like half of CPAN and is delicately balanced on that stack of code. One mistake installing down below and the whole stack comes crashing down. The length of the CPAN installation output feels like it is in gigabytes! Not fun. Well, dataflow creates a dependency heaven instead. By eliminating the inheritance and direct calls to modules, you don't need to install the world just to get something installed and running. Dataflow modules are designed to be independent of each other, and you only need to install and load the actual modules you use and not everything that could be used. This should give you a warm and fuzzy feeling you have never gotten before from software.

Interpose at Will

Here is another painful scenario. You have a call stack and you want to insert some more processing (e.g. filtering, formatting) at a certain point. Of course this is doable, but if you use another sub for this you are just making your call stack even bigger and harder to manage. The code you write to insert this new code is custom and not reusable. Your testing code for this subsystem will likely need as much changing as the code itself. And as before, wait until you need to do this in multiple places.

As with the switching/multiplexing node covered elsewhere, you can easily interpose a dataflow node between any others in your network. The surrounding nodes don't need to change any code, and your testing system may need only some minor changes. You can even have two versions, with and without the extra processing, just by having two different configurations. Or you could have the extra code switched in and out by using that switch node. This is one of the essences of dataflow: total control over what data goes where. Nodes do work, but where data flows is controlled by the configuration.

Unit Testing

When you have highly isolated modules, it is much easier to do unit testing. You don't have to worry about other code in the stack affecting the module under test. You don't need to load up the whole stack to test this module. You can do more comprehensive testing because only the code in the module under test is being run, and you could even do a full coverage test (if it is Goldilocks sized! :). Dataflow modules also can be tested without much scaffolding (no mock nodes or complex testing structures). What you do is create a specific module to test this module. The testing module can generate and receive all the data received and sent by the module under test. A simple configuration needs to be created that loads the testing and tested modules with addresses that connect each to the other. The testing module generates data which is sent to the tested module. Data sent back is received and checked for correctness.
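A sketch of what that pairing could look like; the configuration keys, module names and the faked round trip here are invented for illustration, since the article does not pin down a concrete framework:

    use strict;
    use warnings;
    use Test::More tests => 1;

    # hypothetical two-node configuration: the tester and the node under
    # test point their output addresses at each other
    my $config = [
        { module => 'My::Tester',  addr => 'test/driver', send_to => 'app/upcaser' },
        { module => 'My::Upcaser', addr => 'app/upcaser', send_to => 'test/driver' },
    ];

    # a real harness would load $config and start the flow; here we fake
    # the round trip just to show the shape of the check
    my $sent     = { text => 'hello' };
    my $received = { text => uc $sent->{text} };    # what My::Upcaser should send back

    is( $received->{text}, 'HELLO', 'upcaser node returns upper cased text' );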
The classic ok/not ok (TAP) protocol can be used here, so the standard Test::* modules work too.

Integration Testing

Testing a subsystem of a large complex application can be a very difficult process. The big problem is finding a clean boundary which encompasses the subsystem and creating a testing structure that can drive it and check for correctness. As mentioned elsewhere (in Interpose at Will), if you need to change that boundary (adding new features, layers, etc.) it can become a major pain.

With dataflow, you can select any subset of nodes (connected by addresses) to be a subsystem under test. The same concept used in unit testing can be used for a subsystem. All you need to know is what data is being sent to and received from the subsystem, and you can create a testing module for it. Just as with unit testing, there needs to be a configuration that loads both the subsystem modules and the testing module and sets the addresses so the subsystem is only communicating with the testing module. Adding more features or layers to the subsystem is easier. You can even leave the existing testing module and its configuration as is, but clone them and edit the copies to load and test the new subsystem. You can choose any boundary in the full application to be a subsystem under test - that alone will make some of you very happy.

Goldilocks Sizing

Yes, I said a fairy tale would be part of this, and it is a pretty good way to teach this point. A crucial element when designing software is how large some element should be. An element can be a sub, a loop, a module or even a subsystem. The size is related to the level and to other elements at that level. We have all seen monstrously large systems which will engulf half of CPAN. Then there are loops and subs that are pages long, or maybe one line doing too little. Size matters (in code). The real question is how you can tell when something is the right size. One way is to see if the feature set is solid and does what is needed and no more. Is the element generic or custom oriented? You want to keep generic ones smaller as they are more easily shared. Some of this will only come with experience and reading lots of code. And it will require discipline to keep a module the right size once you have gotten it there.

A case history of a major failure will illustrate this principle. I consulted at a firm which had a complex application with many variations and combinations of possible features for each customer. They were writing custom code for each situation. I proposed they analyze all the variations and factor out any common things (of which there were plenty, IMNSHO). Then they could write threaded code or an interpreter to execute the common stuff, as well as a way to call custom code. They said they had tried that already. But the coders kept adding stuff into the generic code to make it custom for a user. They broke the Goldilocks principle, and they still ended up with massive amounts of unshareable custom code, just buried inside large individual operations in this system. The new system was not a win, as it was just calling complete custom code for any situation. They showed no discipline in keeping generic code generic. They had an epidemic of creeping featuritis. The Goldilocks principle is meant to guide you in keeping things the best size for their purpose. But you have to be on guard lest you sleep in the wrong bed or eat the wrong porridge.

Dynamic Configuration

Remember earlier when I said that a configuration is just a data structure? Well, this is where it pays off in gold.
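To keep that concrete, here is one possible shape for such a configuration (the keys, addresses and module names are invented; a real framework would define its own):

    use strict;
    use warnings;

    # a configuration is plain data: it can be dumped, stored, diffed,
    # generated by a program or shipped to another machine
    my $config = {
        nodes => [
            { module => 'App::Reader', addr => 'app/reader', send_to => [ 'app/parser' ],
              args   => { chunk_size => 8192 } },
            { module => 'App::Parser', addr => 'app/parser', send_to => [ 'app/writer' ],
              args   => { format => 'csv' } },
            { module => 'App::Writer', addr => 'app/writer', send_to => [],
              args   => { path => '/tmp/out.dat' } },
        ],
    };

    print scalar @{ $config->{nodes} }, " nodes configured\n";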
You can auto-generate a configuration and start a dataflow application with it. A configuration can be pushed to a remote system (via dataflow, of course!). Or a remote system can boot a simple configuration which then requests a configuration from elsewhere in the system. Your system can dynamically change as its load and needs change. You can even add new configuration to a running dataflow application and have it apply the additions on the fly. It could change things or just add services. Dynamic configuration in dataflow is like a weather control system for your compute cloud. Here is another idea - say your live application has had its configuration changed since it was started up. It is easy to dump and store that existing configuration and use it at some future time when you restart the application.

Plugins are Not Allowed

A plugin is a set of subs/modules which use a common API, where the main code selects one of them to call or use based on some key. The simplest plugin is the classic dispatch table - a hash of keys with code references as values. The module level plugin is commonly done with OO, where the different modules offer the same methods. This is also called polymorphism. So why does dataflow not allow plugins? Well, because in dataflow nearly every node/module is already a plugin. Since they already use the UAPI, you can select from a set of nodes with similar services. Choosing which one to use can be done in a hash or in a multiplexor node (see You Can Cross the Streams! above). So it is more a case of not needing a plugin layer or special wrappers to support plugins.

Perl6 and Dataflow

Language vs. Language

No, I am not going to bring up any language wars. I will state that dataflow is a way to end those wars! We all know how great CPAN is, but sometimes someone using a language other than Perl comes up with a cool module. Hell, it might even be in Perl 6 (which is supposed to one day be able to run Perl 5 code)! Anyhow, there are ways for some languages to be embedded in and called by another. No big whoop there. But just picking up a foreign module and plopping it into your system is not something you do whenever you want to. There may be special wrapper code you need to write to match the calling style, or data conversion to handle, or other considerations. Of course you can just bolt that module onto some messaging backbone like RabbitMQ, but then you require that module to be in another process. That isn't really using multiple languages in one process anymore.

With dataflow this becomes much simpler. Since each language can have its own library to interface with the core dataflow system, the issue of calling and handling data is done once. Any module then written to use dataflow in either language can be used in one system. The only other issue is the runtime, and if both languages can be compiled to the same engine, such as the JVM, then you are in business. You now can mix and match any dataflow module in any supported language. They can be loaded in the same process, run on different boxes or be across the net. This is real language independence. Everybody wins!

Multiple Development Layers

- module API design
- implementing modules
- creating applications

Deconstructing grep

To illustrate how you can take a larger application and break it down into dataflow components, I will deconstruct the well known grep utility. Grep loops over lines and runs a regex against them. It has (too) many options and we will only cover a few of them here.
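For contrast with the decomposition that follows, here is roughly how small a monolithic mini-grep can be in Perl (a sketch of my own, ignoring options, context lines and the rest):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # usage: minigrep.pl PATTERN [FILE ...]  (reads stdin if no files given)
    my $pattern = shift @ARGV or die "usage: $0 pattern [files...]\n";

    while ( my $line = <> ) {
        # $ARGV is the current file name, $. the current line number
        print "$ARGV:$.:$line" if $line =~ /$pattern/;
    }

Small as it is, every option the real grep grows (line numbers, -v, -l, context) gets welded into that one loop, which is exactly what the deconstruction below avoids.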
The dataflow components described here are very small, and you wouldn't break things down this far in a real project, but they are good for illustration purposes.

The first component is a line extractor (I am skipping how the file is read into a buffer or how infinite streaming would work). It has a buffer of a file, and it just pulls out the next line and sends it to the next section. It adds some information to the line, such as the line number and the file name, so a structure is sent and not just the line. This component could easily be reused in any other application that needs line by line processing.

The next component is the regex tester. It is configured with the regex (and it could be changed at run time!). All it does is test the line against the regex and add a boolean to the data structure. It can also add the offsets of the match (if it matched) to that structure. It sends this augmented structure to the next component.

This component is the actual filter itself. It is configured with a boolean that says whether to pass matched or unmatched lines (like the default or the -v option in grep). It can be configured to also do context by storing lines before and after a match and adding those to the structure.

Another component can handle the option to only output file names which had a match. As an optimization it can send data back to the line extractor to stop processing the current file and move on to the next one. This is a simple example of a feedback loop, which is easily created with dataflow. Another issue not addressed is throttling and buffering. This can be handled by another feedback loop or even by letting the system buffers do the throttling. As this is not a real world example, I won't go further here.

Those components are small, very focused and easily written and tested. Some can be reused in other applications, and you can quickly replace them in an existing application. At run time you can even change behavior (if you design your components that way), which is almost impossible with a monolithic utility like grep. When testing the actual grep utility, all the possible combinations of options and features need to be covered. That is a pretty difficult problem to solve, and it shows why breaking something complex down into simpler components is a big win. The key here is that each component is isolated from the others and can be developed and tested independently. If you modify the grep utility you may not realize what ramifications a change has, but when you modify a simple isolated component there are far fewer possible issues to deal with.

Solution to Event Loops and Threads

I am glad you made it to this part, where I reveal my design for integrating event loops and threads. The key is a pipe that is used only inside the single process (pipes are almost always used between processes). On the event loop side of the pipe, you create a request in a data structure and then write a pointer to that structure into the pipe. The other side has a farm of threads which each read a whole pointer from the pipe (you could create a pipe per thread too). One thread will read the pointer, take the data structure and perform the request. It can block all it wants now. When it is done, it puts the results into that structure and writes the pointer back into the pipe from the thread side. On the event loop side, that pipe handle is tracked with a readable event. When it becomes readable, that means a result pointer has been sent back from a thread. It gets read, and a callback is made to the original requester, which then processes the results.
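The original was done in C with raw pointers in the pipe; you can't usefully pass pointers between Perl ithreads, so here is a rough Perl analog of the same idea (assuming a threads-enabled perl), using Thread::Queue for the payload and a pipe purely as the wake-up signal for the event loop. The worker count and the reverse DNS example are my own choices, not from the article.

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;
    use IO::Select;
    use Socket qw( inet_aton AF_INET );

    my $req_q = Thread::Queue->new;    # event loop -> worker threads
    my $res_q = Thread::Queue->new;    # worker threads -> event loop

    # the pipe only carries wake-up bytes so select() can see "result ready"
    pipe( my $wake_r, my $wake_w ) or die "pipe: $!";

    my @workers = map {
        threads->create( sub {
            while ( defined( my $ip = $req_q->dequeue ) ) {
                # blocking is fine here - it only stalls this worker
                my $name = gethostbyaddr( inet_aton( $ip ), AF_INET ) || 'unknown';
                $res_q->enqueue( "$ip $name" );
                syswrite( $wake_w, 'x' );          # poke the event loop
            }
        } );
    } 1 .. 4;

    $req_q->enqueue( '8.8.8.8' );                  # a request from the event loop side

    my $sel = IO::Select->new( $wake_r );
    while ( $sel->can_read ) {                     # stand-in for the real event loop
        sysread( $wake_r, my $junk, 1 );
        while ( defined( my $result = $res_q->dequeue_nb ) ) {
            print "reverse DNS: $result\n";        # the "callback" to the requester
        }
        last;                                      # demo: quit after the first result
    }

    $req_q->enqueue( (undef) x @workers );         # tell the workers to shut down
    $_->join for @workers;

Thread::Queue handles the locking; the pipe exists only because select() style event loops can watch file handles, not queues.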
So now you can have an event loop that hands blocking operations to a thread while it continues running.

A Biblical Parabola vs. A Quadratic Story

In the beginning there was a void called Jersey, wherein lay the labs of Bell. Dwelling in the darkest dungeon there were two intelligent designers. They had a kernel of an idea, which was the idea of a kernel! And they called it UNIQUES because it was the unique kernel that was multiplied over the 9-track highway. And the kernel users saw this was good and rejoiced. The UNIQUES kernel begat init(1), which then begat all the known processes in the world, including a protective shell. And the users gave their commandments to the kernel, and they were executed and obeyed. Some users got tired of saying the same commandments over and over, so they engraved them as scriptures on papyrus cards and disks. Today, many of those commandment scriptures are written on tablets. Eons (weeks) later the users whined to the intelligent designers: we want our process children to be a family and not so separated. The intelligent designers put that whine into a pipe to smoke it, and they found that putting data into a pipe was even better! Thus a simple form of IPC, and one of the most popular UAPIs, was creat'd (sic). The users smoked a lot of data in pipes for generations and really rejoiced. Then demons arose in the kingdom of Berkeley who wanted to control their daemons. One hippie among them wanted some data sent to him and he said "Sock it to me!!". And that is how sockets were born! Larry sed, with a lisp, "'Grep' and 'awk' are the sounds I make when I C data flow." So he munged all that together and made the data munger he called Perl.

In Conclusion

If you are interested in talking about dataflow, please contact me. If you have a dataflow type project, contact me today! If you have funding for a dataflow type project, contact me yesterday! I am available for Perl work, so let me know if you have any openings.

Uri Guttman
uri@perlhunter.com
781-643-7504