Feed the Monster

Hydra excels at taking massive amounts of data and making it interpretable. Its flexibility and scalability really shine when you have a lot of data, much like MapReduce in Hadoop. Now that I have a local stack running, I need to feed Hydra in order to play with it, much like the Tamagotchis of old. The source I most often see is log files, so that is where I begin: feeding it a log file. The only logs I generate are application logs from Tomcat.

Forward Thinking

I have a background with Flume, so as I designed my application I planned for a way to review the logs later. Using Log4J, my logs have consistent formatting and structure. Using Aspect-Oriented Programming (AOP), I generate access logs to see how users are using the application, and changelogs for CRUD operations. With Hydra, I can quickly break these operations down and query them later.
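
To give a sense of what I mean by AOP-generated access logs, here is a rough sketch of such an aspect. It assumes Spring AOP with AspectJ annotations, and the package names and pointcut are hypothetical rather than my actual configuration:

import org.apache.log4j.Logger;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

// Writes one access-log line per controller call, so Hydra can later break down
// which operations are being called and how long they take.
@Aspect
@Component
public class AccessLogAspect {

    private static final Logger LOG = Logger.getLogger(AccessLogAspect.class);

    // Hypothetical pointcut: wrap every public method on classes ending in "Controller".
    @Around("execution(* com.example.web..*Controller.*(..))")
    public Object logAccess(ProceedingJoinPoint pjp) throws Throwable {
        long start = System.currentTimeMillis();
        try {
            return pjp.proceed();
        } finally {
            // Log4J adds the timestamp and level prefix; this is only the message portion.
            LOG.info("ACCESS " + pjp.getSignature().toShortString()
                    + " took " + (System.currentTimeMillis() - start) + "ms");
        }
    }
}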

Some things to keep in mind when making a job:
  • The configuration is written in JSON.
  • You must save the job before you kick it.
  • Once you kick a job and it runs, changing the structure of the job configuration and running it again will muddy the data (formatting changes don’t matter).
    • You will need to Clone the job and kick the clone if you need to adjust the map or the output.
  • Log files you can check for issues: $HYDRA_HOME/hydra-local/log/spawn.log, or the minion logs in the same dir
  • If you have more problems, you can restart Hydra:
    • shutdown – ./hydra-uber/bin/local-stack.sh stop
    • start – ./hydra-uber/bin/local-stack.sh start (run it twice), with a rabbitmqctl status in between to make sure RabbitMQ has started (the output no longer says nodedown)
  • It’s best to bookmark Hydra’s documentation, because it has incredibly helpful information

My First Job

Following Matt Abrams’ excellent tutorial, Getting Started With Hydra, I wrote a job to read from a Tomcat log. My log isn’t comma-delimited like the one in his example, so I’ll show how to parse the elements out.

There are three components to the job — Source, Map, and Output.

Source

This defines where your data is coming from. For this example, the data comes from files, using the mesh2 DataSource. Notes are inline:

{ // the opening brace!
    type: "map", // Must be specified
    source: {
      type: "mesh2",
      // `hash` to `true` says to hash the filenames and "ensures that each unique input file is processed by one and only one task processor."
      hash: true,
      mesh: {
        // Log file location inside $HYDRA_HOME/hydra-local/streams, here a subdirectory of ../streams named `bklogs`
        // take any logs starting with catalina..
        files: ["bklogs/catalina..*"],
      },
      format: {
        // use "column" when source is a delimited file
        type: "column",
        // names for the columns, in my case a single name for the entire line
        columns: ["LINE"],
        tokens: {
          // group everything between quotes as a single data element
          group: ['"'],
          // take everything in the line and set it as the column
          pack: true,
          // overriding the default ',' separator to something not seen in my log - four pipes
          separator: "||||",
        },
      },
    }, // end of the source section (the map and output sections below continue this same config)
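
For mesh2 to find anything, the log file needs to be under that streams directory before you kick the job. I copied mine in with something along the lines of cp catalina.2014-03-22.log $HYDRA_HOME/hydra-local/streams/bklogs/ (the filename is just an example; anything matching catalina..* gets picked up).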

Map

This section filters out lines and parses each line into attributes that can be used later. A major thing you will notice is the structure:

  • op – operation to perform
  • from and to – the get and set for the operation. If to is omitted, the operation uses from as the to.
  • Numerical operations follow reverse Polish notation (RPN).
  • For more info on filters, see http://oss-docs.addthiscode.net/hydra/latest/user-guide/concepts/filters.html
"map":{
    // filter out pieces of data from the output AND manipulate the data as follows
    "filterOut":{op:"chain", filter:[
        // first, get the length of the line and assign it to LINE_LENGTH
        {op:"field", from:"LINE", to:"LINE_LENGTH", filter:{op:"length"}},
        
        // use the contained operations in a chain
        // if it fails in here, the row is skipped
        {op:"chain", filter:[
        // check that the LINE_LENGTH is >= 29, the minimum length for parsing.
            {op:"num", columns:["LINE_LENGTH"], define:"c0,v29,gteq"},
        ]},
    
        // substring LINE using 'slice' from 0 (implied) to 19 and assign to LOG_TIME
        {op:"field", from:"LINE", to:"LOG_TIME", filter:{op:"slice", to:19}},
        
        // check that the LOG_TIME starts with a 2, meaning it's a year
        {op:"chain", filter:[
            {op:"field", from:"LOG_TIME", to:"YEAR_START", filter:{op:"slice", to:1}},
            // equals expects properties not literals, so assigning a value to EXPECTED_YEAR
            {op:"value", to: "EXPECTED_YEAR", value: "2"},
            {op:"equals", left:"YEAR_START", right:"EXPECTED_YEAR"},
        ]},
    
        // parse the rest of LINE to other properties LEVEL and MESSAGE
        {op:"field", from:"LINE", to:"LEVEL", filter:{op:"slice", from:24, to:29}},
        {op:"field", from:"LINE", to:"MESSAGE", filter:{op:"slice", from:29}},
        
        // trim any white space
        {op:"field", from:"LOG_TIME", filter:{"op":"trim"}},
        {op:"field", from:"LEVEL", filter:{"op":"trim"}},
        {op:"field", from:"MESSAGE", filter:{"op":"trim"}},
        
        // transform the LOG_TIME into YMD format
        {op:"field", from:"LOG_TIME", to:"DATE_YMD", filter:{op:"slice", to:10}},
        
        // great for a first job: output what Hydra has so far in the map to the log
        {op:"debug"},
    ]},
}, // end of the map section
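
To make the slice offsets concrete, here is a hypothetical input line and roughly what the chain above pulls out of it. The numeric check define:"c0,v29,gteq" is the RPN part: as I read it, it pushes column 0 (LINE_LENGTH), pushes the constant 29, and tests greater-than-or-equal, so lines too short to parse are skipped.

// hypothetical input line, assuming a fixed-width Log4J prefix such as "%d %-5p %m":
//   2014-03-22 10:15:32,451 INFO  Hello World
//
// after the chain above, the bundle holds roughly:
//   LINE_LENGTH = 41
//   LOG_TIME    = "2014-03-22 10:15:32"   (slice to 19, trimmed)
//   YEAR_START  = "2"                     (passes the equals check against EXPECTED_YEAR)
//   LEVEL       = "INFO"                  (slice from 24 to 29, trimmed)
//   MESSAGE     = "Hello World"           (slice from 29, trimmed)
//   DATE_YMD    = "2014-03-22"            (first 10 characters of LOG_TIME)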

Output

Output the data into a tree. Here you can also define data attachments, often used to count unique values at different points in the tree.

    "output": {
        // we are building a tree
        "type": "tree",
        // the default path is TREE
        "root": { path: "TREE" },
        "paths": {
          // define the TREE
          TREE: [
            // top-most node of the tree is a constant "mylog"
            { type: "const", value: "mylog" },
            // define the branches
            {
              type: "branch",
              list: [
                [
                  // since the next layer is a date, including another constant "ymd"
                  { type: "const", value: "ymd" },
                  // output the date
                  {
                    type: "value",
                    key: "DATE_YMD",
                    // add a data attachment to get the total logs on a given day.
                    // This may be useful in seeing total log activity for each day.
                    data: {
                      // this attachment is a count using the "hll" (HyperLogLog) counter with a 1% relative standard deviation (rsd), and will count the number of unique LOG_TIME values
                      total_logs: {
                        type: "count",
                        ver: "hll",
                        rsd: "0.01",
                        key: "LOG_TIME",
                      },
                    },
                  },
                  // output level, and count total messages for each level
                  {
                    type: "value",
                    key: "LEVEL",
                    data: {
                      level_count: {
                        type: "count",
                        ver: "hll",
                        rsd: "0.01",
                        key: "LOG_TIME",
                      },
                    },
                  },
                  // output the message
                  { type: "value", key: "MESSAGE" },
                ],
              ],
            },
          ],
        },
    } // end of the output section
} // close the job config

The resulting tree looks like this:

  • mylog
    • ymd
      • 2014-03-21
      • 2014-03-22
        • ERROR
          • “OH NO! Something is broken.”
        • INFO
          • “Hello World”
          • “Just another log message”
        • WARN
      • 2014-03-23

Viewing Data and Errors

After you save and kick the job, you may want to view the log of the minion executing it; this is essential for understanding errors. To do so, click the Tasks tab and then click the hostname. The stdout or stderr log will appear, along with a text field for setting the number of lines to show.

If it runs fine, then click Query to see the data. More to come on querying!

Troubleshooting

  • I had trouble with the minions not starting tasks. One error message concerned an inability to find grm (bash: grm: command not found) even though it was installed. To remedy this, I went to Settings in Spawn and set the Number of Replicas to 0.

Since it transforms the log into useful pieces, this job is better suited to be a split job, with map jobs referencing it. More on that later.
