ZNHOO Whatever you are, be a good one!

Flume

  1. ABCs
  2. Configuration
  3. Starting an Agent

ABCs

flume [c] an inclined channel for conveying water (as for power); a ravine or gorge with a stream running through it.

Apache Flume is a distributed system for collecting, aggregating and moving log data from many different sources to a centralized data store.

  1. Require Java Runtime Environment (JRE) - 1.8 or later.
  2. A Flume event is byte payload with optional string atrributes - also called Flume data.
  3. A Flume agent is a JVM process consisting of three components: sources, channels, and sinks.

    Flume Flow

  4. When a source receives an event, it stores it into one or more channels. The channel keeps the event until it is consumed by a Flume sink.

    Channels are passive buffers between sources and sinks: neither send nor pull events proactively.

  5. Multiple agents can be connected to form a multi-hop events flow. For example, an avro-typed source can recive external events from an avro-typed sink, using RPC.

    Multi-agent flow:

    flume-multi-agent

    Consolidation:

    flume-consolidation

    Multiplexing:

    flume-multiplexing

Configuration

Usually, Flume configuration files end with .properties suffix.

Always keep in mind that, an agent may have multiple sources, multiple channels, and multiple sinks. How these multiple objects are configured to cooperate are important!

  1. Each source, channel and sink has a type property.

    For example, a source type may be exec; a channel type may be memory; a sink type may be avro.

    Other properties are set accordingly.

  2. A Flume source can have an interceptors property. Interceptors are used to tranform recevied events before putting them onto a channel.

    For example, an interceptor may be chosen to translate raw event data into Json format.

    my_agent.sources.r12a.type = exec
    my_agent.sources.r12a.channels=c12
    my_agent.sources.r12a.command=tail -F /var/log/server.log
    my_agent.sources.r12a.logStdErr = true
    my_agent.sources.r12a.batchSize = 100
    my_agent.sources.r12a.interceptors=log2Json
    my_agent.sources.r12a.interceptors.log2Json.type=Flume.log2Json$Builder
    

    Flume.log2Json$Builder is an external Java class.

  3. Multiple sinks can be grouped (sink group) to achieve load_balance or failover from one sink to another, with the help of sink processor.

    my_agent.sinkgroups=g12a
    my_agent.sinkgroups.g12a.sinks = k12a k12b k12c k12d k12e k12f k12g k12h
    my_agent.sinkgroups.g12a.processor.type = load_balance
    my_agent.sinkgroups.g12a.processor.backoff = true
    my_agent.sinkgroups.g12a.processor.selector = round_robin
    my_agent.sinkgroups.g12a.processor.selector.maxTimeOut = 3000
    

    If backoff is enabled, the sink processor will backlist sinks that fail, removing them for a given timeout. When the timeout ends, if the sink is still unresponsive, the timeout is increased exponentionally.

  4. Each source and sink should be assigned to a channel.

    my_agent.sources.r12a.channels = c12
    my_agent.sources.r12b.channels = c12
    my_agent.sinks.k12a.channel = c12
    my_agent.sinks.k12b.channel = c12
    my_agent.sinks.k12c.channel = c12
    my_agent.sinks.k12d.channel = c12
    my_agent.sinks.k12e.channel = c12
    my_agent.sinks.k12f.channel = c12
    my_agent.sinks.k12g.channel = c12
    my_agent.sinks.k12h.channel = c12
    

Starting an Agent

Flume distribution brings in a shell script under its bin directory: flume-ng. We should specify the agent name defined in configuration file.

~ # bin/flume-ng --help
~ # bin/flume-ng agent -c conf -f conf/flume-conf.properties -n $agent_name