InfluxDB Primer

In the previous chapter, we installed InfluxDB using a Helm chart. The default installation was modified in two ways:

  • a small persistent disk was added to store time series data
  • a user was created

In From MQTT to InfluxDB, we will write a small service that forwards data from Mosquitto, our MQTT broker, to InfluxDB. Before writing that service, this chapter explains some InfluxDB basics such as databases, measurements, tags, fields and more. The InfluxDB CLI, influx, will be used a lot; you already installed it in the previous chapter. To let the local CLI reach the InfluxDB instance running in Kubernetes, forward port 8086 of the InfluxDB pod:

kubectl port-forward <InfluxDB_pod> 8086

Alternatively, you can get a shell to your InfluxDB container with kubectl exec and run influx from that shell.

Databases

An InfluxDB database acts as a container for many other objects such as users, retention policies, time series and more. Remember that we created a database called telemetry during deployment of InfluxDB by modifying values.yaml of the Helm chart. In From MQTT to InfluxDB, we will also create the database during service initialization. To see the databases, start influx and make sure you are connected to your InfluxDB instance. Then use the auth command to log on with the admin account and whatever password you configured. The SHOW DATABASES command and its result are shown below:

> SHOW DATABASES
name: databases
name
----
_internal
telemetry

To create a database, use the CREATE DATABASE command. When you create a database, you can set options such as the retention policy and a replication factor. Because we have a single InfluxDB instance and not a cluster, we need to set the replication factor to one. To create a database with a retention policy of one week and replication factor one, use the following command:

CREATE DATABASE mydb WITH DURATION 1w REPLICATION 1 NAME myrp

This creates a database mydb with a retention policy called myrp that keeps data for one week as the default. Know that retention policies are set at the database level. To see the retention policy, use the following commands:

USE mydb
SHOW RETENTION POLICIES

This results in the following output:

name duration shardGroupDuration replicaN default
---- -------- ------------------ -------- -------
myrp 168h0m0s 24h0m0s            1        true

Note that the myrp retention policy is set as the default.

The database we created during installation (telemetry) has the following retention policies:

name    duration shardGroupDuration replicaN default
----    -------- ------------------ -------- -------
autogen 0s       168h0m0s           1        true

When you create a database and you do not specify a retention policy, InfluxDB creates the autogen retention policy. It keeps the data indefinitely. The name of the policy and its auto-creation can be controlled from the InfluxDB configuration file.

If you want to delete the database mydb, use the following command:

DROP DATABASE mydb

Retention Policies

Instead of creating retention policies during database creation, you can also create them separately. For example:

CREATE RETENTION POLICY "60m.summary" ON mydb DURATION 60m REPLICATION 1 DEFAULT

The above command creates a retention policy that removes data when it is more than 60 minutes old. Note that the use of DEFAULT sets the policy as the default for the database.

The duration is set with duration literals like 1000ms, 1s, 10m, 50h, 20w, etc... You can also use INF for infinite. See https://docs.influxdata.com/influxdb/v1.4/query_language/spec/#durations for more details.
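To get a feel for duration literals, here is a minimal Python sketch (the helper name is ours, not part of any InfluxDB library) that converts simple literals such as 1w or 50h to seconds. It deliberately ignores compound durations like 1h30m, which InfluxQL also accepts.

```python
# Simplified parser for single-unit InfluxQL duration literals (assumption:
# compound literals like "1h30m" are out of scope for this sketch).
UNITS = {"ns": 1e-9, "u": 1e-6, "ms": 1e-3, "s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}

def duration_to_seconds(literal: str) -> float:
    if literal == "INF":
        return float("inf")
    # Try the longest unit suffixes first so "ms" is not mistaken for "s".
    for unit in sorted(UNITS, key=len, reverse=True):
        if literal.endswith(unit):
            return float(literal[: -len(unit)]) * UNITS[unit]
    raise ValueError(f"not a duration literal: {literal}")
```

With this helper you can confirm, for example, that 1w and 168h denote the same duration, matching the 168h0m0s shown by SHOW RETENTION POLICIES.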

When you write data to InfluxDB using a client library or the HTTP API, you need to specify the database you write to. For instance, if you use curl to write data, you would start with:

curl -i -XPOST "http://localhost:8086/write?db=mydb" ...

If you do not specify a retention policy, the default retention policy will be used. If the data you write needs another retention policy, you can specify it like so:

curl -i -XPOST "http://localhost:8086/write?db=mydb&rp=myrp" ...

Time Series

With InfluxDB and other time series databases, data is organized in time series. A time series contains one or more points. A point in InfluxDB consists of a timestamp, a measurement, at least one key-value field and zero or more key-value tags. You already got a glimpse of the data we will write in Writing an MQTT client. We write data to Mosquitto in the following JSON format:

{"device": "simulator", "type": "airq", "temperature": 25.94, "humidity": 45.58}

The data we want to write to InfluxDB consists of the above fields in addition to the room. We extract the room from the MQTT topic which is formatted as airq/city/building/room.
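Extracting the room from such a topic is a one-liner. The sketch below (helper name and the city/building segments in the example are ours) assumes the airq/city/building/room layout described above.

```python
# Sketch: derive the room tag from an MQTT topic formatted as
# airq/city/building/room (topic layout from this book).
def room_from_topic(topic: str) -> str:
    parts = topic.split("/")
    if len(parts) != 4 or parts[0] != "airq":
        raise ValueError(f"unexpected topic: {topic}")
    return parts[3]  # the last segment is the room
```

For example, a topic like airq/antwerp/hq/asterix (city and building invented here) yields the room asterix.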

If you follow the subsequent chapters, you will have data in InfluxDB like so:

> select * from iotdata limit 10
name: iotdata
time                device    humidity room    temperature type
----                ------    -------- ----    ----------- ----
1517766354000000000 simulator 42.76    asterix 30.06       airq
1517766355000000000 simulator 43.49    asterix 39.82       airq
1517766356000000000 simulator 44.29    asterix 39.52       airq
1517766357000000000 simulator 47.35    asterix 24          airq
1517766358000000000 simulator 49.63    asterix 29.82       airq
1517766359000000000 simulator 41.78    asterix 27.16       airq
1517766360000000000 simulator 49.11    asterix 37.05       airq
1517766361000000000 simulator 46.2     asterix 35.63       airq
1517766362000000000 simulator 47.25    asterix 30.4        airq
1517766363000000000 simulator 46.45    asterix 24.41       airq

The measurement is called iotdata, which admittedly is not a terribly well chosen name. The measurement is a container for tags, fields and the time column, so its name should describe the data in some way. Conceptually, a measurement is similar to a SQL table.

So what are the fields and tags in the above measurement? device, room and type are tags; humidity and temperature are fields. Use the SHOW TAG KEYS command to find out what your tags are per measurement:

> show tag keys
name: iotdata
tagKey
------
device
room
type

Tags are always of the string type.

Use SHOW FIELD KEYS for the fields per measurement:

> show field keys
name: iotdata
fieldKey    fieldType
--------    ---------
humidity    float
temperature float

Fields contain values of a specific type, in our case float. Other types are integer, string, boolean and timestamp as discussed in more detail at https://docs.influxdata.com/influxdb/v1.4/write_protocols/line_protocol_reference/#data-types.

In InfluxDB, time is always the primary key but tags are also indexed. Field values are not indexed so take care when you write your queries. Doing selections based on field values means InfluxDB needs to scan all the values which is a costly operation.

With this explanation behind us, we can move on to explain what a time series is. Simply put, a time series is a collection of data points that share retention policy, measurement and tag set. Use the SHOW SERIES command to view the series in the current database:

> show series
key
---
iotdata,device=particle_electron,room=desk,type=airqmobile
iotdata,device=simulator,room=asterix,type=airq

You should try to avoid highly unique tag values such as GUIDs because that will lead to series bloat.
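Since a series is identified by measurement plus tag set (ignoring the retention policy for simplicity here), you can estimate series cardinality by counting distinct series keys. A small sketch, with helper name and sample data taken from the SHOW SERIES output above:

```python
# Sketch: a series key is the measurement plus its sorted tag set; counting
# distinct keys gives the series cardinality of a batch of points.
def series_key(measurement: str, tags: dict) -> str:
    tag_set = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{measurement},{tag_set}"

points = [
    {"tags": {"device": "simulator", "room": "asterix", "type": "airq"}},
    {"tags": {"device": "simulator", "room": "asterix", "type": "airq"}},
    {"tags": {"device": "particle_electron", "room": "desk", "type": "airqmobile"}},
]
keys = {series_key("iotdata", p["tags"]) for p in points}  # two distinct series
```

A tag holding a GUID per point would make every point its own series, which is exactly the bloat to avoid.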

Writing a Point

In this section, we take a brief look at the HTTP API to write data. Using a simple HTTP POST, you can write a point as follows:

curl -i -XPOST 'http://localhost:8086/write?db=telemetry&u=admin&p=test' \
--data-binary 'iotdata,device=simulator,type=airq,room=asterix temperature=23.8,humidity=34.8 1517766354000000000'

As shown earlier, data needs to be POSTed to the HTTP API of InfluxDB. In the query string, the db parameter is required. The rp parameter can be used to set the retention policy for the data. With the u and p parameters, we specify the username and password we created during deployment.

The data is posted as binary data and uses the Line Protocol. In short, the line protocol uses the following syntax:

measurement,tag=value,tag=value,tag=... field=value,field=value,field=... timestamp

Note the space between the tag set and the field set and between the field set and the timestamp. In the curl example, we write data to the iotdata measurement using the tags and fields discussed earlier. The timestamp is a Unix timestamp.
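A line like the one in the curl example can be assembled programmatically. The sketch below (helper name is ours) builds the measurement, tag set, field set and optional timestamp in that order; it skips the escaping and string-field quoting rules of the full line protocol.

```python
# Sketch: build a line protocol string. Simplification: no escaping of
# special characters and no quoting of string field values.
def to_line(measurement: str, tags: dict, fields: dict, timestamp_ns: int = None) -> str:
    tag_set = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_set = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    line = f"{measurement},{tag_set} {field_set}"  # spaces separate the parts
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line
```

Note that tags and fields are sorted here; InfluxDB recommends sorted tag keys for best write performance, though it is not required.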

If you want to write multiple points at once, specify them on separate lines. Writing points in batch is better from a performance point of view.

curl -i -XPOST 'http://localhost:8086/write?db=telemetry&u=admin&p=test' \
--data-binary 'iotdata,device=simulator,type=airq,room=asterix temperature=23.8,humidity=34.8
               iotdata,device=simulator,type=airq,room=asterix temperature=23.6,humidity=34.9'

In the above example, the timestamp is missing. When it is missing, InfluxDB uses the server's time. You may have noticed that the JSON data we send from a device does not contain a timestamp so we use the timestamp set by InfluxDB. In a production setting, it is recommended to include the timestamp from the device if possible.
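The two-point batch above can be assembled in code as follows (a sketch; the actual POST is indicated as a comment):

```python
# Sketch: build the newline-separated body for a batched write. Without a
# trailing timestamp on each line, the server assigns its own time.
lines = [
    "iotdata,device=simulator,type=airq,room=asterix temperature=23.8,humidity=34.8",
    "iotdata,device=simulator,type=airq,room=asterix temperature=23.6,humidity=34.9",
]
body = "\n".join(lines)
# POST `body` as binary data to http://localhost:8086/write?db=telemetry,
# e.g. with urllib.request or an InfluxDB client library.
```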

Querying

InfluxDB has a SQL-like query language called InfluxQL. If you know SQL you should feel right at home with InfluxQL. The SELECT statement in InfluxQL:

select room, temperature from iotdata limit 2
name: iotdata
time                room    temperature
----                ----    -----------
1517766354000000000 asterix 30.06
1517766355000000000 asterix 39.82

If you only want to see temperature data for room asterix:

select temperature from iotdata where room='asterix' limit 2
name: iotdata
time                temperature
----                -----------
1517766354000000000 30.06
1517766355000000000 39.82

Note the use of single quotes in the above statement. In a query, you might want to include time:

select temperature from iotdata where room='asterix' and time > '2018-03-02' limit 2
name: iotdata
time                temperature
----                -----------
1519982708000000000 21.14
1519982709000000000 37.4

The Unix timestamp for March 2nd 2018 at 00:00:00 (midnight) is 1519948800 so we only expect results with a timestamp greater than that value. You will often want data based on relative time, say from the last hour:

> select temperature from iotdata where room='asterix' and time > now()-1h
name: iotdata
time                temperature
----                -----------
1519992388331546252 23.6
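The cutoff used earlier (time > '2018-03-02') is easy to double-check in Python; midnight on March 2nd 2018 UTC works out to the Unix timestamp quoted below.

```python
from datetime import datetime, timezone

# Midnight on March 2nd 2018 (UTC) as a Unix timestamp in seconds,
# and the same instant in InfluxDB's nanosecond precision.
seconds = int(datetime(2018, 3, 2, tzinfo=timezone.utc).timestamp())
nanoseconds = seconds * 1_000_000_000
```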

You can use the same time duration literals as in the section about retention policies. You will often want to work with selectors such as FIRST(), MAX() and so on. For instance, to select the last temperature:

> select last(temperature), room, type from iotdata
name: iotdata
time                last room    type
----                ---- ----    ----
1519992388331546252 23.6 asterix airq

Or to select the highest temperature:

> select max(temperature), room, type from iotdata
name: iotdata
time                max room    type
----                --- ----    ----
1517770922000000000 40  asterix airq

To obtain the maximum value of all fields:

> select max(*) from iotdata
name: iotdata
time max_humidity max_temperature
---- ------------ ---------------
0    87.1         40

The timestamp is obviously zero because these maximum values occurred at different timestamps. Like in SQL, you can also work with aggregations such as COUNT(), SUM(), MEAN(), STDDEV() and more. To calculate the average temperature per room:

> select mean(temperature) from iotdata group by room
name: iotdata
tags: room=asterix
time mean
---- ----
0    29.97499972993343

name: iotdata
tags: room=desk
time mean
---- ----
0    23.47867383512529

The above query also illustrates the use of the GROUP BY clause. To finish, let's count the points:

> select count(*) from iotdata
name: iotdata
time count_humidity count_temperature
---- -------------- -----------------
0    463407         463407

Continuous Queries

In subsequent chapters, temperature and humidity data is saved to the iotdata measurement in the telemetry database. The data comes from a simulator that can be configured to send dummy data at a fairly rapid pace. Although there can be requirements to keep all the original data, you might want to remove data from the iotdata measurement after a few days and keep downsampled data for a longer period. You might, for instance, want to calculate averages over five minute intervals and keep those for a longer period.

To calculate average values over five minute intervals from the original data in the iotdata measurement, you can use a continuous query. The continuous query below makes use of the basic syntax:

CREATE CONTINUOUS QUERY "iotdata.5m" ON telemetry 
BEGIN
  SELECT MEAN(temperature), MEAN(humidity) INTO average_iotdata FROM iotdata GROUP BY time(5m) 
END

The above query runs every 5 minutes (at 5PM, 5:05PM, 5:10PM and so on) and calculates the mean of the values between now() and now() - 5m. It does not create averages for older data. If data is streaming into InfluxDB, you will see the following measurement:

name: average_iotdata
time                mean               mean_1
----                ----               ------
1519998600000000000 29.83490909090907  45.079272727272716
1519998900000000000 29.626899999999974 44.74473333333332
1519999200000000000 30.44486666666666  44.93063333333333
1519999500000000000 30.08263333333334  44.77770000000001

The timestamp 1519998600000000000 is Friday, March 2, 2018 1:50:00PM. The second timestamp is 1:55:00PM. An easy way to convert a timestamp to a human-readable date is the site https://www.epochconverter.com. Note that when you copy and paste the above values, you will lose some precision because InfluxDB uses nanoseconds while the site assumes you are using microseconds. Who still keeps time in microseconds anyway? Tracking events down to 1/1000000000th of a second is clearly what you want.
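You do not need a website for the conversion, of course. A short Python sketch (helper name is ours) that turns an InfluxDB nanosecond timestamp into a readable UTC date:

```python
from datetime import datetime, timezone

# Convert an InfluxDB nanosecond timestamp to a readable UTC date string.
# Integer division avoids float precision issues with 19-digit values.
def ns_to_utc(ts_ns: int) -> str:
    return datetime.fromtimestamp(ts_ns // 1_000_000_000, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
```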

We have only scratched the surface of continuous queries here. For more information, do visit the Continuous Queries page over at influxdata.com.

Downsampling

When you combine retention policies with continuous queries, you can downsample the data and keep storage requirements in check. To try this, set the default retention policy in database telemetry to two weeks:

CREATE RETENTION POLICY "2w" ON telemetry DURATION 2w REPLICATION 1 DEFAULT

Create a second retention policy that keeps data for 6 months (24 weeks):

CREATE RETENTION POLICY "24w" ON telemetry DURATION 24w REPLICATION 1

Now, check the retention policies (make sure you are in database telemetry with USE telemetry):

> show retention policies
name    duration  shardGroupDuration replicaN default
----    --------  ------------------ -------- -------
autogen 0s        168h0m0s           1        false
2w      336h0m0s  24h0m0s            1        true
24w     4032h0m0s 24h0m0s            1        false

When points are written to measurements without specifying a retention policy, the 2w policy is now in effect. Next, we create a continuous query that calculates the mean of temperature and humidity every five minutes. We keep these means for 24 weeks by specifying the 24w retention policy:

CREATE CONTINUOUS QUERY "iotdata_mean" ON telemetry 
BEGIN
  SELECT MEAN(temperature) AS temperature,MEAN(humidity) AS humidity 
  INTO "24w"."iotdata_mean" FROM iotdata GROUP BY time(5m) 
END

The trick here is to specify the retention policy 24w in the INTO clause. If you want to see the summarized points, be sure to include the retention policy:

SELECT * FROM "24w".iotdata_mean

You need to include the retention policy because it is not the default. In subsequent chapters, we will write points to the iotdata measurement. You can use the default retention policy or apply the downsampling techniques described in this section. The examples will work in either scenario.

InfluxDB Administration

InfluxDB saves internal information about the status of the system in the metastore. The metastore contains user information, databases, retention policies and more. In case of a restore, you will need to recreate this information. You can do so by restoring the metastore. But first, you need a backup:

influxd backup /tmp/backup

The above command creates a backup of the metastore in /tmp/backup. Note that such a backup does not back up your data. To back up your data, use the following command:

influxd backup -database telemetry /tmp/dbbackup

The above command backs up the telemetry database. All retention policies are backed up but you can specify specific retention policies with the -retention flag.

In my case, the backup command resulted in the following files:

/tmp/dbbackup # ls
meta.00                     telemetry.2w.00035.00       telemetry.autogen.00004.00  telemetry.autogen.00020.00
telemetry.24w.00036.00      telemetry.autogen.00002.00  telemetry.autogen.00012.00  telemetry.autogen.00028.00

Note that a database backup also backs up metadata.

Although it is good to know how to back up data, it is even better to know how to restore data and where to restore it. The meta directory is /var/lib/influxdb/meta, which in our case is stored on the persistent disk. The folder contains one file called meta.db.

The data directory is /var/lib/influxdb/data with each database in a folder under that path. The database folder (e.g. telemetry) contains folders for each retention policy:

/var/lib/influxdb/data/telemetry # ls
24w      2w       autogen

Note that, as new writes come in, they are first written to write-ahead-log files with a .wal extension under /var/lib/influxdb/wal. New data files with the .tsm extension are written based on certain conditions. It might be that, while you are experimenting, data is sent to the .wal files and not yet written to the .tsm files under /var/lib/influxdb/data.

To restore a database, use the following syntax:

influxd restore -datadir <path-to-datadir> <path-to-backup>

To restore the metastore, use the -metadir flag.

With this chapter behind you, you should have a good idea about the basics of InfluxDB. In the next chapter, we will build an MQTT to InfluxDB bridge that saves points to InfluxDB as they enter Mosquitto.
