Storing Realtime IoT Timeseries Data in Python — Timescale DB + Raspberry Pi + Phidget Sensors
In my last story, I gave a thorough description of how to start a DIY IoT project in Python using a Raspberry Pi board and Phidget sensors. Based on that simple sensing program, anyone can start building their own smart environment. However, the true power of IoT is realized when a large collection of data from multiple sensors is accumulated and analyzed to infer higher-level knowledge. For example, we can build a simple air conditioning service with a controllable air conditioner and a single temperature sensor. However, it wastes energy if the air conditioner turns on just as I am about to leave home, a situation that cannot be anticipated with the temperature sensor alone.
To overcome this limitation and achieve better situational awareness, multivariate timeseries IoT data has been taken into account. For example, applying machine learning techniques such as sequential pattern mining and neural networks to timeseries IoT data can help recognize and predict human activities with high accuracy [1–4]. Multi-dimensional spatio-temporal data is far more helpful and powerful than unimodal sensor data for detecting meaningful correlations with respect to a given target task. With this capability, it becomes possible to determine whether a smart home resident will leave home soon and prevent the air conditioner from wasting energy.
Many machine learning engineers and researchers these days assume that well-structured datasets will simply be given for their tasks. However, I would say that is a naive expectation, especially in the IoT domain. Most commercial IoT applications are about people: their health, their daily activities, their routines, or their future schedules. This inherent characteristic (i.e., privacy) makes it hard to publish the data even for the public good. For example, in the smart home domain, to the best of my knowledge, only the CASAS project led by Diane J. Cook at Washington State University has published a large set of smart home IoT data that domain experts regard as successful.
Under these circumstances, if you want to build your own IoT applications powered by machine learning algorithms, the first thing you need to do is collect your own dataset. In this story, I want to talk about how to do that. How can we collect realtime IoT data from various sensors and store it as well-refined timeseries data? Which practices or frameworks are ‘decent’ (I don’t want to say ‘best’) solutions?
With the same hardware and software setup as my last story, I will give step-by-step instructions on the topics below, with sample code and terminal screenshots:
- Architecture overview — Configuration
- Timescale DB setup
- A Python server to store IoT timeseries data
2. Architecture Overview — Configuration
The figure above briefly illustrates the hardware and software configuration used for the instructions in this story. Technically, this is not an ‘architecture’, but I could not find a better term for the moment. You can use the same configuration or replace some hardware with whatever you prefer. For example, I used an old spare MacBook for the TimescaleDB server just because it has a Unix-based terminal and I don’t want to run my main MacBook 24/7 as a DB server. It can be replaced by any Linux machine if you have one. But be careful: at the time of writing, TimescaleDB does not support ARM cores, so make sure your machine has an Intel core.
3. Timescale DB Setup
Once your hardware and base platforms (e.g., Unix/Linux, Python 3, Phidget, etc.) are ready, we can start the database setup. But before jumping into the installation, let me briefly explain why I chose Timescale for my database software.
I am sure the first database most Python developers think of when they start development is MySQL, and there are many other free alternatives. However, some databases are optimized specifically for time series data; as far as I know, InfluxDB and TimescaleDB are the representative ones. You can find multiple articles (such as this one) comparing these time series databases. You may prefer another one, but I chose TimescaleDB because it supports SQL natively and many articles say it outperforms InfluxDB. Since the following instructions are about developing a Python server that works with TimescaleDB, you may want to stop reading here if you do not want to use Timescale as your database software.
Okay, if you are fine with Timescale, let’s start our database setup. If you visit Timescale’s website, you will find a series of tutorial materials starting from the installation. However, I found them hard to follow step by step because some of the webpages are disconnected and the contents are too complex for our purpose in this story. So I re-organize them in this section so that you can follow along easily and achieve our goal.
The first thing you need to do is install TimescaleDB by following the instructions in this tutorial. (Note that the installation requires Xcode if you use a MacBook as your server machine.)
Once you are done with the installation, it is time to set up the database. I will basically follow the instructions on the Timescale website while changing the details to fit our main purpose. If you completed the installation as the webpage describes, you can connect to the PostgreSQL instance with the user name ‘postgres’:
psql -U postgres -h localhost
Then, create a database in which we will store our IoT timeseries data:
CREATE database homedata;
PostgreSQL is the base of TimescaleDB, and the new database is not extended with TimescaleDB yet. The extension must be created inside the database itself, so first connect to it and then enable TimescaleDB’s features:
\c homedata
CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;
Now you can connect to our TimescaleDB instance, homedata, from the terminal:
psql -U postgres -h localhost -d homedata
Even though we used the psql command in the terminal, we are now connected to our TimescaleDB instance.
With our new database, it is time to create a table that stores the actual data and returns it when asked. The structure of the table should match the list of IoT data your hardware configuration generates. In my setup, five Phidget sensors are attached to a Raspberry Pi board via a Phidget VINT HUB:
- 2 Sound sensors (each of which generates 10 values corresponding to 10 different octaves)
- 1 Temperature sensor
- 1 Humidity sensor
- 1 Precision light sensor
To store these sensor data in my TimescaleDB, I created the database table as follows:
CREATE TABLE sensor_data(
    db_insert_time TIMESTAMPTZ NOT NULL,
    room TEXT NOT NULL,
    data_gen_time TIMESTAMPTZ NOT NULL,
    sound1 REAL[] NOT NULL,
    sound2 REAL[] NOT NULL,
    temperature REAL NOT NULL,
    humidity REAL NOT NULL,
    light REAL NOT NULL
);
In the CREATE query above, I separated the time when a record is inserted into the database (db_insert_time) from the time when the data is actually generated (data_gen_time). This is because the causal order of data generation can get mixed up when multiple sensor clients transmit their data simultaneously. With future extensions in mind, I also added a ‘room’ column, which indicates the location the sensor data is transmitted from. Since each sound sensor generates 10 values at a time, their data is stored as arrays of real numbers; note that Timescale DB supports storage of ARRAY type data. For more details about the data types that can be stored in PostgreSQL, you can visit this website.
By using the \dt command, you can confirm that the table was created successfully. By using the \d command with the table name as an argument, you can verify that the table’s columns are defined as you intended:
The last step is to transform the table into a hypertable with the create_hypertable command:
SELECT create_hypertable('sensor_data', 'db_insert_time');
Congratulations! Now you are ready to insert IoT data into TimescaleDB! We can start developing our Python server program, which accepts a collection of IoT data and stores it in the database.
4. A Python Server to Store IoT Timeseries Data
Before starting the Python programming, please note that you can download my sample code from my GitHub repository. After cloning the repository, check out the server-client-timescaledb branch. In that branch, you will find the sample source code for the server and client Python programs. Once you customize the config files for your local environment, you should be able to run the programs on your machine. Even if you get errors due to environmental differences between my setup and yours, you will quickly figure out how to exploit the necessary Python libraries to achieve our goal.
I put a Github gist below for the server program code:
Assuming you can download and inspect my sample code, I will skip most of the details in this article.
The Python server program that we develop in this story is composed of two main building blocks:
- TCP socket programming to receive IoT sensor data from sensor clients
- Database connection using psycopg2 python package
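As a minimal sketch of the first building block (the port number and newline-delimited JSON framing here are my assumptions for illustration; my actual sample code on GitHub may differ), a TCP server can accept client connections and split the incoming byte stream into complete JSON messages:

```python
import json
import socket
import threading

HOST, PORT = "0.0.0.0", 9000  # hypothetical bind address and port


def parse_messages(buffer: bytes):
    """Split a byte buffer into complete newline-delimited JSON messages.

    Returns (parsed_messages, leftover_bytes); leftover holds a partial message
    that will be completed by the next recv() call.
    """
    *lines, leftover = buffer.split(b"\n")
    messages = [json.loads(line) for line in lines if line.strip()]
    return messages, leftover


def handle_client(conn: socket.socket) -> None:
    """Read from one sensor client until it disconnects."""
    buffer = b""
    with conn:
        while chunk := conn.recv(4096):
            buffer += chunk
            messages, buffer = parse_messages(buffer)
            for msg in messages:
                print("received:", msg)  # hand off to the database layer here


def serve() -> None:
    """Accept connections forever, one thread per sensor client."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen()
        while True:
            conn, _addr = srv.accept()
            threading.Thread(target=handle_client, args=(conn,), daemon=True).start()
```

A thread per client keeps the example simple; an asyncio-based server would also work if you expect many simultaneous sensor clients.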
I believe you can pick up the first part without any problems if you are familiar with Python and have experience with socket programming. The more important part of this article is the second one. The psycopg2 package allows you to query the database from Python. To learn more about its use, I suggest visiting these websites:
- Psycopg 2.8.5 documentation (link)
- “Python PostgreSQL Insert, Update and Delete table data to Perform CRUD Operations,” PYnative (link)
Supposing that you have downloaded my sample code and taken a look at the package documentation above, you will easily find that the sensor client transmits the sensor data in a JSON format and the server parses it before inserting the transmitted data into the database, rather than storing the JSON object as it is. I made this design choice to fully exploit TimescaleDB operations optimized for timeseries data while sampling the large dataset with an SQL query: this is more convenient than fetching data from the database and sampling it separately in a Python program. I chose the JSON format because it is regarded as a standard in IoT data exchange.
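To make the parse-then-insert step concrete, here is a minimal sketch. The column names follow the sensor_data table above, but the JSON payload layout, the helper names, and the connection settings are my assumptions for illustration; see my GitHub sample code for the real details.

```python
import json

# Parameterized INSERT matching the sensor_data table created earlier;
# db_insert_time is filled in by the database itself with NOW().
INSERT_SQL = """
    INSERT INTO sensor_data
        (db_insert_time, room, data_gen_time, sound1, sound2,
         temperature, humidity, light)
    VALUES (NOW(), %s, %s, %s, %s, %s, %s, %s);
"""


def to_insert_params(payload: str):
    """Parse one client JSON payload into the parameter tuple for INSERT_SQL."""
    d = json.loads(payload)
    return (d["room"], d["data_gen_time"], d["sound1"], d["sound2"],
            d["temperature"], d["humidity"], d["light"])


def insert_reading(payload: str,
                   dsn: str = "dbname=homedata user=postgres host=localhost"):
    """Insert one reading; requires psycopg2 and a running TimescaleDB."""
    import psycopg2  # imported lazily so the parsing helper works without it
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(INSERT_SQL, to_insert_params(payload))
```

Passing the values as query parameters (%s placeholders) lets psycopg2 handle quoting and adapt Python lists to PostgreSQL arrays, and avoids SQL injection from malformed payloads.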
If you replace my SQL query with your own so that it fits your sensor data and database table, you will see that the transmitted IoT data is stored successfully in the Timescale database. You can check it with a SELECT query:
SELECT * FROM sensor_data;
Once you have collected a large amount of IoT timeseries data, you can leverage TimescaleDB operations to fetch it, again using the psycopg2 package, and analyze it by applying multivariate timeseries analysis algorithms or appropriate machine learning techniques.
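One TimescaleDB-specific operation worth knowing here is time_bucket, which downsamples a timeseries inside the database instead of in Python. The bucket width, aggregated columns, and connection string below are my choices for a sketch, not part of my sample code:

```python
# Average readings per room in 5-minute buckets, computed in the database
# with TimescaleDB's time_bucket() function.
BUCKET_SQL = """
    SELECT time_bucket('5 minutes', data_gen_time) AS bucket,
           room,
           AVG(temperature) AS avg_temp,
           AVG(humidity)    AS avg_humidity
    FROM sensor_data
    GROUP BY bucket, room
    ORDER BY bucket;
"""


def fetch_buckets(dsn: str = "dbname=homedata user=postgres host=localhost"):
    """Run the aggregation; requires psycopg2 and a running TimescaleDB."""
    import psycopg2
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(BUCKET_SQL)
        return cur.fetchall()
```

Downsampling this way keeps the data transferred to your analysis program small, which matters once the table grows to millions of rows.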
In this article, I covered how to set up TimescaleDB for storing an IoT timeseries dataset and how to connect Python client and server programs to the database. I did not explain my Python sample code in depth, but considering its simplicity and the packages it leverages, I hope it helps you understand the whole procedure. In future stories, I plan to cover how to apply machine learning algorithms to find associated patterns in the stored timeseries data.
[1] Bourbia, Amine Lotfi, et al. “Temporal dependency rule learning based group activity recognition in smart spaces.” 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC). Vol. 1. IEEE, 2016.
[2] Alam, Mohammad Arif Ul, Nirmalya Roy, and Archan Misra. “Tracking and Behavior Augmented Activity Recognition for Multiple Inhabitants.” IEEE Transactions on Mobile Computing (2019).
[3] Krishna, Kundan, et al. “An LSTM based system for prediction of human activities with durations.” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1.4 (2018): 1–31.
[4] Wang, Tinghui, and Diane J. Cook. “sMRT: Multi-Resident Tracking in Smart Homes with Sensor Vectorization.” IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).