How to move data from MSSQL to ElasticSearch at galactic speeds

Shortly after the one billion mark has been reached with my side project I got a question from an old classmate whether I had ever heard about ElasticSearch. I did, but I’ve never had made any time to dive into it any further. This would be a nice moment to dive into it a bit further. In this post we’ll dive into the data feeding process I used to move some data out of MSSQL and into ElasticSearch.

At first I was a bit skeptical about nosql document storage. I heard good things about the performance of ElasticSearch and I’ve seen and loved Kibana for quick data visualization experiments. So lets jump in.

This post is inspired by / based on the post at https://instarea.com/2017/12/06/heavy-load-ms-sql-elasticsearch/ yet with a bit more code to get you up and running quickly.

We’re assuming that both the MSSQL and ElasticSearch databases are running on localhost. What we’re going to do in a nutshell is to export data from MSSQL to a JSON file and import this file in ElasticSearch.

Preparing the source

Retrieving the data Since MSSQL 2016 it’s possible to export data directly from the database engine. That’s exactly what we’re going to do. There’s a few catches though.

1. The documents you want to import should be separated by a newline
2. Each document you want to import should be preceded by a command

The following type of command can be used to retrieve data in almost the correct format:

SELECT (SELECT
dbo.ToDateTime2(Ticks) AS 'timestamp'
, Location.Lat AS 'location.lat'
, Location.Long AS 'location.lon'
FOR JSON PATH, WITHOUT_ARRAY_WRAPPER)
FROM dbo.TableName

Every row you select will be a JSON object, or soon to be ElasticSearch document.

Exporting numeric values with a certain accuracy

When exporting numeric data MSSQL is usually trying to be overly accurate (Displaying doubles as 4.782966666666670e+001 for example). ElasticSearch can parse this as numeric data, but it is unable to parse these values as geo_point values. Besides this it makes the exported JSON file way bigger then it needs to be.

It is recommended to use the FORMAT function to select the number of decimal places you want to show. E.g. FORMAT(Location.Lat, ‘N5’) to retrieve a value with an accuracy of 5 decimal places.

You can store the result set from the SQL query in several ways. One way is by copy / pasting the results from the query window in SSMS or whatever your SQL editor is. Personally I find it more convenient, especially with large datasets, to use the bcp command in order to directly output the results to a file. bcp can be used in the following way:

bcp "YOUR SQL QUERY" queryout ./output.json -c -S "SERVER_LOCATION" -d DATABASE_NAME -U "DATABASE_USER" -P "DATABASE_USER_PASSWORD"

Every document which has to be indexed by ElasticSearch should be preceded by a command, though. We want to index the documents, but there are plenty of other commands you can use. See here for a list of commands.

We just use {"index":{}} as a command which tells ElasticSearch we want to index the document. We’ll define the index and type later on while indexing the document.

In order to precede all documents with a command the awk command can be used:

Importing the data

The resulting document is ready to be imported into ElasticSearch. Depending on the size of the output you might want to chop up the file in bits in order to make sure the bulk import still works.

The following script as found here can be used to chunk up your file and import the data into ElasticSearch. Please note you have to make the following changes to the document:

• The filename which contains the data you want to import. In this case output.json.
• The index name in the URL ([INDEX])
• The type name in the URL ([TYPE])
#!/bin/sh
LINES=400000
split -l $LINES output.json for xf in$(ls | grep x..$) do curl -o nul -H 'Content-Type: application/x-ndjson' -XPOST localhost:9200/[INDEX]/[TYPE]/_bulk --data-binary @$xf
rm $xf done Hint: You can save the script above as import.sh. Execute it by either running sh import.sh or bash import.sh. I think either will work. I keep forgetting how to run scripts all the time. How fast is it? This fast. Netdata started complaining about the speed the (almost empty 2tb) drive was being written to. Without joke. It was fast. The file size was about 4.8gb, containing 17,075,262 documents. See the netdata screenshot for the performance hit it took. A rough scan shows the import began at 20:23:50 (without slicing the file, that is), and was done at 20:27:15. Which is 205 seconds. Dividing 17,075,262 by 205 amounts to a little more then 83,000 documents each second. Wanna know about another piece of magic? Check the index size. corstian@db:~$ curl -XGET "http://localhost:9200/_cat/shards?v"
index     shard prirep state         docs   store ip        node
.kibana   0     p      STARTED          2   6.8kb 127.0.0.1 VnzjBHy
positions 2     p      STARTED    3415189 478.9mb 127.0.0.1 VnzjBHy
positions 2     r      UNASSIGNED
positions 3     p      STARTED    3415823 477.7mb 127.0.0.1 VnzjBHy
positions 3     r      UNASSIGNED
positions 4     p      STARTED    3414075 476.3mb 127.0.0.1 VnzjBHy
positions 4     r      UNASSIGNED
positions 1     p      STARTED    3413247 478.2mb 127.0.0.1 VnzjBHy
positions 1     r      UNASSIGNED
positions 0     p      STARTED    3416929 479.4mb 127.0.0.1 VnzjBHy
positions 0     r      UNASSIGNED

There are five shards with around 480mb of data, which is about 2.4gb. I kid you not, that’s just half the size of the file we’ve just indexed! Ofcourse there’s the overhead in the indexed file of the commands we added but still. That’s truly amazing!

Achieving one billion

Over the last years I’ve been working continuously with large data sets. Whether it’s about air quality, aircraft movements, plant growth or weather information. I love to process ‘pretty big data’ as I call it. Putting this data into data stores, processing it as fast as possible, preferably real-time or near real-time, and building cool applications on top of this data.

Yesterday is the day that one of my projects passed the ‘magic’ 1 billion record count. Lets look at all the stuff that got out of hand so badly that this could’ve happen.

Fiddling with software

A really long time ago when I was just a little kiddo, possibly somewhere about 14 years old, I was playing with some PHP websites and scripts. Figuring out how things were working, trying to write some new code, and trying to get the computer to do what I was trying to achieve. It was at this time that I was trying to write a contact form which should’ve send the input in the form as email to my own mailbox. I’m not sure what I did wrong but in one way or another the script to send the mail got stuck in an infinite loop and started continuously sending mail to my own email address, which was hosted at my father’s company.

Five minutes later my father called me. They got a call from the ISP that they (me) were taking down their server(s?) with a mail bomb. These emails caused carnage everywhere in between my little web server and the receiving mail server (at my father’s company). In the timespan of only five minutes I ended up overloading a few mailservers (which were probably incorrectly configured either way), and about 80,000 emails in my own mailbox.

That’s only 25 mails per second.

This was the last moment ever I underestimated the power of computers.

Around the same time that I set off the email bomb I also learned about databases. Well, I knew there were tables, and I knew I could store and retrieve data with them, but don’t you dare ask me about foreign keys, indices and other technical details. What did I know? When a classmate showed me a database which contained a table with about 40,000 records I was amazed. How could one get to fill a database with so much information? I could not imagine I was ever about to create a database so big.

A while later I met some people who wanted to build an online platform which would help consumers find the perfect car for their needs based on a set of questions. For this system to work we needed a bit of data about different cars so I wrote a tool to scrape some websites and put this information in a database. After letting this scraper run for a few nights we ended up with information about 400,000 different types of cars. Note I still did not know about foreign keys and all that stuff, but the numbers were getting bigger.

Growing bigger

When I was 17 years old I started writing software at my father’s company. It was this time that I wrote the base for a system which now processes information from about 1000 IoT like devices in real-time. It was at this time that I started learning about database internals because I had to. I needed it to keep the response times in an acceptable window.

I remember searching around on google about experiences people had with large data sets in order to be able to estimate the amount of resources needed. I was worried the whole thing would burn down as soon as 13 million records were processed. These numbers felt so big that I could not possibly imagine how much resources were needed to work with it. And amazingly it continued to work to this day, with 16 million records and counting.

… and bigger

These days my biggest side project is about processing flight information from glider aircraft. When I started one and a half year ago I had no idea about how fast this project was about to grow. In less then a year the tool processed one billion data points in real-time, sometimes peaking at about 20,000 points / second, amounting to millions of points each day.

And then there are the learning moments in this traject. Like trashing a live database with 300 million records, and not having backups. The moments I’ve been trying to debug and fix the data processing beast without causing any downtime, and all moments I was busy doing performance optimization and I did not think I was going to make it. It’s funny how both the idea and the program evoluted at the same time. In the beginning I just wanted to process a small amount of data, and it took 10 seconds to process 7,000 records. Right now the same amount of information can be processed and distributed in just shy of a quarter of a second.

And this way the project starting to grow, and continued growing, regardless of the moments I just wanted to trash it because I did something stupid, or the moments I did not have any energy to continue working on it, or because of any other reason there was to stop.

And now it contains 1,000,000,000 data points. Actually I’m more proud about this achievement then I care to admit. It feels like an amazing milestone for me personally 🎉.

A quick calculation shows that these one billion data points represent more then 60 full years of experience flying aircraft in all kinds of conditions. This is more then 20,000 full days.

What’s next?

Data alone is not useful. There is currently so much data stored that no one will understand the context of the data on itself without proper aggregation. One of the biggest goals for the future is to streamline the data aggregation and information distribution processes.

My goals are the following:

• Develop or set up an information processing pipeline which allows for rapid development of new data processing steps
• Develop API’s so that this information can be shared with the world
• Apply machine learning to this data set in order to be able to predict several things, including thermalling hotspots based on the weather conditions
• Develop a proper user interface and attract pilots to use this platform for their flight analysis needs

The goals above will probably take several years to achieve. Nevertheless, on to the next 10 billion data points and amazing data processing tools!

Scaffolding an existing SQL database with Entity Framework Core in 5 minutes

Sometimes it’s nice to get a break from ‘legacy’ software. In this case we would like to get started using Identity Server 4 with an existing database running on SQL. Wouldn’t it be nice to get up and running in a few minutes? Hold on.

We’re using the dotnet cli for speed, and cross platform usefulness (OS X, Windows(?) and Linux). We assume you’ve booted your favorite terminal and you are in your solution folder. Buckle up buddy!

mkdir <YOUR_PROJECT_FOLDER>
cd <YOUR_PROJECT_FOLDER>
dotnet new classlib

After these packages have been installed we need to add the following two lines to the .csproj file in the current folder. These are required in order to use the Entity Framework tooling from the command line. Use your favorite text editor:

<ItemGroup>
<DotNetCliToolReference Include="Microsoft.EntityFrameworkCore.Tools.DotNet" Version="2.0.0" />
<DotNetCliToolReference Include="Microsoft.VisualStudio.Web.CodeGeneration.Tools" Version="2.0.0" />
</ItemGroup>

Ps. make sure these 4 lines are somewhere within the <Project> tags, or grouped with other <DotNetCliToolReference> tags eventually already in your file.

In order to be able to run migrations from this project we will configure it to be able to act as being startup project. There should be a line which is:

<TargetFramework>netstandard2.0</TargetFramework>

Change it to:

<TargetFrameworks>netcoreapp2.0;netstandard2.0</TargetFrameworks>

Just copy paste it for your speed, and sanity. Now you’re ready to scaffold your data model from the database. In order to do so:

dotnet ef dbcontext scaffold "<CONNECTION STRING>" Microsoft.EntityFrameworkCore.SqlServer

In case you get an error about the framework versions you need to install, just determine the current version by running the dotnet --info command and grabbing the value from under the Microsoft .NET Core Shared Framework Host line. Next add the following tag just under the <TargetFramework> tag and you’re good to go.

<RuntimeFrameworkVersion>2.0.5</RuntimeFrameworkVersion>

Congratulations! You should now have your existing data model ready to use within your .NET Core application using Entity Framework Core :)

Visualizing airspace usage by glider aircraft

In the past year I have done a significant amount of work to process flight information from glider aircraft. For reference, in 2017 I have processed about 1,000,000,000 position updates and at the end of the year I had a database which stored about 700,000,000 data points, which represented over 30,000 unique aircraft. One data point contains information like latitude, longitude, altitude, speed, heading, climb-rate and turn-rate.

Funny thing; the difference between the amount of processed points and stored points came from beginning of the year, when I accidentally deleted the database, containing 300,000,000 data points, during a migration. Oooops. Should’ve made proper back-ups!

A while ago I decided to export just a little bit of data to look into in detail. The data export contained all data from Sunday 13th of August, 2017, on which the 19th edition of the FAI European Gliding Championships has been held from Lasham Airfield.

To get an idea of the sheer size of the event, check out the video from the British Gliding Team above. In order to get an idea of the glider activity, on this day alone, check the image below.

Please note that on this day alone about 14,000,000 position update have been recorded. Considering an aircraft sends out an position update about once a second that means a little less then 4000 hours of flight activity has been recorded, on this day alone.

I managed to create a quick visualization of gliders coming back to Lasham airfield starting at around 15:30Z. See below!

The things I’m displaying here represent only a small portion of the work which has been done, and an even smaller portion of the full potential of this data in combination with the right tooling. Although this by far isn’t the end result I’m wanting to achieve I’m already quite proud on this humble result! More and better stuff will definitely follow in the future! Stay tuned for further updates! :)

.NET core application not logging output to Docker

Since the introduction of .net core 2.0 everything on the .net platform seems to become better and better. Since the introduction of Visual Studio for mac it encouraged me to switch over to a mac OS X only dev environment.

Needless to say my current development environment with Visual Studio, Docker, and Kitematic works amazing. And I only just started figuring things out.

The thing which has been bugging me though was that it seemed like application output (stdout and stderr) was not being logged to Docker while it was being logged to Visual Studio.

How to get an application to log to Docker

Docker containers built from microsoft/dotnet:2.0-runtime or microsoft/aspnetcore:2.0 have a /remote\_debugger volume which maps to ~/.vsdbg. For now I can only imagine how this stuff works, but I suppose that when the remote debugger is attached the application only logs to the remote debugger which may or may not be favorable behavior.

In case you know more about the /remote\_debugger volume and the ~/.vsdbg thingy or you can point me in the right direction, I’d love to hear more from you! Hit me up on Twitter!