Data @ GOTO Berlin 2016

December 16, 2016

We are surrounded by tons of data. Our software collects massive amounts of data that need to be put into a form we can work with. How do other companies deal with their data? What are the best practices, tools and lesser-known approaches?

The Data track at GOTO Berlin 2016 includes sessions with speakers seeking to answer these questions. Watch the videos from the Data track below.

Conflict Resolution for Eventual Consistency

with Martin Kleppmann, Researcher at University of Cambridge

What do collaborative editors like Google Docs, the calendar app on your phone, and multi-datacenter database clusters have in common?

Answer: They all need to cope with network interruptions, and still work offline. They all allow state to be updated concurrently in several different places, and asynchronously propagate changes to other nodes. If data is concurrently changed on different nodes, you get conflicts that need to be resolved.

There are different approaches to handling those conflicts: some systems let the user manually resolve them; some systems choose one version as the winner and throw away the other versions; and some systems try to merge concurrent updates automatically. For example, Google Docs uses an algorithm called Operational Transform (OT) to perform this merge, while Riak uses Conflict-Free Replicated Datatypes (CRDTs) to achieve a similar thing.

In this talk we will explore these algorithms for automatic merging. They start out quite simple, but as we shall see, they soon become fascinatingly mind-bending once you start trying to do more ambitious things. For example, if you wanted to write your own spreadsheet app or graphics software that allows several users to edit the same document concurrently, how would you go about doing that?
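To give a feel for how simple the starting point can be, here is a minimal sketch of one of the most basic CRDTs, a grow-only counter. It is not taken from the talk or from Riak's implementation; the class name `GCounter` and the replica names `phone` and `laptop` are assumptions chosen for illustration. Each replica only increments its own slot, and merging takes the per-replica maximum, so concurrent updates never conflict.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GCounter:
    """Illustrative grow-only counter (hypothetical class, not a library API)."""

    node_id: str
    counts: Dict[str, int] = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        # A replica only ever increases its own slot.
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Element-wise maximum: commutative, associative and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    @property
    def value(self) -> int:
        # The counter's value is the sum over all replica slots.
        return sum(self.counts.values())


# Two replicas are updated concurrently while disconnected...
phone = GCounter("phone")
laptop = GCounter("laptop")
phone.increment(2)
laptop.increment(3)

# ...and converge once they exchange state in either direction.
phone.merge(laptop)
laptop.merge(phone)
```

After both merges, the two replicas report the same value. Richer data types such as text documents or spreadsheets need far more involved merge rules, which is where the mind-bending part begins.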

Download the Slides


Building a culture of Experimentation at Spotify

with Ben Dressler, Experimentation Lead at Spotify

Running an experiment is trivial: Make a change and see what happens.

Running experiments at scale, however, is a different story.

It is not trivial to simultaneously run hundreds of experiments across 100 million users. It’s not trivial to cover dozens of platforms and markets while staying on top of the technical and methodological complexities. And it’s also not trivial to build a culture of curiosity where people are happy to expose their ideas to a reality check and are not afraid to see them fail.

This talk will touch on Spotify’s approach to scaling Experimentation and lessons learned along the way.

Download the Slides


Enabling the Panama Papers Investigations with Open Source Tools

with Michael Hunger, GraphAddict at Neo4j

The biggest leak in journalistic history has not only been mind-blowing for everyone but also challenging for the team of journalists and developers known as the ICIJ. With more than 11M documents totaling 2.6TB of information, it is truly impressive that a small team of 3 developers could support more than 400 journalists in a year’s worth of investigative work.
This became possible through the efficient use of open source technology for scanning and extracting text and metadata from the documents. The biggest difference, though, was made by the power of a graph database to connect the people, companies and accounts revealed in the investigation.

Especially for the non-technical journalists, the ability to unearth all those connections “was like magic”. Collaborating on research, they benefited from each other’s work and saw the bigger picture grow more interesting every day.

In this talk I detail the process and the technologies used by the journalists for their investigative work, including Apache Solr, Apache Tika, and Neo4j. Then I focus on their work with Neo4j, the data model they developed and the types of queries and interactions that helped them grow their understanding. We discuss how tools for visual graph exploration and search enable even non-technical users to benefit from working with large amounts of connected data.

Using the officially published dataset of 3.4M records, we demonstrate how new insights into your existing, disconnected data are just one graph query away.
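As a rough illustration of what such a connection query does, the sketch below runs a breadth-first search over a tiny in-memory graph. It is not Neo4j and not the ICIJ data model: the entities ("Person A", "Company X" and so on), the `GRAPH` dictionary and the `shortest_path` helper are all made up for this example. A graph database answers the same kind of question declaratively and at a much larger scale.

```python
from collections import deque
from typing import Dict, List, Optional

# Entirely hypothetical entities and links, invented for illustration only.
GRAPH: Dict[str, List[str]] = {
    "Person A": ["Company X"],
    "Company X": ["Person A", "Account 1", "Intermediary M"],
    "Account 1": ["Company X"],
    "Intermediary M": ["Company X", "Company Y"],
    "Company Y": ["Intermediary M", "Person B"],
    "Person B": ["Company Y"],
}


def shortest_path(
    graph: Dict[str, List[str]], start: str, goal: str
) -> Optional[List[str]]:
    """Return one shortest chain of links from start to goal, or None."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])  # each queue entry is a path explored so far
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# How is "Person A" connected to "Person B"?
path = shortest_path(GRAPH, "Person A", "Person B")
print(" -> ".join(path) if path else "no connection found")
```

The search reveals the chain of companies and intermediaries linking the two people, which is the kind of hidden relationship that is hard to spot when the same records sit in disconnected tables or documents.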

Download the Slides


Mining Repository Data to Debug Software Development Teams

with Elmar Juergens, Consultant at CQSE GmbH

If the team architecture and the technical architecture do not fit together, problems arise. Both architectures evolve, however, often causing misalignment. How can we notice such mismatches and react in time?

In this talk, I present modern analysis techniques that mine data from different software artifacts (e.g. lightweight architecture specifications, code and trace files) and version control systems. They reveal problems in the code or design that result from communication problems in the development team.
I present the analysis techniques using examples from open source and commercial software systems. I also outline both best practices and limitations in employing such analyses that I collected over 10 years.

Download the Slides


Monitor your containers with the Elastic Stack

with Monica Sarbu, Software Engineer at Beats at Elastic

Containers, as well as orchestration systems like Kubernetes, are quickly gaining popularity as the preferred tools for deploying and running microservices. While they are easier to deploy and isolate, containerized applications create new challenges for logging and monitoring systems.

One popular solution for logging and monitoring is the Elastic Stack, composed of Elasticsearch, Logstash, Kibana, and Beats. This talk shows you how to use the Elastic Stack, and in particular the Beats lightweight shippers, to collect logs and metrics from your containers.

The session includes details about how to:

  • fetch the logs of the containers with Filebeat
  • collect container metrics with Metricbeat
  • monitor the network traffic exchanged between containers with Packetbeat
  • automatically discover metadata from Docker containers
  • visualize the collected data with predefined Kibana dashboards
  • scale Logstash deployments

Download the Slides
