Jul 31, 2024
Development

Engineering Blogpost: Fast, Reliable, and Consistent Data Management via Advanced File Discovery and Reconciliation Techniques

At Getvisibility, we're pioneering the future of data security with cutting-edge file discovery and reconciliation techniques. In this engineering deep-dive, we explore how our innovative use of Kafka and concurrent algorithms ensures rapid, consistent, and universal data acquisition. Discover how we've tackled the challenges of high-speed file aggregation, maintaining data accuracy, and supporting diverse data sources, all while enhancing the user experience. Join us as we unveil the technical intricacies and solutions that empower our robust data security posture management.

Fast, Consistent, Universal Data Acquisition

Introduction

At Getvisibility we process large volumes of data to make sure it complies with all the required security policies. But data changes all the time: it is created and deleted at an ever-increasing pace, and we need to keep up. In this article, I will explain one of the fundamental mechanisms we use to scan and acquire data, as well as to detect changes in it.

Motivation

Our architecture relies heavily on event sourcing using Kafka, where the first event that our system handles is the discovery of a new file. The speed at which we can discover new documents is critical. We face a unique challenge that has not yet been fully addressed by the industry: achieving file discovery and aggregation from any data source at an unprecedented rate. Additionally, data sources vary significantly (consider downloading a document via the Google Drive API versus cloning a repository on GitHub). Therefore, we needed a universal method to reconcile the inevitable discrepancies between our internal list of files and the actual data from different sources.

Description of the problem

Our solution had to satisfy three core requirements: performance, consistency, and universality.

Some of our customers manage tens of millions of files, and they expect these files to be processed within a reasonable timeframe. The initial step in this process is discovering the files, and it’s crucial that this happens quickly. Fast file discovery provides immediate feedback to the user, allowing them to understand the volume of data they have right away. This immediate insight is valuable even before the classification of the files is completed. Ultimately, the objective is to ensure the discovery phase is rapid, meeting customer expectations and enhancing their experience.

Consistency means that even when files are added, modified, moved, or deleted, the user needs minimal intervention to reconcile the data in Getvisibility with the current state of the source. It can also happen that transient errors prevent a set of files from being discovered correctly during a scan, so there must be an automatic mechanism to fix this.

Finally, whatever solution we adopt needs to work with as many data sources as possible with minimal developer effort. By making the abstraction general enough to fit the most disparate sources, we can avoid repeated refactoring and focus on writing adapters for each new source.

Prior art

While existing Kafka connectors offer varied solutions, none met our specific needs for high-speed, filesystem-like data aggregation, prompting us to develop our own optimized connectors.

Description of the solution

The first aspect to tackle is finding a good abstraction that encapsulates the concept of discovery in any data source. Our use case is largely built around hierarchical file system containers, so we need to be highly optimized for that: most of our data sources organize documents in a tree structure. A few of them are actually a flat list of files, but this is not a problem, because a flat list is just a degenerate tree: a root with no branches, only a single level of children.

Having this abstraction enables us to focus on a performant and consistent solution that will efficiently process all the data customers will throw at us.
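
To make this concrete, here is a minimal sketch of what such an abstraction could look like. The interface and its names are illustrative, not our actual internal API: any source that can enumerate the entries of a container fits, whether it is hierarchical or flat.

```java
import java.util.List;

// Illustrative discovery abstraction: a hierarchical source returns nested
// containers, while a flat source returns a root whose children are all leaves.
public interface DiscoverySource {

    // An entry is either a container (directory-like) or a leaf (file-like).
    record Entry(String id, String name, boolean isContainer) {}

    // The single root of the (possibly degenerate) tree.
    Entry root();

    // List the direct children of a container, one call per directory.
    List<Entry> children(Entry container);
}
```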

Crucially, we must not be constrained by the speed at which files can be discovered sequentially. Making the code faster helps, but scaling vertically can artificially cap throughput by not fully leveraging the source's capabilities: when we ask to list the files in a directory "grades" and receive the subdirectories "Class A" and "Class B", a sequential scanner proceeds to scan "Class A" first, even though the source could easily serve a scan of "Class B" at the same time.

We implemented this with a concurrent algorithm: every time a new folder is discovered, its child folders are enqueued to be scanned in parallel by the available workers. This effectively reaches the maximum discovery speed that many sources can sustain.
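
The sketch below shows the shape of this idea for a local filesystem; it is a simplification with illustrative names, since in production the workers are Kafka consumers spread across services rather than threads in one process (which is exactly why knowing when the scan is finished becomes the harder problem discussed later). Here a Phaser tracks in-flight directories so the caller knows when the traversal has drained:

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;

public class ParallelDiscovery {
    private final ExecutorService workers = Executors.newFixedThreadPool(8);
    private final Phaser inFlight = new Phaser(1); // one party for the caller

    public void scan(Path root) {
        enqueue(root);
        inFlight.arriveAndAwaitAdvance(); // returns when no directory is pending
        workers.shutdown();
    }

    private void enqueue(Path dir) {
        inFlight.register(); // one party per directory still to be listed
        workers.submit(() -> {
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    if (Files.isDirectory(entry)) {
                        enqueue(entry); // siblings are picked up by idle workers
                    } else {
                        System.out.println("discovered: " + entry);
                    }
                }
            } catch (Exception e) {
                // transient failures here are what reconciliation later repairs
            } finally {
                inFlight.arriveAndDeregister();
            }
        });
    }
}
```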

On top of this, we need to amend the scanned data with any updates: new files, changes, and deletions. For that we developed a reconciliation feature that periodically compares the actual source content with the previously discovered content. This feature serves two purposes:

  1. It keeps the data current when the source does not push updates natively (e.g. SMB), so it is effectively a universal way to keep data fresh.
  2. It reconciles any differences caused by transient network partitions or other errors.

Furthermore, the periodic reconciliation incrementally sends data for processing after discovery. With tens of millions of documents to process, forwarding only the ones that changed is crucial to avoid wasting time and processing power.
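
The core of such a pass is a diff between the stored state and a fresh listing. The sketch below uses made-up types (FileEntry, with "etag" standing in for whatever change marker a source exposes, such as a hash or modification time); the real feature works against Kafka and our database, but the idea is the same:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative types: "etag" stands in for whatever change marker the
// source exposes (content hash, version number, modification time).
record FileEntry(String path, String etag) {}

enum ChangeKind { CREATED, MODIFIED, DELETED }

record Change(ChangeKind kind, String path) {}

class Reconciler {
    // Compare the fresh source listing against the previously discovered
    // state and return only the differences; unchanged files produce no
    // events, so nothing is re-sent downstream for processing.
    List<Change> diff(Map<String, FileEntry> stored, Map<String, FileEntry> fresh) {
        List<Change> changes = new ArrayList<>();
        for (FileEntry entry : fresh.values()) {
            FileEntry previous = stored.get(entry.path());
            if (previous == null) {
                changes.add(new Change(ChangeKind.CREATED, entry.path()));
            } else if (!previous.etag().equals(entry.etag())) {
                changes.add(new Change(ChangeKind.MODIFIED, entry.path()));
            }
        }
        for (String path : stored.keySet()) {
            if (!fresh.containsKey(path)) {
                changes.add(new Change(ChangeKind.DELETED, path));
            }
        }
        return changes;
    }
}
```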

One of the challenges of this architecture is keeping track of the progress of the scan. This seemingly "extra" feature is actually necessary: how would you otherwise know that the scan is complete? The moment we turned the acceleration up to the maximum by parallelizing the discovery process, we lost any intrinsic ordering, so there is no longer a clear-cut way to know when the last file was discovered.

So how did we do it?

A first possible solution would be to make the threads (or consumers, in the Kafka context) communicate with each other, the finishing condition being: when all threads are done, the discovery is finished. However, the problem is a bit deeper, because for a single thread to know that it is done, it needs to know that no more work will be scheduled on it, which brings us back to the question we started with.

Well, in principle, if all threads communicate with each other, keeping track of where they are in the exploration of the tree, they will know when they are done because they can see that all branches are explored. But this requires an in-memory representation of the entire tree. We can do some math.

We are considering a file system with 10 million directories and 90 million files which is reflective of our biggest deployments. Each directory and file name averages 100 characters. Java object overhead is 16 bytes, reference size is 8 bytes, and string overhead is 40 bytes plus 2 bytes per character.

  • Directory Names: 2.4 GB (10 million directories * 240 bytes per name, i.e. 40 bytes of string overhead + 100 characters * 2 bytes)
  • File Names: 21.6 GB (90 million files * 240 bytes per name)
  • Directory Objects: 0.4 GB (10 million directories * 40 bytes per object)
  • File Objects: 2.88 GB (90 million files * 32 bytes per object)

Total Memory: 27.28 GB

If we consider that multiple scans can be running at the same time, it becomes unrealistic to keep this much data in memory.

The intuition was to work at the directory level: we only need to know when a directory is completely explored. If we can manage that, we can simply wait until the root is explored. With a recursive definition of "explored", this means that all the directories contained in the root are themselves explored, and so on down to the leaves. This data structure can be loaded into memory from the database and updated accordingly to track the progress of the scan.
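
As an illustration, here is a minimal in-memory model of that bookkeeping, with invented names; in production the state lives in the database and is updated by Kafka consumers. Each directory counts its not-yet-explored child directories; it becomes explored when its own listing has finished and that count reaches zero, at which point completion cascades toward the root:

```java
import java.util.HashMap;
import java.util.Map;

class ScanProgress {
    private static final class Dir {
        final String parent;      // null for the root
        int pendingChildren = 0;  // child directories not yet fully explored
        boolean listed = false;   // has this directory's own listing finished?
        Dir(String parent) { this.parent = parent; }
    }

    private final Map<String, Dir> dirs = new HashMap<>();

    // Called when the scanner first sees a directory.
    void directoryDiscovered(String path, String parentPath) {
        dirs.put(path, new Dir(parentPath));
        if (parentPath != null) dirs.get(parentPath).pendingChildren++;
    }

    // Called when every entry of this directory has been listed. The event
    // ordering described below guarantees children are discovered first.
    void listingFinished(String path) {
        Dir dir = dirs.get(path);
        dir.listed = true;
        checkExplored(dir);
    }

    private void checkExplored(Dir dir) {
        if (dir.listed && dir.pendingChildren == 0) {
            if (dir.parent == null) {
                System.out.println("root explored: the scan is complete");
            } else {
                Dir parent = dirs.get(dir.parent);
                parent.pendingChildren--;   // this whole subtree is done
                checkExplored(parent);      // completion may cascade upward
            }
        }
    }
}
```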

Additionally, it’s crucial to preserve the order of events for a particular folder in the Kafka topics: the “explored” event must appear only after all the folder’s files have been processed. This is guaranteed by partitioning the events by parent directory, so that all children land in one partition and the explored event appears in that same partition. Without this guarantee, consumers might track the progress of the scan incorrectly, considering a folder explored before all its children are actually discovered.
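
A minimal sketch of that guarantee using the standard Kafka producer API; the topic name and event encoding are made up. Because all three records carry the same key, Kafka routes them to the same partition and preserves their order:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DiscoveryEvents {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String parent = "/grades/Class A";
            // Children first, all keyed by their parent directory...
            producer.send(new ProducerRecord<>("discovery", parent, "discovered:/grades/Class A/exam1.pdf"));
            producer.send(new ProducerRecord<>("discovery", parent, "discovered:/grades/Class A/exam2.pdf"));
            // ...then the marker, which lands after them in the same partition.
            producer.send(new ProducerRecord<>("discovery", parent, "explored:/grades/Class A"));
        }
    }
}
```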

This was the basis for the algorithm we ended up implementing. However, we immediately noticed one problem. We went through this whole design precisely because we wanted a concurrent algorithm, so progress tracking at the directory level was performed in parallel by individual threads. This required Kafka partitions to be aligned with a particular folder; otherwise two threads would step on each other's feet when adding a discovered object to the same folder.

There is one notable exception where two threads end up writing to the same place. Whenever a folder becomes explored, its parent may become explored too (because, from the parent's perspective, that was the last child left to explore), so after checking whether this is the case, the parent needs to be updated. When two folders with the same parent became explored at the same instant, both updated the parent concurrently, and this race condition sometimes resulted in the parent never being marked as explored.

We use Elasticsearch as our main database because of its search speed, which is vital to many of our analytics operations and also keeps our UI search snappy. However, Elasticsearch does not support transactions natively; it does provide an alternative, optimistic locking. By implementing it in this specific case, we could detect the race condition and retry the failed update. This solved our problem, and progress tracking became much more stable.
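
Concretely, Elasticsearch rejects an update whose if_seq_no / if_primary_term no longer match the stored document, which is how the lost update is detected. The sketch below shows the retry loop around a stand-in client; EsClient, DirDoc, and the exception are illustrative placeholders, not a real client API:

```java
class ExploredUpdater {
    record DirDoc(String id, int pendingChildren, long seqNo, long primaryTerm) {}
    static class VersionConflictException extends Exception {}
    interface EsClient {
        DirDoc get(String id); // returns the doc with its current seq_no/primary_term
        void update(DirDoc doc) throws VersionConflictException; // conditional write
    }

    private final EsClient es;
    ExploredUpdater(EsClient es) { this.es = es; }

    // Decrement the parent's pending-children counter; when two children
    // finish at the same instant, one write fails with a conflict and is
    // simply retried against the freshly read document.
    void childExplored(String parentId) {
        while (true) {
            DirDoc parent = es.get(parentId);
            DirDoc updated = new DirDoc(parent.id(), parent.pendingChildren() - 1,
                                        parent.seqNo(), parent.primaryTerm());
            try {
                es.update(updated);
                return; // no concurrent writer interfered
            } catch (VersionConflictException e) {
                // lost the race: reload and retry with the new seq_no
            }
        }
    }
}
```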

Innovating High-Speed Data Discovery with Kafka at Getvisibility

At Getvisibility, we’ve harnessed the power of event sourcing with Kafka to revolutionize the speed of file discovery, seamlessly accommodating a diverse range of data sources. Our cutting-edge concurrent algorithm not only maximizes data throughput but also ensures rapid processing to meet the demands of real-time data acquisition.

We’ve tackled the challenge of data consistency head-on. Our reconciliation feature proactively addresses changes and errors, maintaining up-to-the-minute accuracy with minimal user intervention. This robust approach is further complemented by a smart abstraction for hierarchical file systems, tailored for versatility across various environments.

Our novel design cleverly tracks scan progress at the directory level while employing Elasticsearch’s optimistic locking to mitigate the risks of concurrent processing and race conditions. This strategic integration guarantees the reliability and precision of our scans, enhancing the user experience.

Stay tuned for more insights as we continue to refine our solutions, ensuring that Getvisibility remains a leader in data security and management.

Sr. Software Engineer at Getvisibility
