The Rise of Micro-Data Lakes at the Edge



Author: Derek Mak
Date: March 14, 2020


In my last blog, I talked about the notion of keeping ALL the data at the edge. Data retention raises many issues, such as privacy, legal, and compliance requirements. The focus of this blog is on IoT data and a call to action for businesses to keep as much data as they can afford. IoT data, the data coming from sensors, machines, cameras, and vehicles, or what I will refer to as “edge data”, is especially important for businesses in the age of all things digital. Some say data is the new oil, or data is the new currency, which is partly true. A more appropriate phrase, in my view, is “data is the new crude oil”. This is a better analogy, as data coming out of IoT devices must be processed, that is, cleansed, normalized, and transformed, to be useful. This is similar to a refinery process, which produces usable products. Obviously, the more data you keep, the better equipped your business is to understand it – the point of IoT, as mentioned in my last blog.


There are three issues to address when it comes to keeping data: 1) do you have the right to keep the data? Would you trigger any privacy, legal, or compliance issues? 2) technically, one can always establish a connection to the data source, but can you make it repeatable and scalable? Enterprise IT has been doing this for decades, so how is IoT data different? 3) while the crude oil analogy is good, curated data is not exactly like oil. Gasoline is gasoline; it can be used in any vehicle. Data, however, depends on the context of the application. A recent conversation with Dr. Timothy Chou at Stanford University caused me to ponder these issues.


Let me take a step back to explain why businesses should focus on IoT data in the first place. There is no doubt that all businesses must become technology-savvy in the 21st century. There is no exception. Your ability to leverage advanced technologies to reshape or reimagine your core business separates you from your competition. Consider Nike in sports apparel, Starbucks in coffee, Netflix in entertainment, Amazon in retail, and Tesla in automotive. There are plenty of examples. Using technology to redefine the core business requires an understanding of three things: infrastructure, applications, and data.

Technology infrastructure such as computing servers, cloud, networks, and storage is all plumbing. You need to have it. However, this is the layer your customers don’t see and frankly don’t care about, as long as it works. The application is what customers (employees, partners, and suppliers) see and interact with. This is the layer that defines your brand. However, the application is not useful without the most critical part of the trio – data! Data is the fuel for applications. Your accounting application would be useless without your sales transaction data. Your CRM application would have no meaning without your customer data. Equally, your production analytics application, for example, would have no insights without data from your production machines and other IoT assets. In other words, the first step to becoming technology-savvy is a deep dive into how you use your data. In the consumer market, take the FANG group (Facebook, Amazon, Netflix, Google) as an example. You don’t need to be a data scientist to understand the impact of data on their business. They are worth hundreds of billions!


Coming back to the three issues I raised earlier: 1) do you have the right to keep the data? It depends. Take machine data, for example: Fanuc will claim that it owns all the data related to the machine’s health, and GE would say the same thing. So good luck trying to get their data for predictive maintenance analytics. However, if you are a manufacturer using Fanuc’s robotic arms to produce products, you would say that you own all the data related to the production of your products: when the machine was turned on, how long it took to complete the production cycle, whether the machine broke down during production, and so on. As you can imagine, there are grey areas, and it gets complicated quickly. What is the solution? A neutral party, a middleman who is impartial to the data. A middleman who can bring the data together legally and securely to solve problems that can only be solved with data from diverse sources. That middleman is called a Data Exchange. A close example of a data exchange in the industrial sectors is OSIsoft’s PI System. It is particularly effective and still the leader in the world of OT. I said PI is a “close example” because it is not designed to be a data exchange platform. A data exchange is a multi-party secure data sharing platform (e.g., for sharing data with trading partners). It also offers a way for data owners to monetize their data. That is neither the intent nor the direction of PI. Data Exchange is an emerging category, and here are some examples (I will offer my views on these approaches in my next blog):


https://terbine.com/

https://www.dawex.com/en/

https://www.snowflake.com/data-exchange/

https://aws.amazon.com/data-exchange/


But to get to the IoT data, you still need to connect to the data sources to extract data, which leads to issue 2: can you connect to assets at scale? Enterprise companies such as Informatica have been providing solutions to extract data from multiple sources for decades. In that world, the process is commonly known as Extract-Transform-Load:

https://www.informatica.com/services-and-training/glossary-of-terms/extract-transform-load-definition.html#fbid=2STq9aptI6X


Here is another good article on the subject: https://www.bmc.com/blogs/is-etl-extract-transform-load-still-relevant/


How is IoT data different? There are some key differences; the top three that come to mind are a) it is always real-time; b) it is continuously streaming; and c) it has no standard format. Because of these key differences, practitioners are suggesting a shift from ETL to ELT (Extract-Load-Transform): land the raw data first, then transform it when an application needs it. Nonetheless, the point here is that there are enough use cases in this space to indicate that it is technically feasible to connect to and extract data from IoT assets in a scalable way.
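To make the ETL-to-ELT shift concrete, here is a minimal sketch in Python. It is purely illustrative and not from any particular product: the table name, field names, and the in-memory SQLite store are all my assumptions. The key idea is that raw payloads are landed verbatim (Extract-Load), and the schema-aware transformation happens later, only when an application asks for it.

```python
import json
import sqlite3

def load_raw(conn, payloads):
    """Extract-Load: store each payload verbatim, imposing no schema up front."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_events (ts REAL, payload TEXT)")
    conn.executemany(
        "INSERT INTO raw_events VALUES (?, ?)",
        [(p["ts"], json.dumps(p)) for p in payloads],
    )

def transform_avg_temp(conn):
    """Transform later, per application: average temperature across readings."""
    rows = conn.execute("SELECT payload FROM raw_events").fetchall()
    events = [json.loads(r[0]) for r in rows]
    temps = [e["temp_c"] for e in events if "temp_c" in e]
    return sum(temps) / len(temps) if temps else None

conn = sqlite3.connect(":memory:")
load_raw(conn, [
    {"ts": 1.0, "temp_c": 20.0},
    {"ts": 2.0, "temp_c": 22.0},
    {"ts": 3.0, "vibration": 0.4},  # a different sensor, no temp field
])
print(transform_avg_temp(conn))  # 21.0
```

Notice that the vibration reading, which has no standard format in common with the temperature readings, still lands in the store untouched. Under ETL it would have been forced through a fixed schema (or dropped) at ingestion time.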


What remains, then, is the question of what data to keep. Let me repeat: “you cannot discover insights with data you don’t keep” (from my last blog). The very notion of ELT suggests that in the world of IoT, you must keep the data you know you need today and the data you might need tomorrow. Keeping ALL the data also helps address the third issue I raised, that data depends on the context of the application. Not only do you need the ELT capability for IoT data, but you also need the ability to “transform” the raw data based on the application context. Machine Learning/Artificial Intelligence (ML/AI) and deep learning are all predicated on having data, lots of data. A recent update of Tesla’s software stated that it uses deep learning neural networks to study one million raindrop images on windshields to improve its smart wiper function. The same set of raw data, however, will need to be conditioned in a different way to automatically adjust the intensity of the vehicle’s headlights in various rain conditions.
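The point that the same raw data must be conditioned differently per application context can be sketched in a few lines. This is a toy illustration, not Tesla’s method: the field names (`rain_rate`, `ambient_lux`) and thresholds are invented for the example. The same list of raw readings feeds two applications, each of which transforms it against its own context.

```python
# One shared pool of raw readings kept at the edge (illustrative values).
raw_readings = [
    {"rain_rate": 0.8, "ambient_lux": 120},
    {"rain_rate": 0.2, "ambient_lux": 900},
    {"rain_rate": 0.9, "ambient_lux": 60},
]

def wiper_context(readings):
    """Context A: condition the raw data on rain rate to pick a wiper speed."""
    avg = sum(r["rain_rate"] for r in readings) / len(readings)
    return "fast" if avg > 0.5 else "slow"

def headlight_context(readings):
    """Context B: condition the same raw data on ambient light instead."""
    avg = sum(r["ambient_lux"] for r in readings) / len(readings)
    return "high_beam" if avg < 200 else "normal"

print(wiper_context(raw_readings))      # fast
print(headlight_context(raw_readings))  # normal
```

Had the readings been transformed once at ingestion, say, reduced to a single rain-rate average, the headlight application would have nothing left to work with. Keeping the raw data is what keeps both contexts possible.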


Naturally, cost is the next issue that comes to mind: keeping and forwarding all the data to the cloud for analytics would be cost prohibitive. The answer is now starting to emerge with the latest developments in edge computing. There will be ample computing power and storage capacity available in the coming edge infrastructure, so there is no longer a trade-off. Taking this thought a step further, new possibilities arise by connecting these data sources at the edge and creating a data lake at the edge. This would enable a whole new category of edge-centric applications never possible before. Yes, I am advocating the creation of “micro-data lakes” at the edge! Why? It eliminates the trade-off decision you must make about what data to keep and what data (and when) to forward to the cloud. Micro-data lakes can support independent operations in the event of a communications disruption. Data aggregation to the corporate data lake can still occur during low-cost hours or through other alternative methods. With the coming of 5G, high-volume data communications will enter a new era of low cost. Remote, mobile, and branch operations are screaming for this setup to digitize their business. I just Googled the term “micro-data lake”: no results. The date is December 21, 2019. I am hoping to coin a new term, at the risk of the data architect community disagreeing with me for not using terms like “data pond” or “data swamp”. 🙂
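Here is a minimal store-and-forward sketch of what such a micro-data lake could look like. Everything in it is an assumption for illustration: the class name, the in-memory list standing in for local storage, and the low-cost window of 1 a.m. to 4 a.m. The point it demonstrates is the architecture above: every reading always lands locally (so operations survive a communications disruption), and aggregation to the corporate lake happens only during the cheap hours.

```python
class MicroDataLake:
    """Illustrative edge store that keeps all readings and forwards in batches."""

    def __init__(self, low_cost_hours=range(1, 5)):
        self.store = []               # local retention: keep ALL readings
        self.forwarded = 0            # index of the first not-yet-forwarded reading
        self.low_cost_hours = low_cost_hours

    def ingest(self, reading):
        self.store.append(reading)    # always lands locally, even when offline

    def maybe_forward(self, hour, uplink):
        """Aggregate pending readings to the corporate lake, low-cost hours only."""
        if hour not in self.low_cost_hours:
            return 0
        pending = self.store[self.forwarded:]
        uplink(pending)
        self.forwarded = len(self.store)
        return len(pending)

lake = MicroDataLake()
for i in range(5):
    lake.ingest({"seq": i})

sent = []
print(lake.maybe_forward(hour=14, uplink=sent.extend))  # 0 (peak hours, hold data)
print(lake.maybe_forward(hour=2, uplink=sent.extend))   # 5 (cheap window, forward)
```

Nothing is ever discarded to save bandwidth; forwarding is merely deferred, which is exactly the trade-off elimination argued for above.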


In conclusion, keeping all the IoT data is crucial not only for what you need to know today but, more importantly, for what you don’t know you need to know for tomorrow. However, keeping the data is a complex matter, and businesses will have to overcome three challenges: data ownership, continuous data ingestion at scale, and contextualization of data. All of these challenges can be addressed with the emerging edge computing infrastructure and data architecture.
