Advanced document processing with Big Data techniques
Building a professional document management system is crucial for every organization. Such a system usually provides functions like document storage, document classification, access control, and collaboration. Nice, but is it enough? Can we really use the information stored in these files effectively? In this post we will show you how to gather and use valuable information from unstructured documents with Big Data tools and techniques.
Most companies deal with large amounts of unstructured data in various file formats. The most popular types are the different versions of Word, Excel and PDF, but scanned documents and other images are also significant. Processing them in a unified way can be a great challenge because of the diverse file types. The good news is that handling a variety of data is one of the defining characteristics of Big Data (alongside large volume and high velocity), so we have powerful Big Data tools to apply. We can analyze the documents’ metadata, get the content in a unified text format even from scanned documents, or build a “Google-like” internal search engine. To develop a custom Big Data application with these features, we can use plenty of open source software components. Let’s look at a content extractor and an OCR solution in detail.
The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these documents can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more. With Apache Tika we can grab all the metadata and the text-based content from any popular document type.
Tesseract is an Optical Character Recognition (OCR) engine with support for Unicode and the ability to recognize more than 100 languages out of the box. It is free software, released under the Apache License, and its development has been sponsored by Google since 2006. We can use it effectively to extract text content from scanned documents or any other images. Today Tesseract is widely considered one of the most accurate open source OCR engines.
Besides the actual content, these documents contain a lot of metadata too. The most common metadata fields are author, creation date, last modification date, last modifier, creator tool, language, content type, etc. In the case of images we also have metadata about the application that may have modified the original photo, and sometimes even the exact GPS coordinates of the location. If we extract this data and store it in a unified form in a database, we can run advanced search queries on it. On top of that, we can create analytics or visualizations about our documents (for example, the distribution of “Creator tools”, or the number of documents created or modified in a certain period).
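The two analytics mentioned above can be sketched in a few lines of Python. This is only an illustration: the records, the field names (`creator_tool`, `created`) and the helper functions are assumptions made for the example, not the output format of any particular extractor.

```python
from collections import Counter
from datetime import date

# Hypothetical metadata records, as they might look after extraction
# and unification. Field names are assumptions for this sketch.
documents = [
    {"name": "offer.docx", "creator_tool": "Microsoft Word", "created": date(2018, 1, 15)},
    {"name": "budget.xlsx", "creator_tool": "Microsoft Excel", "created": date(2018, 1, 20)},
    {"name": "contract.pdf", "creator_tool": "Microsoft Word", "created": date(2018, 2, 3)},
    {"name": "scan_001.pdf", "creator_tool": "Scanner App", "created": date(2018, 2, 10)},
]


def creator_tool_distribution(docs):
    """Count how many documents were produced by each creator tool."""
    return Counter(doc["creator_tool"] for doc in docs)


def created_between(docs, start, end):
    """Return the names of documents created in the inclusive period."""
    return [doc["name"] for doc in docs if start <= doc["created"] <= end]


tool_counts = creator_tool_distribution(documents)
january_docs = created_between(documents, date(2018, 1, 1), date(2018, 1, 31))
```

In a real system the same questions would typically be asked as queries against the metadata database rather than against an in-memory list.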
Build a search engine
Obviously, the more business documents we store, the harder it is to find the relevant information we are looking for. In this situation a custom internal search engine can be a very useful tool for the whole organization. To build a search engine, we first have to process all the documents we have, extract the content, index it and store it in a special database optimized for quick full-text search queries. For scanned documents we also need to apply Optical Character Recognition (OCR) to convert the scanned image into an interpretable text format. After the initial document processing we have to build an automated data pipeline that continuously processes new or modified documents. During processing we can also define keywords and tag each document with the keywords found in its content.
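To make the indexing and tagging steps more concrete, here is a minimal sketch of an in-memory inverted index with keyword tagging. It assumes the text has already been extracted from the documents (including OCR for scans); the sample documents and function names are hypothetical, and a production system would rely on a dedicated full-text search database instead.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain all query tokens."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


def tag_document(text, keywords):
    """Return the predefined keywords that occur in the document text."""
    tokens = set(tokenize(text))
    return sorted(k for k in keywords if k in tokens)


# Hypothetical, already extracted document contents.
docs = {
    "manual.pdf": "Pump maintenance manual for line 3",
    "invoice.docx": "Invoice for pump spare parts",
    "scan.png": "Maintenance log, scanned",
}
index = build_index(docs)
hits = search(index, "pump maintenance")
tags = tag_document(docs["invoice.docx"], ["pump", "invoice", "maintenance"])
```

The automated pipeline described above would call something like `build_index` incrementally whenever a document is added or changed.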
Interactive maintenance guide
In a factory with many production lines, regular maintenance is a routine task. However, the maintenance manuals are usually not unified, and sometimes it’s hard to find the relevant documentation for a given machine or part. Furthermore, the manuals may be updated regularly, so it’s important to use the appropriate version. To support this task we can build a common interactive maintenance guide for all the working machines in the factory. This guide can provide step-by-step maintenance instructions for every machine and store the previous versions of the documents too. To implement a system like this we have to process all the available maintenance manuals, find the relevant parts in each document and load them into a unified database. With this continuously updated database and a well-designed user interface, maintenance tasks can be carried out more effectively and with fewer mistakes.
Project methodologies for Industry 4.0
We see many factories struggle with implementing successful projects aligned with an Industry 4.0 initiative. Success doesn’t always mean a direct financial return. It can also bring better worker and customer satisfaction, environmental benefits and more, and thus be lucrative in the long term. In any case, there must be a gain from every project, which in turn drives your company along the (hopefully long and fruitful) Industry 4.0 journey.
We have seen several good initiatives literally die because of one very important aspect: the lack of a reasonable use case that whets the appetite for using Big Data solutions in the factory. During the last couple of years, we have gained experience with customers and at international hackathons, resulting in a well-working set of methodologies that help create value at almost any company.
Not sure if there is value in your use case? Do a data pre-evaluation
Many times, companies over-plan the implementation of use cases instead of experimenting with what they have and iterating until they reach the final solution. Considering building an expensive data pipeline for a predictive maintenance system? You had better check the quality of the data and take the first steps with an offline system before building something big.
This is the situation our so-called “data pre-evaluation” methodology was developed for. In only 10 man-days (optimally spread over about 4 weeks, depending on your availability), you will be able to decide whether to invest in a real-time and precisely built version of the same idea.
First, prepare and get to know the data by using standard data mining and analysis tools. In one case, our customer had frequent breakdowns on their machining equipment, and process data was available in CSV files. We cleaned and investigated the data and concluded that it would very likely be possible to build a predictive application by using that data as a training set and calculating the predictions on the live data stream. (Of course, the system can be re-trained at any time after the initial training as well.) In later phases of the idea implementation (see the Proof-of-Concept later in this blog post), the predictive application helped reduce downtime significantly.
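A first data quality check of the kind described above can be as simple as counting missing values per column. The sketch below is an illustration under stated assumptions: the column names and sample rows are invented, and a real pre-evaluation would look at many more aspects (ranges, outliers, sampling gaps).

```python
import csv
import io

# Hypothetical process data; column names and values are assumptions.
SAMPLE_CSV = """timestamp,spindle_current,vibration
2018-03-01T08:00:00,12.1,0.31
2018-03-01T08:01:00,,0.33
2018-03-01T08:02:00,12.4,
2018-03-01T08:03:00,12.3,0.35
"""


def missing_ratio_per_column(csv_text):
    """Return the share of empty cells for every column of a CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    if not rows:
        return {}
    ratios = {}
    for column in reader.fieldnames:
        missing = sum(1 for row in rows if not (row[column] or "").strip())
        ratios[column] = missing / len(rows)
    return ratios


quality = missing_ratio_per_column(SAMPLE_CSV)
```

A column with a high missing ratio is a signal to investigate the data source before any model is trained on it.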
If engineers and decision makers understand how powerful a potential application could be (based on the pre-evaluation), they will likely want to build a Proof-of-Concept, which is usually the next step in the iterative process.
Ready to take the next step? Go for the PoC (Proof-of-Concept)
A PoC is meant to be exactly what its name says: proof that the initial assumption or concept (that the product will meet the customer’s demands, the solution will bring the desired benefits and so on) was right and that the solution satisfies the initial claims.
We strongly believe that a PoC needs to contain at least three use cases (of the low-hanging-fruit type). This increases the chance that at least one of them will generate a financial return within one year, which is again very important for convincing decision makers inside the factory.
Our agile PoC methodology takes about 3 months and has three main phases. In the first iteration (3 weeks) we explore the use case more deeply than in the “data pre-evaluation” methodology, because we need to build the foundation for a future production environment. 3-4 iterations of modelling and development follow this phase, where the data and models are prepared, the data pipeline(s) are built, and these get evaluated.
At the end, we take at least 1 week and several discussions with our customers to understand the business insights that we have gained and evaluate the current and future value of the project. Finally, we report the results to the management and discuss next steps.
Industry 4.0 roadmap – designing strategies
We have been through numerous project implementations with manufacturing industry players. Sooner or later every company realizes that either a complete Industry 4.0 or Smart Factory roadmap needs to be developed, or the existing strategy has to be updated or fine-tuned. We can help you identify key areas in the factory where proven Big Data technology can make production more efficient and profitable, and help design the strategy that will result in a Smart Factory one day.
Sometimes you need to spice up the idea
We have learnt a lot during and from Industry 4.0 hackathons. One of the most important takeaways is that a spectacular demonstration of the use case is necessary to convince decision makers to start the project that will be deployed in a production environment later.
Using the “hackathon method” helped us convince key people to continue and profit from the Industry 4.0 roadmap. For example, the spectacular demonstration of a prototype version of the machining use case was enough to prove that it is worth taking the next step towards the system in production mode.
To sum up, if you feel that you need to start building an extensive solution or application for a specific problem, you had better first run a small evaluation with the data and information that you already have (if you are playful, you can even use the hackathon format). And if you need help, we are glad to support you with our proven methodologies. Get in touch!
Digital Transformation 4.0
Digital Transformation covers how companies can upgrade their operations with technology. People like to use this term for something “revolutionary” and brand-new. Despite this, digital transformation is not a new concept. In fact, we are facing the fourth wave of digital transformation, especially in the industrial sector, where digitalization has a long history. Let's look at it briefly.
Digital Transformation 1.0 (1970s): Initial Digitalization.
The whole story began with the PLCs at the end of the 1960s. PLC stands for “programmable logic controller”, which is an industrial digital computer. It was adapted for the control of manufacturing processes, such as assembly lines, or any activity that requires high-reliability control, ease of programming and process fault diagnosis. A PLC is an example of a real-time system, since output results must be produced in response to input conditions within a limited time, otherwise unintended operation will result.
Richard Morley is generally known as the “Father of the PLC”, and the first batch of PLCs was delivered to General Motors in November 1969.
Digital Transformation 2.0 (1980s): Transformation to Paperless Procedures.
During the initial wave the number of programmed solutions grew steadily. The development of electronic data interchange systems in the 1960s and 1970s paved the way for the second wave of digital transformation.
In this phase computer systems already supported different business activities (e.g. booking, invoicing, ordering, accounting). They enabled the planning, management and coordination of interdependent activities as paperless interactions, even with external partners. But to tell the truth, during the 1980s the paperless office simply meant that all forms of paper documentation should be converted to digital format.
Digital Transformation 3.0 (1990s-2000s): Transformation to Automated Procedures.
The trend of using information technology continued during the middle and late 1990s. In particular, automatic identification and positioning technologies were introduced in the mid-1990s to improve the efficiency and safety of operations. The major change was in data collection methods, e.g. the adoption of new handling technologies equipped with sensors.
Automation of certain processes often required a complete redesign of organizational structures, policies and business process activities, as well as efficient information management (at that time this was called BPR, business process reengineering, as most of us probably remember).
The limitations of static information were still felt, but higher visibility and different forms of decision support based on accurate data became increasingly important for enhancing responsiveness during operations.
Digital Transformation 4.0 (2010s-): Transformation to Smart Procedures.
The basic idea is to integrate different systems and data silos into a central platform that, based on real-time data, enables decision-making and ongoing interaction with the stakeholders who are actively involved in manufacturing activities. Data silos occur for different reasons. The earlier digital transformation waves produced tons of different IT applications, and once a company is large enough, people naturally begin to split into specialized teams in order to streamline work processes and take advantage of particular skill sets.
In the new era, although lots of data is still processed in isolated systems, it is at the same time immediately transferred to a central information system that explores, analyzes and distributes relevant and valuable information over different channels to various targets (humans, systems, machines).
Maybe the most important part of this phase is the “rise of artificial intelligence”. Artificial intelligence helps to get the work done faster and with more accurate results. A central and intelligent information system should facilitate the integration and provide the necessary resources to flexibly deliver the required business agility.
The integrated operation of different but collaborating systems and devices can be realized in a central solution controlled by AI, which can simultaneously handle the heritage of the past, the urgency of the present and the uncertainty of the future, enabling agile mass customization along the whole value chain. From an IT perspective this requires an IoT platform for the integration and for fast and smart processing. That’s what REACH offers. And this is the point where Digital Transformation 4.0 and Industry 4.0 directly meet.
Data assets at a factory
Earlier we emphasized that the biggest winners of Industry 4.0 will be the companies that rapidly find out how to turn their data into real business benefits. In this article we will show you why an integrated IoT platform is the proper solution.
IoT and the “data revolution” are not disrupting manufacturing businesses the way they are disrupting other industries like telco and retail. Manufacturing companies seek the optimization potential in their data and in the intelligence that can be applied to it: making production more efficient and reducing scrap and waste, in other words supporting lean manufacturing.
Think about today’s technological advancements in manufacturing compared to the production of the Ford Model T. Many machines are automated and most of them collect data. And still, fewer than 5% of machines in factories are monitored in real time. This is a huge obstacle for full transparency.
Estimating the value of data and the information contained in it is essential for deciding which Industry 4.0 project to implement. Let’s see the major approaches that can be used to determine the value of data within an organization.
Different approaches to measure the value of data:
- Benefit monetization approaches: The value of data is estimated by defining the benefits of particular data products and then monetizing those benefits. Example: A machine starts producing waste from time to time, with no significant change in its operating state. When this happens, the machine is stopped for 20-30 minutes and the tools get cleaned. If the production of waste could be predicted, the operators could avoid the problem. This would generate savings of 1 hour of machine time and about 50 waste pieces every day. In this example, we use data from the machine to predict unwanted events, avoid them by intervention and create measurable benefits.
- Impact-based approaches: Here, the value is determined by assessing the causal effect of data availability on economic and social outcomes, even within the organization. (Processed) data that makes daily work more effective and helps reduce the frustration of workers also falls into this category. For example, if repetitive, boring work can be automated, workers and analysts can do work with more added value and feel happier.
There are more approaches (e.g. cost-based, market-based, income-based), but the two above are the most applicable to the manufacturing industry.
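The benefit monetization example above can be turned into a back-of-the-envelope calculation. The saved machine hour and the 50 avoided waste pieces per day come from the example; the hourly machine cost, the cost per piece and the number of production days are purely illustrative assumptions that every factory has to replace with its own figures.

```python
def annual_benefit(hours_saved_per_day, pieces_saved_per_day,
                   machine_hour_cost, piece_cost, production_days):
    """Monetize daily savings in machine time and avoided waste pieces."""
    daily = (hours_saved_per_day * machine_hour_cost
             + pieces_saved_per_day * piece_cost)
    return daily * production_days


# 1 hour and 50 pieces per day are taken from the example above.
# The cost rates and the 250 production days are assumed values.
benefit = annual_benefit(
    hours_saved_per_day=1,
    pieces_saved_per_day=50,
    machine_hour_cost=80.0,
    piece_cost=2.0,
    production_days=250,
)
```

With these assumed rates the daily saving is 80 + 100 = 180 currency units, or 45,000 per year, which can then be compared with the cost of implementing the prediction.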
Go for the business benefits
The key question is: will the data provide measurable and tangible business benefits? The critical first step for manufacturers who want to make use of their data to improve yield is to consider how much data the company has at its disposal.
Most companies collect vast troves of process data but typically use them only for tracking purposes, not as a basis for improving operations. For these players, the challenge is to invest in the systems and skill sets that will allow them to optimize their use of existing process information.
The data silo problem
Having the data is not enough. Very often, data in companies remain under the control and use of distinct departments, and the information flow is blocked. In these situations, we talk about data or even information silos.
A data silo is a repository of fixed data that remains under the control of one department and is isolated from the rest of the organization.
Data silos are huge obstacles if a company wants to make operations more visible and transparent. If you notice that data silos have developed in your organization, you may want to look for a solution and build bridges between them. In this case, you will probably end up with an IoT platform.
The solution – an IoT platform
An IoT platform like REACH is basically the nervous system of a factory. It connects different functional units, machines and sensors with humans, it transfers signals in both directions, it stores and analyzes data, and it must be capable of exhibiting intelligence to some extent.
Data must flow from different production phases, machines and departments without friction, and the predictive maintenance algorithms need to watch over the whole procedure in order to alert the right staff early enough to prevent or eliminate the failure. If you want to achieve this functionality, a cross-department, cross-operational IoT platform must be in operation at your company.
This is the precondition of any integrated Machine Learning and Predictive Maintenance solution.
Sometimes, only one missing link between data sources can provide a huge benefit. Imagine that you operate a gluing machine at some point in an assembly process. The adhesion force of the glue varies with time and you don’t know why; there are days when you produce 20% scrap because of insufficient adhesion quality. Even if you analyze the data that is collected from the machine, there is no pattern that would imply a causality between machine data and the final product quality. At some point, you get the idea of joining the machine data with the factory weather station, and you find that gluing quality correlates with air humidity, so you can start solving the problem and reduce scrap significantly.
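The gluing example boils down to two steps: joining two data sources on a common key and checking how strongly the joined values move together. The sketch below shows both with made-up daily values; the scrap rates, humidity readings and date keys are assumptions for illustration, and a strong correlation is only a hint for further investigation, not proof of a cause.

```python
from math import sqrt

# Hypothetical daily scrap rate (%) from the gluing machine.
machine_scrap = {
    "2018-05-01": 4.0, "2018-05-02": 9.0, "2018-05-03": 15.0,
    "2018-05-04": 20.0, "2018-05-05": 6.0,
}
# Hypothetical daily mean air humidity (%) from the weather station.
weather_humidity = {
    "2018-05-01": 40.0, "2018-05-02": 55.0, "2018-05-03": 70.0,
    "2018-05-04": 85.0, "2018-05-05": 45.0, "2018-05-06": 50.0,
}


def join_on_day(left, right):
    """Pair up the values of both sources for days present in both."""
    common_days = sorted(set(left) & set(right))
    return [(left[day], right[day]) for day in common_days]


def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


pairs = join_on_day(machine_scrap, weather_humidity)
correlation = pearson(pairs)
```

The join is the "missing link" here: neither data source on its own reveals the relationship.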
Implementing a profitable use case
Implementing a use case that provides measurable business benefits does not depend solely on the data sources and the quality of the data. The ability to create real value must lie within the organization, and it requires good methodologies.
In our next blog post, we will explain some of these proven techniques that help companies implement successful and profitable use cases and projects.
A key factor for applications dealing with lots of data, including complex event processing, is security. Now that IoT is more popular than ever, one hears more and more stories about security breaches, as simple internet-connected devices are often less well secured and thus more vulnerable to different types of attacks.
REACH uses Fog Computing, which means that none of the data leaves the factory's premises, making an external attack far more difficult. Of course, this doesn't mean that everything is secure: if hundreds, thousands, or even tens of thousands of employees and vendors can access all the data without restrictions, the risk of a disaster is considerable. Just think about what could happen if someone deleted all the data collected over the past years, whether intentionally or not.
Most companies only think about security after the disaster has already happened, such as a leak or the destruction of private data. Most such incidents are avoidable with enough care. Fortunately, there are multiple ways to address these problems, and REACH has these solutions integrated by default.
Kerberos, developed by MIT, plays a key role in authentication: it only lets people (and services) access the data if they can prove their identity. The client authenticates itself to the Kerberos server and receives an encrypted, timestamped ticket-granting ticket (TGT for short). Whenever it wants to access a new service within the TGT’s lifespan, it asks for a separate ticket for that exact service. The services are accessible only with these valid tickets, which also have a limited lifespan, so they become unusable after a short period. All tickets are encrypted with AES-256, which would take an eternity to brute-force even with billions of supercomputers.
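The idea of service-specific tickets with a limited lifespan can be illustrated with a toy model. To be clear, this is not the Kerberos protocol and not any Kerberos API: real Kerberos encrypts tickets with keys shared between the key distribution centre and the services, while this sketch merely signs a payload with an HMAC to show how expiry and service binding are checked. All names and values are hypothetical.

```python
import hashlib
import hmac


def _sign(secret, payload):
    """Sign the payload with a shared secret (toy stand-in for encryption)."""
    return hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def issue_ticket(secret, user, service, now, lifetime_s):
    """Issue a ticket for one user and one service with an expiry time."""
    expires = now + lifetime_s
    payload = f"{user}|{service}|{expires}"
    return {"payload": payload, "signature": _sign(secret, payload)}


def is_valid(secret, ticket, service, now):
    """Accept a ticket only if it is untampered, unexpired and for this service."""
    expected = _sign(secret, ticket["payload"])
    if not hmac.compare_digest(expected, ticket["signature"]):
        return False
    _user, ticket_service, expires = ticket["payload"].split("|")
    return ticket_service == service and now < int(expires)


# Hypothetical values; times are plain integer seconds for readability.
secret = b"demo-only-secret"
ticket = issue_ticket(secret, "alice", "storage", now=1000, lifetime_s=300)

accepted = is_valid(secret, ticket, "storage", now=1100)
expired = is_valid(secret, ticket, "storage", now=1300)
wrong_service = is_valid(secret, ticket, "analytics", now=1100)
```

The short lifetime limits the damage if a ticket is stolen, which is the property the paragraph above describes.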
The next level of security is authorization, where rules specify who can do what. The Lightweight Directory Access Protocol, LDAP for short, is an industry standard for distributed directory access, created at the University of Michigan. Its OpenLDAP implementation is fully open source and integrates well with Kerberos, which makes the two a good fit for security. It holds all the information about the users and services, and tells which user has permission to access a specific resource.
However, one piece is still missing: what if the fog devices are communicating with each other? They still have to send data across the local network to collaborate, and someone could sniff those packets. The solution is to use Transport Layer Security (TLS), which is a cryptographic protocol. It encrypts the data sent over the network, so only the intended recipient can read the messages.
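As an illustration of what "using TLS" means on the client side, the sketch below builds a TLS context with Python's standard `ssl` module. It is a generic example, not the REACH configuration: the helper name and the choice of TLS 1.2 as minimum version are assumptions, and `ca_file` would point to the factory's own certificate authority in a closed network.

```python
import ssl


def make_client_context(ca_file=None):
    """Create a client-side TLS context that verifies the server certificate.

    ca_file is an optional path to a CA bundle (assumption: an internal CA
    is used inside the factory network). With None, system defaults apply.
    """
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    # Refuse protocol versions older than TLS 1.2 (an assumed policy).
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


client_context = make_client_context()
```

A connection would then be wrapped with `client_context.wrap_socket(sock, server_hostname=...)`, so that the peer's certificate and host name are checked before any data is exchanged.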
Remember, no matter how tall, spiky and strong your fence is along 95% of your territory’s perimeter, it is only as strong as its weakest part. None of the above technologies would be enough alone, but together they form an all-round security layer to protect your valuable data.
The role of Industrial IoT in maintenance and manufacturing optimization
Maintenance is a task that is carried out in factories on a daily basis to keep machines healthy and the whole manufacturing process efficient. The main goal is to do maintenance before a particular machine starts producing waste or even suffers complete failure. It is easy to show that preventing machines from breaking down means lower operating costs and helps keep production smooth. Still, many factories have a hard time dealing with downtime due to asset failure.
Standard maintenance procedures – Preventive Maintenance (PM)
The purpose of regular care and service done by maintenance personnel is to make sure that the equipment remains productive, without any major breakdowns. For this purpose, maintenance periods are specified conservatively, usually based on data measured by the equipment manufacturer or at the beginning of operation. However, these procedures do not account for the actual condition of the machine, which results from different environmental effects (like ambient temperature and air humidity), raw material quality, load profiles and more.
In order to take the ever-changing operating conditions into account, condition-relevant data needs to be collected and processed. This is condition-based maintenance (CBM). In the case of machining, it is essential to measure ambient parameters, machine vibration, sound and motor current, which give a picture of the actual health state of the machine and the machining tools. The availability of this data enables the step towards a more sophisticated maintenance mode: predictive maintenance.
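In its simplest form, condition-based maintenance compares current measurements with limits. The following sketch shows that idea; the signal names and limit values are invented for illustration and would have to come from the machine documentation or from experience with the specific equipment.

```python
# Assumed limits for illustration only; real limits are machine-specific.
LIMITS = {
    "vibration_mm_s": 4.5,
    "motor_current_a": 18.0,
    "ambient_temp_c": 40.0,
}


def exceeded_limits(reading, limits=LIMITS):
    """Return the names of all signals in the reading that exceed their limit."""
    return sorted(
        name for name, limit in limits.items()
        if reading.get(name, 0.0) > limit
    )


def needs_maintenance(reading, limits=LIMITS):
    """A machine needs attention as soon as any signal exceeds its limit."""
    return bool(exceeded_limits(reading, limits))


healthy = {"vibration_mm_s": 2.1, "motor_current_a": 12.0, "ambient_temp_c": 24.0}
worn = {"vibration_mm_s": 5.2, "motor_current_a": 19.5, "ambient_temp_c": 24.0}

healthy_flags = exceeded_limits(healthy)
worn_flags = exceeded_limits(worn)
```

Such fixed limits react to the present condition only; predicting a failure before a limit is reached requires the pattern-based approach discussed next.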
Bring maintenance to the next level: Predictive maintenance (PdM)
Maintenance work that is based on prediction presumes that the following requirements are fulfilled. First of all, the data collected from the asset must contain the information showing the signs of an upcoming event. In other words, patterns precisely describing each event can be identified in the measurement signals. If this hypothesis holds true, the next step is either to hard-code the conditions indicating an oncoming failure, or to use a Machine Learning algorithm to identify and literally learn the particular failure mode patterns.
The second requirement comes into picture as soon as the patterns have been identified and the model is capable of predicting the unwanted event soon enough to take action. This requirement addresses the architecture of the system making the prediction: it needs to operate in real time. The reason is that there are many applications where damage can be predicted only shortly before the event (usually measured in minutes or seconds). Advanced IIoT systems feature real-time operation.
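The real-time requirement means that each new measurement has to be evaluated as it arrives, not in a nightly batch. A minimal sketch of such a streaming check is shown below, using a hard-coded condition on a rolling mean. The window size, threshold and sample values are assumptions, and a learned model could take the place of the simple rule.

```python
from collections import deque


class RollingMeanDetector:
    """Raise an alert when the mean of the last N values exceeds a threshold."""

    def __init__(self, window_size, threshold):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def update(self, value):
        """Consume one new measurement and return True if an alert is due."""
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data yet for a stable decision
        return sum(self.window) / len(self.window) > self.threshold


# Hypothetical vibration readings arriving one by one.
detector = RollingMeanDetector(window_size=3, threshold=5.0)
stream = [1.0, 1.2, 1.1, 4.0, 7.5, 8.0, 8.2]
alerts = [detector.update(value) for value in stream]
```

Because the detector keeps only a small window in memory, it can run close to the machine and react within the processing time of a single measurement.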
And last but not least, in manufacturing situations where multiple machines and robots are involved (a group of machines together having an impact on the product quality or operating efficiency of subsequent machines), events and data are very complex. Processing this data requires a system that has high computing performance and the tools to handle the complexity.
The complex data processing and predictive capability of REACH is built on advanced Big Data technologies, built-in Machine Learning algorithms and the real-time Fog Computing architecture. The system is capable of learning and distinguishing between failure modes and of sending alerts in case of an expected breakdown.
Although Predictive maintenance is a big step for most manufacturers, there is still a next level to go for.
State-of-the-art: Prescriptive maintenance (RxM)
Prescriptive maintenance requires even more detailed data and a checklist of the actions to take when a failure mode pattern is detected. Although technology makes implementing prescriptive systems entirely possible, only a few organizations make it to this point. This step requires very good harmonization between the maintenance and production departments, a fair understanding of the problem and efficient cross-department information sharing. These are the key criteria of a successful Industrial IoT implementation anyway.
Manufacturing companies that make the effort to collect data, analyze the problem, identify patterns of inadequate operation, and understand and prepare their data can reach the level of a Smart Factory regarding maintenance operations, too. In the presented case, not only is the checklist displayed on the REACH UI, but emails and SMS messages can also be sent to maintenance personnel and other relevant stakeholders. This minimizes the time needed to prepare for the required actions.
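Conceptually, the prescriptive step maps a detected failure mode to a checklist and to the people who need to know. The sketch below illustrates that mapping only; the failure modes, checklist steps, recipients and function name are hypothetical and do not describe the REACH implementation or its notification channels.

```python
# Hypothetical failure modes and checklists for illustration.
CHECKLISTS = {
    "tool_wear": [
        "Stop the spindle",
        "Replace the cutting tool",
        "Run a test piece and measure it",
    ],
    "bearing_overheat": [
        "Reduce the load",
        "Check lubrication",
    ],
}


def prescribe(failure_mode, recipients):
    """Return the checklist and the notifications for a detected failure mode."""
    steps = CHECKLISTS.get(failure_mode)
    if steps is None:
        return None  # unknown pattern: nothing can be prescribed
    message = f"Detected '{failure_mode}': {len(steps)} steps to perform"
    return {
        "checklist": steps,
        "notifications": [(recipient, message) for recipient in recipients],
    }


action = prescribe("tool_wear", ["maintenance@example.com", "shift-lead@example.com"])
unknown = prescribe("unknown_mode", ["maintenance@example.com"])
```

The checklist content itself is where the cooperation between maintenance and production departments becomes visible, since both have to agree on the steps.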
Besides email and SMS alerting, REACH can also send status messages to engineers and other personnel via our chatbot called RITA. Using state-of-the-art technology needs to be fun, too!
In our earlier posts we talked about how to store, process and analyze data, but we missed a crucial step: how to collect it. An outstanding challenge for the IoT lies in connecting sensors, devices and endpoints in a cost-effective and secure way in order to capture, analyze and gain insights from massive amounts of data. The IoT gateway is the key element in this process, and below we describe why.
The definition of an IoT gateway has changed over time as the market developed. Just like traditional gateways in networks do, IoT gateways function like bridges – and they bridge a lot, positioned between edge systems and our REACH solution.
IoT Gateway market
IoT gateways fulfil several roles in IoT projects. They are built on chipsets that feature low-power connectivity and may be ruggedized for harsh conditions. Some gateways also focus on fog computing applications, in which customers need critical data so that machines can make split-second decisions. Based on this, IoT gateway vendors can be divided into three groups: vendors that provide only hardware (Dell), companies that focus on software and analytics (Kura, Kepware), and end-to-end providers (Eurotech).
Our IoT Gateway solution belongs to the software and analytics group. It is an OPC client that communicates with an OPC server. OPC is a software interface standard that allows secure and reliable exchange of data with industrial hardware devices.
What are IoT Gateways?
Gateways are emerging as a key element in bringing legacy and next-gen devices to the Internet of Things (IoT). Modern IoT gateways also play an increasingly important role in providing analytics, so that only the most important information and alerts are sent up to REACH to be acted upon. They integrate networking protocols, help manage storage and analytics on the data, and facilitate secure data flow between edge devices and REACH.
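One common way for a gateway to forward "only the most important information" is a deadband filter, which passes a value on only when it differs enough from the last forwarded value. The sketch below is a generic illustration of this technique with assumed values; it is not a description of how our gateway is implemented.

```python
def deadband_filter(readings, deadband):
    """Forward a reading only if it moved at least `deadband` since the last one sent."""
    forwarded = []
    last_sent = None
    for value in readings:
        if last_sent is None or abs(value - last_sent) >= deadband:
            forwarded.append(value)
            last_sent = value
    return forwarded


# Hypothetical temperature readings from one sensor.
readings = [20.0, 20.1, 20.2, 21.5, 21.6, 19.9]
forwarded = deadband_filter(readings, deadband=1.0)
```

In this example only three of the six readings leave the gateway, which reduces network load while still reporting every significant change.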
As with many technologies, there is an increasing movement towards the fog, mainly in Industrial IoT.
Intelligent IoT gateways
With fog computing (and the movement to the edge overall) we really enter the space of what is now known as an intelligent IoT gateway. In the initial, simpler picture an IoT gateway sat between the sensors and devices on one hand and the cloud on the other. Now, a lot of analytics and filtering of information is increasingly done closer to the sensors through fog nodes, for myriad possible reasons explained in our article on fog computing. The illustration below shows where the intelligent IoT gateway (and soon they’ll all be intelligent) sits in an IoT architecture.
(img source: https://www.postscapes.com/iot-gateways/ )
Machine learning is the process by which we make software algorithms learn from large amounts of data. The term was first used by Arthur L. Samuel, who described it as the “programming of a digital computer to behave in a way which, if done by human beings or animals, would be described as involving the process of learning”. ML is an alternative way to build AI: it uses statistics to find patterns in data rather than explicitly hard-coded routines with millions of lines of code. There is a group of algorithms that allows us to build applications that receive input data and predict an output depending on that input. ML can also be understood as a process in which you “show” tons of data (text, pictures, sensor data) to the machine together with the required output, which is the training part, and then you “show” it a new picture without the required output and ask the machine to guess the result.
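The "show examples with the required output, then ask for a guess" idea can be demonstrated with one of the simplest possible learners, a nearest-centroid classifier. The sensor values and labels below are invented for illustration, and real projects would use an established library and far more data.

```python
from math import dist  # available from Python 3.8


def train(samples, labels):
    """Learn one centroid (mean feature vector) per label."""
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(model, features):
    """Guess the label whose centroid is closest to the new sample."""
    return min(model, key=lambda label: dist(model[label], features))


# Hypothetical training data: (vibration, temperature) with known outcome.
samples = [(0.2, 30.0), (0.3, 32.0), (1.5, 60.0), (1.7, 65.0)]
labels = ["ok", "ok", "failure", "failure"]

model = train(samples, labels)
guess_high = predict(model, (1.4, 58.0))
guess_low = predict(model, (0.25, 33.0))
```

Nothing in `predict` is hard-coded for vibration or temperature: the behaviour comes entirely from the examples seen during training.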
Machine learning has grown into a very powerful tool for various problems in different areas, for example text and speech processing (categorizing documents, speech recognition, chatbots) or image processing, where we train the algorithm with hundreds of thousands of tagged pictures so that it can recognize persons, objects, etc.
However, from our point of view, the more important use cases are those where factory machines and processes are involved. In this case we have to collect, sort and store many different sensor readings from different production robots to train our machine learning solutions for different purposes, like predicting the failures of these machines.
Before training, different tasks have to be done, like data preparation - which contains for example filling or throwing empty cells out, standardize data, sort the important features etc. - training - involves feeding the cleaned data, to the algorithm, to adjust itself - and finally, we have to be able to measure and improve our solution with new data, and parameters.
For these purposes a data pipeline should be built, using the same processes to a newly arrived data as we have done in the training state. With REACH, we are able to do all of these tasks - data preparation, training and building the data pipeline - easily, and in a user friendly way through the UI, for useful solutions which will result in downtime and cost reduction, too.
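The key point of such a pipeline is that the preparation parameters are learned once, on the training data, and then reused unchanged for new data. The following Python sketch shows this for two of the steps mentioned above, dropping incomplete rows and standardizing. It is a simplified illustration under assumed names (`drop_incomplete`, `fit_scaler`, `apply_scaler`) and hypothetical sensor values, not the REACH implementation.

```python
from statistics import mean, pstdev

def drop_incomplete(rows):
    """Discard rows that contain missing (None) values."""
    return [row for row in rows if None not in row]

def fit_scaler(rows):
    """Learn per-column mean and standard deviation from the training rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def apply_scaler(rows, scaler):
    """Standardise rows with the parameters learned during training."""
    means, stds = scaler
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical sensor rows: (temperature, pressure); None marks an empty cell.
training_rows = [(20.0, 1.0), (30.0, None), (40.0, 3.0)]
clean_rows = drop_incomplete(training_rows)
scaler = fit_scaler(clean_rows)
train_scaled = apply_scaler(clean_rows, scaler)

# Newly arrived data goes through the same steps with the same parameters.
new_scaled = apply_scaler(drop_incomplete([(30.0, 2.0)]), scaler)
print(train_scaled, new_scaled)
```

If the scaler were refitted on each new batch, the model would see inputs on a different scale from the one it was trained on, which is why the sketch keeps `scaler` as a separate, reusable object.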
Machine Learning within REACH
We provide solutions for many of the problems that occur during the lifetime of a machine learning project, with different tools for different roles. For example, we provide a simple graphical UI with pre-configured models, which helps you focus only on the data and the patterns behind it.
Of course, with this approach you can also tune the model parameters and compare the results to find out which parameters are more suitable for your application. REACH is also well suited to developers who want to build their own solution: with an embedded Jupyter notebook, users can build any model with different technologies, such as scikit-learn, Spark MLlib, TensorFlow, etc. These models can be deployed to a single machine or distributed across the cluster for the best performance and better scalability.
Towards the future
Machine learning is a pervasive term today, but many people use it in the wrong way or mix it up with AI or deep learning, which are the topics of our following blog posts. With this introduction you can get a little insight into this technology and a general picture of how to build applications that improve their performance without any human interaction, by analysing data and using feedback on their performance.
As we discussed in our previous blog post, a data lake and Big Data are required technologies to cover all the needs of Industry 4.0. But what about my classic data? Should I transfer it to a data lake? Do I have to redesign all my applications and processes to use the new storage layer?
The answer is definitely not. You do not need to throw out your classic databases and systems, or the processes that use them. Our terminology and best practices say that a classical database engine can live in smooth symbiosis with a modern Big Data data lake. It only takes a well-designed architecture, which is not easy to create. That’s why our experts designed REACH to handle such situations with what we call a hybrid architecture, where all data goes to its proper place: some data should be stored in the data lake, and some should go to the classical storage layer.
The question is where to combine these datasets. REACH is designed to combine these different data sources at both the process and the analytical level, so at the end of an analysis you cannot distinguish whether a piece of data came from the classical layer or from the Big Data data lake.
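As a rough illustration of combining the two layers at the analytical level, the Python sketch below joins relational master data with raw events. Everything in it is an assumption made for the example: an in-memory SQLite table stands in for the classical layer, a list of raw JSON strings stands in for the data lake, and the table, field and function names are hypothetical. It does not show how REACH performs the combination.

```python
import json
import sqlite3

# Classical layer: a relational table of machine master data (in-memory here).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE machines (id TEXT PRIMARY KEY, line TEXT)")
db.executemany(
    "INSERT INTO machines VALUES (?, ?)",
    [("m1", "welding"), ("m2", "painting")],
)

# Data lake layer: raw JSON sensor events kept in their original format.
raw_events = [
    '{"machine": "m1", "temp": 71.5}',
    '{"machine": "m2", "temp": 35.0}',
    '{"machine": "m1", "temp": 73.5}',
]

def combine(db, raw_events):
    """Join raw lake events with relational master data on the machine id."""
    lines = dict(db.execute("SELECT id, line FROM machines"))
    combined = []
    for raw in raw_events:
        event = json.loads(raw)
        combined.append({**event, "line": lines.get(event["machine"])})
    return combined

combined = combine(db, raw_events)
for row in combined:
    print(row)
```

Each resulting row carries fields from both layers, and nothing in the row itself reveals which layer a field came from.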
We believe in the proper storage principle: all data must go to its proper storage layer, and no single kind of storage should be used for all your data. Taking this further, our experts designed REACH to handle multiple storage layers. It is not only a matter of deciding whether data goes to the Big Data layer or to the classical layer; you should also use different storage techniques inside the data lake and inside the classical storage layer. A good example is a data lake where Kudu, HBase and HDFS live next to each other, alongside a classical layer where relational database storage is mixed with standard file storage.
That is why we cannot say that one database engine is good for everything. REACH is designed to support this multi-storage approach to get the maximum out of your data.
In the world of Big Data, traditional data warehouses are no longer sufficient to support the requirements of Industry 4.0 or to serve as the foundation of truly real-time solutions. In contrast to the structured storage concept of traditional data warehouses, a Data Lake keeps the original format and state of the data and provides real-time access to it. The greatest advantage of a Data Lake is that it can store a tremendous amount of data while preserving its raw format in a distributed, scalable storage system. It is therefore possible to store data coming from various sources in a way that remains adaptable to future requirements, resulting in a flexibility that current data warehouses cannot provide.
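Keeping data in its raw format and interpreting it only when it is read is often called schema-on-read. The short Python sketch below illustrates the idea with a plain list standing in for the distributed store. The function names, source names and payloads are hypothetical and chosen only for the example.

```python
import csv
import io
import json

lake = []  # stand-in for a distributed store such as HDFS

def ingest(source, fmt, payload):
    """Store the payload untouched, with just enough metadata to read it later."""
    lake.append({"source": source, "format": fmt, "payload": payload})

def read_records(fmt_parsers):
    """Parse raw payloads only at read time (schema-on-read)."""
    for entry in lake:
        parser = fmt_parsers[entry["format"]]
        yield entry["source"], parser(entry["payload"])

def parse_csv(payload):
    """Parse a single CSV line into a list of string fields."""
    return next(csv.reader(io.StringIO(payload)))

# Two hypothetical sources delivering data in different formats.
ingest("plc-7", "json", '{"temp": 71.5}')
ingest("scale-2", "csv", "2024-01-01,12.4")

records = list(read_records({"json": json.loads, "csv": parse_csv}))
print(records)
```

Because ingestion never transforms the payload, a new parser can be added later for a future requirement without re-ingesting anything.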
What can a Data Lake offer?
The concept of a Data Lake enables factories to fulfill the requirements of Industry 4.0: data generated during production becomes accessible to other participants in the production line in the swiftest, smoothest way with the help of the Complex Event Processing method (introduced in our previous blog post). Because the data is stored in its original raw format, no data transformation slows down this process. For this reason, REACH puts the Data Lake at the heart of its architecture, so that we can contribute to the competitiveness of our partners. Without modernizing the storage process, real-time analysis and automation are not possible. Companies that seek to use machine learning methods need a wide range of data sources to provide a sufficient amount of data for the algorithms.
Cost is also an important element: in the case of a Hadoop-based Data Lake that uses well-known big data techniques, storage costs are minimal compared to a standard data warehouse solution, because Hadoop consists of open source technologies. Furthermore, its hardware requirements are also lower thanks to its distributed setup, so it can be built even on commodity hardware.
Should data warehouses be replaced?
One of the main aspects in the design of the REACH architecture was integrability, so it offers interfaces to connect to various data sources and applications. See our upcoming blog post on hybrid architectures!