Data deduplication

Data deduplication is a process that eliminates redundant copies of information, reducing the cost of storing it. With this technology, you can optimize the capacity of almost any storage system.

Regardless of the method, deduplication stores only one copy of each unique piece of information on the media. One of the most important design choices is therefore the level of granularity at which duplicates are detected.

Data deduplication operates at several levels of granularity:

  1. blocks;
  2. files;
  3. bytes.

Each method has its strengths and weaknesses. Let’s look at them in more detail.

Data deduplication methods

Block level

This is the most widely used deduplication method. A file is split into blocks, and only the unique blocks are written to storage; repeated blocks are replaced with references to the copy that is already saved.

In this case, a block is a single logical unit of data with a characteristic size, which may vary. In block-level deduplication, all data is identified by hashing (for example, with SHA-1 or MD5).

The hash algorithm produces a signature (identifier) for each unique block of data, and this signature is stored in the deduplication database.

As a result, if a file changes over time, only its modified blocks are written to storage, not the entire file.

There are two types of block deduplication: with variable and with fixed block lengths. The first option splits files into blocks that may each have a different size.

This option is usually more effective at reducing the amount of stored data than fixed-length deduplication, because block boundaries can shift to follow the content.
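The fixed-length variant can be sketched in a few lines of Python. This is a toy illustration, not production code: SHA-256 is used here rather than the older SHA-1/MD5 mentioned above, the 4 KiB block size is an arbitrary choice, and the in-memory `store` dictionary stands in for the deduplication database.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block length in bytes

def dedup_blocks(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks; store each unique block under
    its SHA-256 digest and return the list of digests (the 'recipe'
    needed to reassemble the original data)."""
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:      # save only unique blocks
            store[digest] = block
        recipe.append(digest)
    return recipe

def restore(recipe: list, store: dict) -> bytes:
    """Reassemble the original data from the stored blocks."""
    return b"".join(store[d] for d in recipe)

store = {}
original = b"A" * 4096 + b"B" * 4096 + b"A" * 4096  # two identical blocks
recipe = dedup_blocks(original, store)
assert restore(recipe, store) == original
print(len(store))  # → 2: only 2 unique blocks stored for 3 logical blocks
```

If the first block later changes, only its new version is added to `store`; the other blocks keep pointing at the data already saved.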

File level

This deduplication method compares a new file with those already saved. If the file is unique, it is stored; if it is not new, only a link (a pointer to the existing file) is saved.

That is, with this type of deduplication, only one version of the file is recorded, and all future copies of it will receive a pointer to the original file. The main advantage of this method is the ease of implementation without serious performance degradation.

Byte level

In principle, it is similar to the block-level method, but instead of blocks, a byte-by-byte comparison of the old and new versions of files is used. This is the only approach that guarantees maximum elimination of duplicate data.

However, byte-level deduplication also has a significant disadvantage: it places much higher demands on the hardware, so the machine running the process must be extremely powerful.
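A naive illustration of byte-by-byte comparison is sketched below. The `byte_delta` and `apply_delta` helpers are hypothetical, and real systems use far more sophisticated delta encoding; the point is only that differences are detected at single-byte resolution.

```python
def byte_delta(old: bytes, new: bytes) -> list:
    """Record only the positions where the new version differs from
    (or extends past) the old one."""
    return [(i, b) for i, b in enumerate(new)
            if i >= len(old) or old[i] != b]

def apply_delta(old: bytes, changes: list, new_len: int) -> bytes:
    """Rebuild the new version from the old one plus the changed bytes."""
    buf = bytearray(old[:new_len])
    buf.extend(b"\0" * (new_len - len(buf)))
    for i, b in changes:
        buf[i] = b
    return bytes(buf)

old = b"hello world"
new = b"hello there"
delta = byte_delta(old, new)
assert apply_delta(old, delta, len(new)) == new
print(len(delta))  # → 5: five changed bytes stored instead of all eleven
```

Scanning every byte of every file this way is what drives the heavy CPU cost mentioned above.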

Data deduplication and backup

In addition to all of the above, when creating a backup copy, deduplication methods differ by the place of execution:

  • on the data source (the client);
  • on the storage side (the server);
  • on both at once (client-server).

Client-server deduplication

A combined method of data deduplication, in which the necessary processes can run on both the server and the client. Before sending data to the server, the client software first tries to determine which data has already been recorded.

For such deduplication, a hash is first calculated for each block of data, and the list of hash keys is sent to the server. The server compares this list against the keys it already holds and tells the client which blocks are new; the client then sends only those data blocks.

This method significantly reduces the load on the network, since only unique data is transmitted.
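The hash-list exchange can be sketched as follows. This is a simplified single-process model: the "network round trip" is just a function call, the 4-byte block size is chosen purely for readability, and all names are invented for the example.

```python
import hashlib

BLOCK = 4  # tiny block size to keep the example readable

def split(data: bytes) -> list:
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

def digest(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

# --- server side -----------------------------------------------------
server_store = {}  # digest -> block

def server_missing(hashes: list) -> list:
    """Step 2: the server answers which hashes it does not have yet."""
    return [h for h in hashes if h not in server_store]

def server_receive(blocks) -> None:
    """Step 3: the server stores the unique blocks the client sent."""
    for b in blocks:
        server_store[digest(b)] = b

# --- client side -----------------------------------------------------
def client_backup(data: bytes) -> int:
    """Steps 1 and 3: hash locally, ask the server what is missing,
    then transmit only those blocks. Returns bytes actually sent."""
    blocks = split(data)
    hashes = [digest(b) for b in blocks]
    missing = set(server_missing(hashes))   # step 2 (the network round trip)
    to_send = {h: b for h, b in zip(hashes, blocks) if h in missing}
    server_receive(to_send.values())
    return sum(len(b) for b in to_send.values())

sent = client_backup(b"AAAABBBBAAAACCCC")   # blocks: AAAA, BBBB, AAAA, CCCC
print(sent)  # → 12: the repeated block travels once, 12 bytes instead of 16
```

Running `client_backup` a second time on the same data sends zero bytes, since the server already holds every block; this is why the approach suits incremental backups so well.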

Deduplication on the client

This implies performing the operation directly on the data source, so the client’s own computing power is used. After the process is completed, the data is sent to the storage devices.

This type of deduplication is always implemented in software. The main disadvantage of the method is the high load on the client’s RAM and processor; the key advantage is the ability to transfer data over a low-bandwidth network.

Deduplication on the server

It is used when data is sent to the server in completely raw form, without encoding or compression. This type of deduplication is divided into software and hardware implementations.

Hardware type

It is implemented as a dedicated deduplication appliance: a hardware solution that combines the deduplication logic with the data recovery procedure.

The advantage of this method is the ability to offload the work from the server to a dedicated hardware unit, while the deduplication process itself remains fully transparent.

Software type

It implies the use of special software that performs all the necessary deduplication processes. With this approach, however, you always need to account for the load that deduplication will place on the server.

Advantages and disadvantages

The positive aspects of deduplication as a process include the following points:

  • High efficiency. According to EMC research, data deduplication can reduce storage capacity requirements by a factor of 10-30.
  • Efficiency over low-bandwidth networks, since only unique data is transferred.
  • The ability to create backups more often and retain them for longer.

The disadvantages of deduplication include:

  • The possibility of a data conflict if two different blocks produce the same hash key (a hash collision). In that case the database may be corrupted, causing a failure when restoring from a backup copy.
  • The larger the database, the higher the risk of such a collision. The solution is to increase the hash space, i.e. use a hash function with a longer output.
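The collision risk can be estimated with the birthday bound: for n stored blocks and a b-bit hash, the probability of at least one collision is roughly n² / 2^(b+1). A quick calculation (the block count and hash width below are illustrative assumptions) shows why enlarging the hash space resolves the problem:

```python
def collision_probability(n_blocks: int, hash_bits: int) -> float:
    """Rough birthday-bound estimate: n^2 / 2^(b+1)."""
    return n_blocks ** 2 / 2 ** (hash_bits + 1)

# One billion 4 KiB blocks (about 4 PB of data) under a 160-bit hash (SHA-1):
p = collision_probability(10 ** 9, 160)
print(p)  # on the order of 1e-31: negligible in practice
```

Even at petabyte scale the probability stays vanishingly small with a 160-bit hash, and moving to a 256-bit hash shrinks it by dozens of orders of magnitude more.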