WEIGHT BASED DEDUPLICATION FOR MINIMIZING DATA REPLICATION IN PUBLIC CLOUD STORAGE
Abstract
Minimizing data replication in cloud storage is a challenging problem when processing text data. The amount of digital data is growing exponentially, so storage space must be used efficiently. In a cloud storage environment, data replication provides high availability and fault tolerance. This paper proposes a weight-based deduplication approach, applied at the target level, to reduce storage space in the cloud. Storage space is used more efficiently by removing unpopular files from the secondary servers. Target-level deduplication consumes less processing power than source-level deduplication. Input text documents are stored in the Dropbox cloud. The Term Frequency (TF) and Named Entity Recognition (NER) features of the documents are extracted and stored in a MySQL database. Fresh text documents are then collected to identify popular and unpopular files: their TF and NER features are extracted, duplicate features are removed, and the remaining features are compared with those stored in the database. The comparison yields a list of relevant text documents, for which the document frequency, document weight, and threshold factor are computed. Based on the threshold factor, popular and unpopular files are identified. Popular files are replicated on all storage nodes to achieve availability. Before deduplication, the storage space occupied in the Dropbox cloud is 8.09 MB. After deduplication, the unpopular files are removed from the secondary storage nodes and the occupied space is 4.82 MB. Data replication is thereby minimized, and 45.6% of the cloud storage space is saved by applying weight-based deduplication.
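The classification step described above can be illustrated with a minimal Python sketch. The abstract does not give the formulas for document weight or threshold factor, so everything here is an assumption made for illustration: the weight of a stored document is taken as the sum of its term frequencies over the terms of the fresh document, the threshold is the mean weight of the relevant documents scaled by a hypothetical `threshold_factor`, and NER features are omitted. The function names are hypothetical and are not the authors' implementation.

```python
import re
from collections import Counter


def term_frequency(text):
    """Relative frequency of each lowercase alphanumeric token in text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


def document_weight(query_terms, doc_tf):
    """Assumed weight: sum of the document's TF over the distinct query terms."""
    return sum(doc_tf.get(term, 0.0) for term in set(query_terms))


def classify(stored_docs, fresh_text, threshold_factor=1.0):
    """Split stored documents into (popular, unpopular) name lists.

    Assumption: a document is popular when its weight reaches
    threshold_factor times the mean weight of the relevant documents,
    where "relevant" means sharing at least one term with fresh_text.
    """
    query_terms = set(term_frequency(fresh_text))  # duplicate features removed
    weights = {
        name: document_weight(query_terms, term_frequency(text))
        for name, text in stored_docs.items()
    }
    relevant = {name: w for name, w in weights.items() if w > 0}
    if not relevant:
        return [], sorted(stored_docs)
    threshold = threshold_factor * sum(relevant.values()) / len(relevant)
    popular = sorted(name for name, w in relevant.items() if w >= threshold)
    unpopular = sorted(name for name in stored_docs if name not in popular)
    return popular, unpopular


if __name__ == "__main__":
    stored = {
        "a.txt": "cloud storage cloud",
        "b.txt": "cloud kitchen recipes food",
        "c.txt": "garden flowers",
    }
    popular, unpopular = classify(stored, "cloud storage deduplication")
    print("replicate on all nodes:", popular)
    print("remove from secondary nodes:", unpopular)
```

Under these assumptions, popular files would stay replicated on every storage node, while unpopular files would be kept only on the primary node and removed from the secondary nodes.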