I hope this helps. As a partial answer, we found that on the worker nodes the GC was causing lots of long pauses (3–5 seconds) every six hours (the predefined GC span). We increased the heap from 1 GB to 4 GB and that seems to have solved it. What causes the heap to constantly fill up is still an open question, but that is beyond the scope of this answer. After the heap increase, there are no more errors (related to this) in the logs.
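If you want to sanity-check how much time a JVM is spending in GC before and after a heap change, a minimal sketch using the standard management beans is shown below. This is just an illustration, not the exact procedure we used; watching the GC log works just as well.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPauseCheck {
    public static void main(String[] args) {
        // Print cumulative collection counts and total time spent for each
        // garbage collector in this JVM (run inside the process you care about).
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```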
I wish one of those helps. The right choice for your use case would be the WebHDFS API. It allows systems running outside the Hadoop cluster to access and manipulate HDFS contents, and it doesn't require the client systems to have Hadoop binaries installed; you can manipulate remote HDFS over HTTP using curl itself. Please refer to the WebHDFS REST API documentation.
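For example, a directory listing is just an HTTP GET against the NameNode's web port (curl does the same thing with a one-liner). The sketch below uses plain Java; the hostname, port (typically 9870 on Hadoop 3.x, 50070 on 2.x), path, and `user.name` are placeholders you would adapt to your cluster.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsList {
    public static void main(String[] args) throws Exception {
        // LISTSTATUS on /tmp via the WebHDFS REST API; no Hadoop libraries needed.
        URL url = new URL(
                "http://namenode.example.com:9870/webhdfs/v1/tmp?op=LISTSTATUS&user.name=hdfs");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON describing the directory contents
            }
        }
    }
}
```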
How does the NameNode update the availability of DataNodes for HDFS writes in Hadoop?
This will help: the client writes the data to just one DataNode; the rest of the replication is taken care of by the DataNodes themselves, on the NameNode's instruction. Replica placement: while a DataNode receives a block's data from the client, it saves the data in a file that represents the block and simultaneously forwards the data to another DataNode, which is supposed to create another replica of the block.
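To make that concrete, here is a minimal client-side write using the Hadoop Java FileSystem API. The NameNode address and file path are assumptions for illustration; the point is that the client opens a single output stream and never contacts the second or third replica itself.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        // The client streams the bytes once; the first DataNode in the write
        // pipeline forwards them to the next replica, as described above.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/pipeline-demo.txt"), (short) 3)) {
            out.writeUTF("written once by the client, replicated by the datanodes");
        }
        fs.close();
    }
}
```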