The network connection from the primary to a replica was down for some time,
and we restored it. However, the primary could not connect to the restored
secondary. We restarted the whole cluster and the error was gone.
From what I have seen so far, I believe the underlying issue is network
connectivity/configuration within your deployment. My understanding so far:
- all servers are experiencing the same issue intermittently
- the issue seems to be spread across the whole cluster (i.e. sometimes
on mongos and other times on individual mongod)
- network connectivity issues within a replica set
- error messages in the form of “HostNotFound: unable to resolve DNS for
host” or “Couldn’t get a connection within the time limit”
- restarting the cluster seems to solve the problem for a while (likely
due to the refresh of the DNS cache)
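Since restarting appears to refresh the DNS cache, one quick check of the DNS angle is whether every node can resolve the hostnames of its peers. Below is a minimal sketch; the host list is a placeholder, so substitute the actual hostnames from your replica set configuration (rs.conf().members[].host):

```shell
#!/bin/sh
# Placeholder host list -- replace with the real hostnames from your
# replica set / sharded cluster configuration.
HOSTS="localhost"

for h in $HOSTS; do
  # getent consults the same resolver path (nsswitch/DNS) that mongod uses,
  # so a failure here mirrors the "unable to resolve DNS for host" error.
  if getent hosts "$h" > /dev/null 2>&1; then
    echo "OK: $h resolves"
  else
    echo "FAIL: $h does not resolve"
  fi
done
```

Running this from each server (mongos and mongod hosts alike) around the time the errors occur can show whether resolution is failing intermittently on specific machines.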
All these signs seem to point to an issue in your network setup
(e.g. DNS configuration, network hardware, etc.) and not in your MongoDB
deployment. The output of sh.status() doesn't show anything notable: it
seems to indicate that the cluster is operating normally, the config
servers are consistent with each other (which is vital to the operation of
a sharded cluster), and cluster balancing seems to operate normally as well.

One more thing: we also confirmed the dbhash of all config servers, and it
was all fine.
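For reference, the config-server consistency check can be repeated at any time with the dbHash command against each config server. This is a hedged sketch; the hostnames and port below are placeholders for your actual CSRS members, and it requires a reachable deployment to run:

```shell
#!/bin/sh
# Placeholder config server addresses -- substitute your own CSRS members.
for cs in cfg1.example.net:27019 cfg2.example.net:27019 cfg3.example.net:27019; do
  # dbHash returns an md5 over the collections of the given database;
  # the value should match across all config servers.
  echo "--- $cs ---"
  mongosh --quiet --host "$cs" \
    --eval 'print(db.getSiblingDB("config").runCommand({dbHash: 1}).md5)'
done
```

If the printed md5 values ever diverge between config servers, that would point to a real metadata consistency problem rather than a transient network one.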
Is there a pattern to this issue? For example, did you observe these
network-related errors happening more often at particular times, under
particular load, etc.?