This is the 38th article in my technical writing project (translations included). The modest goal: 999 articles, at least 2 per week.

I love TiDB's design philosophy. For example: why must a database be either OLAP or OLTP instead of both? When data volume grows, must sharding really be that user-hostile? Is hand-rolling a distributed setup fun? Maintaining one is a nightmare!

Especially for small and medium-sized teams, building a Hadoop stack just for data analysis is like raising a calf into a cow just to get milk: a bottomless pit all the way down.

Weigh the learning cost, the migration cost (highly compatible with MySQL, though not 100%), the operations cost (Ansible is supported, and the team already has Ansible experience), the usage cost (versus Hadoop), the hardware cost (versus Hadoop), and the benefits: no manual sharding, both OLAP and OLTP, distributed transactions, TiSpark, TiKV, plus built-in synchronization tools.

OK, enough of what reads like an ad. This post is about how miserable it was when the whole building lost power and TiKV files got corrupted, and how, under remote text guidance from the TiDB folks, I went step by step from a "drop the database and flee" mindset to actually saving the cluster. And, damn, what a bumpy road back it was.

Other colleagues were in charge of TiDB before; I had just taken over, and my grasp of the whole thing was still rudimentary. True failure-driven learning!

This post is mostly a retrospective of the incident, with plenty of trivial detail; if you don't care for that, jump straight to the conclusion.

Cluster environment

name            ip             service
tikv-1          192.168.1.200  tikv
tikv-2          192.168.1.201  tikv
tikv-3          192.168.1.202  tikv (bad regions)
pd              192.168.1.203  pd
tidb            192.168.1.204  tidb, monitoring
sn-data-node-1  192.168.1.216  tidb
sn-data-node-2  192.168.1.217  tidb
tikv-4          192.168.1.218  tikv (bad regions)

The beginning of the disaster

After the unexpected power outage, I SSHed to the control machine and ran ansible-playbook start.yml.
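
For reference, a minimal sketch of that start sequence, assuming the default tidb-ansible layout on the control machine:

cd /home/tidb/tidb-ansible    # path assumed; wherever tidb-ansible is checked out
ansible-playbook start.yml    # starts components in order: PD, TiKV, TiDB, monitoring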

Things looked bad, but giving up wasn't an option. I stopped and started the cluster once more; same result. Now things were getting messy.

Fortunately, I had been quietly lurking in the official TiDB user group, idly listening to the experts talk shop, and some of it had rubbed off. Time to roll up my sleeves and get to work.

Locating the problem

Round 1: The fruit on the tree

First look at the official documentation


Since tidb wouldn't come up, the instinct was to check the tidb logs first. (What I actually should have checked first was http://prometheus:9090/targets; unfamiliarity sent me down the wrong path. Why not Grafana? Because the tidb step comes late in the playbook and Ansible had already bailed out before reaching it, so Grafana never started.)
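
What a saner first check might have looked like; a sketch, with the log path assumed from a default deployment and the monitoring address taken from the cluster table above:

# tail the tidb log on the node that failed to start
tail -n 200 /home/tidb/deploy/log/tidb.log

# then open the Prometheus targets page in a browser to see which targets are down:
# http://192.168.1.204:9090/targets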






# first-line checks from the docs: inode usage and disk usage
df -i
df -h



At this point, by reading the official documentation and taking a stab at Prometheus, I had at least pieced together the overall architecture of TiDB.






(Figure: overall architecture of TiDB)


Summary: Round 1 ended with two TiKV nodes down, and shamefully little to show for it.

Round 2: Sitting in front of the tree

Note: unless stated otherwise, the commands below are meant to be run with TiKV stopped.



# show the 50 lines leading up to each TiKV startup banner
# (log file name assumed; the path depends on your deployment)
grep -B 50 Welcome tikv.log

For more grep options (-A, -B, -C), see man grep or grep(1). Since TiKV prints a "Welcome" banner on every start, it's reasonable to assume that whatever appears just before each "Welcome" explains the previous exit.
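
Combined with tail, the same trick isolates just the most recent crash; log path assumed from a default deployment:

# keep only the context before the latest startup banner
grep -B 50 Welcome /home/tidb/deploy/log/tikv.log | tail -n 60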






# list every store in the cluster, with its state
./pd-ctl store -d -u http://127.0.0.1:2379
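
Each store in the output carries a state (Up, Offline, Down, Tombstone). A single suspect store can be inspected the same way; store id 10 here is a placeholder:

./pd-ctl -d -u http://127.0.0.1:2379 store 10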




At this point a savior pointed me to the TiKV Control usage instructions, specifically the section on recovering damaged MVCC data.





Round 2 ended with a lifeline found: TiKV Control and PD Control. But it wouldn't be that simple.

Round 3: It's getting better

The TiKV node looked hopeless, so store delete was performed:

/home/tidb/tidb-ansible/resources/bin/pd-ctl -u "http://172.16.10.1:2379" -d store delete 10







# set the store's state back to Up via the PD API
curl -X POST "http://${pd_ip}:2379/pd/api/v1/store/${store_id}/state?state=Up"

Per the PD API, this flips the store's state back to Up, undoing the hasty store delete.



# list the bad regions on this TiKV instance (run with TiKV stopped)
tikv-ctl --db /path/to/tikv/db bad-regions








# stop only the TiKV services, leaving the rest of the cluster alone
ansible-playbook stop.yml -l tikv_servers


To write the bad regions off as tombstones:

tikv-ctl --db /path/to/tikv/db tombstone -p 127.0.0.1:2379 -r <region_id>

Run pd-ctl region 31101 to inspect the region's state.
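
For reference, a sketch of that check, with the PD address assumed to match the earlier pd-ctl calls:

# show the region's peers, leader, and epoch
./pd-ctl -d -u http://127.0.0.1:2379 region 31101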



The operator add remove-peer command failed to delete the peer.

At this point the big-name expert Qi Zheng appeared, with two suggestions:



recover-mvcc
operator add remove-peer
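
Sketches of both full invocations, following the TiKV Control and PD Control docs; region 31101 and store 10 are placeholders taken from this incident:

# run against a stopped TiKV instance: repair the damaged MVCC data of a region
tikv-ctl --db /path/to/tikv/db recover-mvcc -r 31101 -p 127.0.0.1:2379

# ask PD to remove region 31101's peer on store 10 (PD must be running)
./pd-ctl -d -u http://127.0.0.1:2379 operator add remove-peer 31101 10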

Round 4: It looks like a solution

One mouse dropping shouldn't ruin the whole pot of soup: don't throw away an entire store because some of its regions are bad. Try to forcibly restore the damaged regions first and confirm the rest are healthy.










At this point, under the guidance of Qi Zheng, I upgraded tikv-ctl.





After repeating the operations above, the node finally came up.

The node at 192.168.1.218 had six bad regions.



unsafe-recover remove-fail-stores
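
The full invocation, per the TiKV Control documentation; a sketch in which store id 10 stands in for the failed store (run with TiKV stopped):

# purge the failed store from the local region metadata
tikv-ctl --db /path/to/tikv/db unsafe-recover remove-fail-stores -s 10 --all-regions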


Summary: by the end of Round 4 the service was ready to start. The recipe (a consolidated sketch follows the list):

  1. First, stop TiKV.
  2. If fewer than half of the regions are bad, try recover-mvcc.
  3. If more than half are bad, it doesn't matter: run unsafe-recover remove-fail-stores, then tikv-ctl --db /path/to/tikv/db tombstone -p 127.0.0.1:2379 -r 31101,xx,xx,xx
  4. Start TiKV again.
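
Putting Round 4 together, a consolidated sketch, assuming the default tidb-ansible layout, PD reachable at 127.0.0.1:2379, and placeholder store/region ids:

# 1. stop only the TiKV services
ansible-playbook stop.yml -l tikv_servers

# 2. find the damage; if fewer than half the regions are bad, repair in place
tikv-ctl --db /path/to/tikv/db bad-regions
tikv-ctl --db /path/to/tikv/db recover-mvcc -r 31101 -p 127.0.0.1:2379

# 3. if more than half are bad, drop the failed store and tombstone the regions
tikv-ctl --db /path/to/tikv/db unsafe-recover remove-fail-stores -s 10 --all-regions
tikv-ctl --db /path/to/tikv/db tombstone -p 127.0.0.1:2379 -r 31101

# 4. start TiKV again
ansible-playbook start.yml -l tikv_servers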

Round 5: Final round

You'd think everything was fine by now? Fate had one more trick.


We lost some data, but not much.

Conclusion

  1. I wasn't familiar with TiDB; much of my understanding was superficial.
  2. I wasn't familiar with the TiDB documentation and tooling.
  3. The TiDB documentation isn't always clear. For example, the troubleshooting page doesn't link to, or even mention, pd-ctl and tikv-ctl; the pd-ctl and tikv-ctl pages don't say where to download the tools; and the tools download page doesn't say which tools are included. Very zen.
  4. Thanks to the kind guidance of the experts in the user group.
  5. If TiKV is the problem, stop TiKV first.
  6. If fewer than half of the regions are damaged, try recover-mvcc.
  7. If more than half are damaged, try unsafe-recover remove-fail-stores and then write the regions off as tombstones.
  8. Start TiKV again.
  9. This pairs well with the official write-up on how to restore a cluster when TiKV nodes are damaged.
  10. Experimenting, especially extreme testing and practicing recovery, buys a lot of experience.
  11. The whole saga has no neat head or tail, but better to come out missing an arm than dead.

A little advertisement

If you can do without the OLTP side, ClickHouse is also worth a try. Here are some ClickHouse articles I put together earlier:

  • 031: Redash for data visualization (43 data sources supported)
  • 033: The most comprehensive guide: 5 ways to migrate MySQL to ClickHouse
  • 035: Resolving StreamSets full JDBC mode data duplication issues
