1. Overview of features
1. Query summary information (action, affected rows, avg. frequency, avg. latency, bytes in, bytes out, count, CPU time, errors, failure rate)
2. Trends and period-over-period comparison
3. Faults: requests continue to arrive but do not get serviced by the system (mysqld, disk). {Why care? So you can prevent a fault from escalating into an outage, and ask: are some of my database problems caused by small, hidden faults?}
4. Resources
2. How to architect and build highly observable systems
External quality of service and internal sufficiency of resources
- Customer's viewpoint (external; if these four metrics are fine, customers have no problem; see the sketch after this list):
concurrency (requests in process, backlog),
error rate,
latency (wait + processing; the 99th percentile is more useful than the average),
throughput (completed requests per second over a time interval)
- Internal (Brendan Gregg's USE method):
utilization (CPU, memory, storage, network)
saturation
errors
- Formal performance theory: queueing theory, Little's Law
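A minimal sketch of how the four external metrics could be computed from one window of request records. The record fields (start, duration, error), the one-minute window, and the function name are my assumptions, not from the source:

```python
from dataclasses import dataclass

@dataclass
class Request:
    start: float     # seconds since window start (assumed field)
    duration: float  # wait + processing time in seconds
    error: bool      # whether the request failed

def qos_metrics(requests, window_seconds=60.0):
    """Compute the four customer-facing metrics over one observation window."""
    completed = [r for r in requests if r.start + r.duration <= window_seconds]
    throughput = len(completed) / window_seconds              # completed requests per second
    error_rate = sum(r.error for r in completed) / max(len(completed), 1)
    latencies = sorted(r.duration for r in completed)
    p99 = latencies[int(0.99 * (len(latencies) - 1))] if latencies else 0.0
    # Average concurrency via Little's Law: N = X * R, i.e. total busy time per window
    concurrency = sum(r.duration for r in completed) / window_seconds
    return {"throughput": throughput, "error_rate": error_rate,
            "p99_latency": p99, "concurrency": concurrency}

# Example usage with synthetic data
reqs = [Request(start=i * 0.1, duration=0.05 + (i % 7) * 0.01, error=(i % 50 == 0))
        for i in range(500)]
print(qos_metrics(reqs))
```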
The article shares some experience: the WARNING log level is useless; FATAL is effectively a panic, so returning an error is better; with ERROR you can't tell whether the error was actually handled. Hence the recommendation to keep only two levels, INFO and DEBUG, where DEBUG is verbose and can be switched off.
There should also be online profiling capability: what requests are running, what states they are in, and the ability to cancel them.
When analyzing in the database domain, take care to reduce the diversity of the workload by classifying queries.
Adaptive fault detection versus anomaly detection
Causes: resource overload/saturation, storms of queries, bad application behavior, SELECT FOR UPDATE or other locking queries, internal scalability problems such as mutex contention around a long-running operation, or intensive periodic tasks. For example, the query cache mutex in MySQL has caused server stalls, and InnoDB checkpoint stalls or dirty-page flushing can cause similar effects. In the application layer, bad behavior such as stampedes to regenerate an expired cache entry is a common culprit.
work isn't getting done
Why not anomaly detection?
1.This is because systems are continually anomalous in a variety of ways. We humans tend to greatly underestimate how crazily our systems behave all the time.
2.false-alarm rate vs missed-alarm rate
They will miss most true faults and alarm you on things that are just “normal abnormality.”
3. Practical scalability analysis with the Universal Scalability Law
Scalability with throughput
- Theoretical formulas
Why scaling is nonlinear: contention (e.g., distributing and aggregating data) and crosstalk.
Neil Gunther's USL captures the effects of linear speedup, contention delay, and coherency delay due to crosstalk.
Coefficients: performance coefficient λ, scale (concurrency or node count) N, throughput X
Amdahl's Law: X(N) = λN / (1 + σ(N - 1))
USL (adds a coherency term): X(N) = λN / (1 + σ(N - 1) + κN(N - 1))
- Parameter definitions
Removing noise: use a scatterplot and a time-series plot to ensure you're working with a relatively consistent set of data; drop outliers or pick averaged data from specific time windows.
USL package from CRAN
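The source uses the usl package from CRAN; as an assumed alternative, here is a sketch that fits λ, σ, κ with SciPy's curve_fit from (N, throughput) measurements. The data points are illustrative, not from the source:

```python
import numpy as np
from scipy.optimize import curve_fit

def usl(N, lam, sigma, kappa):
    """Universal Scalability Law: throughput X as a function of scale N."""
    return lam * N / (1 + sigma * (N - 1) + kappa * N * (N - 1))

# Measured (concurrency, throughput) pairs -- made-up numbers for illustration
N = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
X = np.array([955, 1878, 3548, 6231, 9569, 11401, 10500], dtype=float)

(lam, sigma, kappa), _ = curve_fit(usl, N, X, p0=[X[0], 0.01, 0.001],
                                   bounds=(0, np.inf))
print(f"lambda={lam:.1f}  sigma={sigma:.4f}  kappa={kappa:.6f}")
```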
Scalability with response time
Little's Law: N = XR (N is concurrency, X is throughput, R is response time.)
Response time vs. throughput
Linearly scalable system: R(X) = 1/λ
Adding the contention coefficient: R(X) = (1 - σ) / (λ - σX)
Adding the coherency term as well: R(X) becomes the root of a quadratic in R (see the derivation sketch after this list).
Conversely:
R is not determined by X alone, so modeling it that way is pointless.
When retrograde scaling occurs, it's a lost cause with no practical purpose.
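A sketch of the derivation behind the three response-time forms above, obtained by substituting Little's Law N = XR into each throughput model (my own working under the stated model, not quoted from the source):

```latex
\text{Linear: } X(N)=\lambda N,\; N=XR \;\Rightarrow\; R(X)=\frac{1}{\lambda}

\text{Contention only: } X=\frac{\lambda XR}{1+\sigma(XR-1)}
  \;\Rightarrow\; 1+\sigma XR-\sigma=\lambda R
  \;\Rightarrow\; R(X)=\frac{1-\sigma}{\lambda-\sigma X}

\text{Full USL: } 1+\sigma(XR-1)+\kappa XR(XR-1)=\lambda R
  \;\Rightarrow\; \kappa X^2R^2+\bigl((\sigma-\kappa)X-\lambda\bigr)R+(1-\sigma)=0

R(X)=\frac{\lambda+(\kappa-\sigma)X-\sqrt{\bigl(\lambda+(\kappa-\sigma)X\bigr)^2-4\kappa(1-\sigma)X^2}}{2\kappa X^2}
```

The smaller root is the physical one: letting κ → 0 recovers the contention-only form.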
Limitations
The USL only models increasing scale up to the point where queueing takes over. For example, it covers adding threads up to one per core, but it cannot forecast how many servers are needed to serve the queues beyond that, because that is a different model.
Evaluation
Fit the parameters and derive Nmax (the scale at which throughput peaks).
From the acceptable response time, derive N.
Derive the corresponding throughput.
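A worked sketch of these three steps using fitted USL coefficients; the numeric values for σ and κ and the response-time target are illustrative assumptions:

```latex
% 1. Peak-throughput point (maximize X(N) with respect to N):
N_{\max} = \sqrt{\frac{1-\sigma}{\kappa}}
  \qquad\text{e.g. } \sigma = 0.02,\ \kappa = 0.0005 \;\Rightarrow\; N_{\max} \approx 44

% 2. Response time at scale N, from Little's Law R = N / X(N):
R(N) = \frac{1 + \sigma(N-1) + \kappa N(N-1)}{\lambda}
% Given an acceptable R_{target}, solve
% \kappa N^2 + (\sigma-\kappa)N + (1 - \sigma - \lambda R_{target}) = 0
% for the largest N that keeps R(N) under the target.

% 3. Throughput at that N:
X(N) = \frac{N}{R(N)}
```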
Optimizing scalability
The fitted coefficients show how severe contention and crosstalk are.
Caveats
Keep all other variables constant at every scale. For example, each node should receive the same kind of input and hold the same data as before.
4. Estimating CPU per query with weighted linear regression
CPU-execution-time
Some requests are aggregated and deferred to be done later, often as a single IO operation; because of this disconnect, accurate cost accounting is impossible.
Regressing on per-query features over whole queries is not a good approach:
for example, using query1's execution time, rows, and so on (an infrequent query that contributes relatively little to total CPU) can go wrong, because a query may burn a lot of CPU at the start (allocation, parsing, etc.) and little CPU later during its IO phase, so you can end up with a nonsensical result like CPU = -K * Rquery. Instead, analyze the time series frame by frame, regressing each frame's CPU and IO against the query characteristics within that frame.
Sigh: the paper's simple assumptions are not stated simply or clearly; many things are obvious but never written down. The most basic assumption here is that within each frame CPU is apportioned to every query equally (regardless of the query's type or what it is doing internally, CPU is attributed by query count).
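A minimal sketch of the frame-by-frame idea: each time frame contributes one observation (per-class query counts against that frame's CPU time), and weighted least squares recovers a per-query CPU cost for each class. The frame data, uniform weights, and use of numpy's lstsq are my assumptions, not the paper's implementation:

```python
import numpy as np

# One row per time frame: how many queries of each class executed in that frame.
# Columns are query classes (e.g. fingerprints); values are illustrative.
counts = np.array([
    [120, 10, 3],
    [ 90, 25, 0],
    [200,  5, 7],
    [150, 15, 2],
    [ 60, 40, 1],
], dtype=float)

# Total CPU seconds consumed by mysqld in each frame (illustrative).
cpu = np.array([1.9, 2.1, 2.6, 2.2, 2.4])

# Optional weights, e.g. to down-weight noisy frames; uniform here.
w = np.ones(len(cpu))

# Weighted least squares: scale each row by sqrt(weight), then solve counts @ cost ~= cpu.
sw = np.sqrt(w)[:, None]
cost_per_query, *_ = np.linalg.lstsq(counts * sw, cpu * sw.ravel(), rcond=None)
print("estimated CPU seconds per query, by class:", cost_per_query)
```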
Evaluation:
Quantitatively: 1. compare with existing tools; 2. compare against known results.
1. Goodness of fit: correlation coefficient; R²
2. Standard error: error terms; t-statistics for the slope and intercept
3. Statistical significance: mean absolute percentage error (MAPE)
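For reference, a small self-contained sketch of computing two of these numbers for a fitted model; the actual/predicted values are synthetic, not the paper's data:

```python
import numpy as np

actual = np.array([1.9, 2.1, 2.6, 2.2, 2.4])          # observed CPU per frame (illustrative)
predicted = np.array([1.85, 2.15, 2.55, 2.25, 2.35])  # model predictions (illustrative)

# Goodness of fit: R^2
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Prediction accuracy: mean absolute percentage error
mape = np.mean(np.abs((actual - predicted) / actual)) * 100

print(f"R^2 = {r_squared:.3f}, MAPE = {mape:.1f}%")
```

The t-statistics for slope and intercept additionally need the design matrix, so they are omitted from this sketch.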
Interpretability
Visibility
Guess: should IO be modeled using the number of affected rows?
5. Practical query optimization
Response time, consistency
1. You can't improve what you don't measure
performance_schema: aggregates
slow query log
TCP traffic capture: tcpdump, libpcap
2. Classification (use performance_schema or pt-query-digest; for queries you log yourself, how do you aggregate them? You end up writing your own code; DBSeer is a useful reference)
Don't create too many classes; roughly the top 20.
Looking at queries one by one hides problems; for example, many short, frequent, fast queries can be merged into a single class.
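A sketch of one way to classify self-collected queries: normalize literals so structurally identical statements share a fingerprint, then keep only the top-N classes by total time. The regexes, field layout, and threshold are my own simplifications in the spirit of pt-query-digest/DBSeer, not copied from them:

```python
import re
from collections import defaultdict

def fingerprint(sql: str) -> str:
    """Collapse literals and whitespace so similar queries share one class."""
    s = sql.lower().strip()
    s = re.sub(r"\s+", " ", s)                 # normalize whitespace
    s = re.sub(r"'[^']*'", "?", s)             # string literals -> ?
    s = re.sub(r"\b\d+\b", "?", s)             # numeric literals -> ?
    s = re.sub(r"in \([?, ]+\)", "in (?)", s)  # collapse IN lists
    return s

def top_classes(samples, n=20):
    """samples: iterable of (sql_text, latency_seconds); returns top-n classes by total time."""
    totals = defaultdict(lambda: [0, 0.0])     # fingerprint -> [count, total_latency]
    for sql, latency in samples:
        entry = totals[fingerprint(sql)]
        entry[0] += 1
        entry[1] += latency
    return sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)[:n]

samples = [
    ("SELECT * FROM users WHERE id = 42", 0.002),
    ("SELECT * FROM users WHERE id = 7", 0.003),
    ("SELECT * FROM orders WHERE user_id IN (1, 2, 3)", 0.010),
]
for fp, (count, total) in top_classes(samples):
    print(f"{count:5d}  {total:8.3f}s  {fp}")
```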
6. Queueing theory
- Queues
Queues improve system utilization, availability, and throughput, but at the expense of latency and resource consumption.
- When does queueing happen?
even when there's more than enough capacity to do the work
irregular arrivals
irregular job sizes
waste
- What makes queueing worse?
high utilization, variability, few servers
Theory
Little's Law (A = arrival rate, R = residence time, Wq = queueing time, S = service time, M = number of servers):
L = A * R
Lq = A * Wq
U = A * S / M = A / μ (where μ = M/S is the total service rate)
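A quick worked example of these identities; the arrival rate, response time, and service time are made-up numbers:

```latex
A = 200\ \text{req/s},\quad R = 40\ \text{ms}
  \;\Rightarrow\; L = A \cdot R = 200 \times 0.04 = 8 \text{ requests in the system}

S = 5\ \text{ms},\quad M = 2
  \;\Rightarrow\; U = \frac{A \cdot S}{M} = \frac{200 \times 0.005}{2} = 0.5 \;(50\%\ \text{utilization})
```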
Basic queueing models (Kendall notation):
M/M/1, M/M/m, M/G/1, M/G/m
The first letter (M): arrivals are random and independent, generated by a Poisson process; inter-arrival times are exponentially distributed with mean 1/A. This is the standard Markovian, or memoryless, assumption.
The second letter (G): no assumption about the service-time distribution (general); a Gaussian distribution can be assumed if needed.
The third position is the number of servers.
M/M/1:
R = S / (1-U)
L = U/(1-U)
=>Lq = U^2/(1-U), Wq = (U * S)/(1-U)
http://perfdynamics.blogspot....
That post derives the percentile distribution of R: R50 ≈ (2/3)R and R95 ≈ (9/3)R, i.e. about 3R (R is exponentially distributed in M/M/1).
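A tiny sketch showing how steeply the M/M/1 formulas above blow up as utilization approaches 1; the 5 ms service time is an assumed value:

```python
# M/M/1: residence time R = S / (1 - U); number in system L = U / (1 - U)
S = 0.005  # mean service time in seconds (assumed)
for U in (0.5, 0.8, 0.9, 0.95, 0.99):
    R = S / (1 - U)
    L = U / (1 - U)
    print(f"U={U:.2f}  R={R * 1000:6.1f} ms  L={L:6.1f} in system")
```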
M/M/m
Erlang C formula: too complex...
An approximation (applies to both M and G):
When the distributions are Markovian, the right-hand factor is 1.
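The note doesn't spell the approximation out; it is most likely of the Allen-Cunneen form, where the M/M/m queueing time is scaled by a variability factor that reduces to 1 under Markovian assumptions. Recorded here as my best guess, not a quote from the source:

```latex
W_q(G/G/m) \;\approx\; W_q(M/M/m) \cdot \frac{C_a^2 + C_s^2}{2}
\qquad\text{(both squared coefficients of variation equal 1 when M, so the factor is 1)}
```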
A simple rule of thumb
Scale spare capacity by the square root of the load increase.
For example: there are 10 servers at 80% utilization, and peak load will be 3x. How many servers are needed?
8 * 3 + 2 * sqrt(3) = 27.46, so about 28 (busy servers scale linearly, spare servers scale with the square root).
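A small sketch of the square-root rule used in the arithmetic above; the function name and the rounding policy are mine:

```python
import math

def servers_needed(current_servers, utilization, load_multiplier):
    """Square-root staffing: busy servers scale linearly with load,
    spare (headroom) servers scale with the square root of the load increase."""
    busy = current_servers * utilization
    spare = current_servers - busy
    needed = busy * load_multiplier + spare * math.sqrt(load_multiplier)
    return math.ceil(needed)

# 10 servers at 80% utilization, peak load 3x -> 8*3 + 2*sqrt(3) = 27.46 -> 28
print(servers_needed(10, 0.80, 3))
```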