
Today we take a close look at how Linux processes network packets on the receive side. We stop at the link layer and do not follow packets up into the higher-level protocols. I previously reposted a fairly thorough analysis of NAPI; this post walks the current receive path against the Linux kernel source.

NAPI places two requirements on a device: 1. the device must have enough buffering to hold several packets; 2. it must be possible to disable the device's interrupt without affecting other operations. Most devices today support NAPI, but for backward compatibility the kernel still supports the old pure-interrupt model.

Let's look first at how NAPI processing works. Interrupt handling is split into a top half and a bottom half: the top half does as little as possible, just saving the incoming data, while the bottom half does the real work. To drive the bottom half, each CPU maintains a softnet_data structure. We will not describe the whole structure, only the NAPI-related parts. It has a poll_list field linking all devices currently being polled, and it also maintains two queues, input_pkt_queue and process_queue, which serve legacy (non-NAPI) processing: the top-half interrupt handler enqueues packets onto the former, and during bottom-half processing the latter is used as a staging queue, so the former receives while the latter is consumed. Finally there is a napi_struct member named backlog, a virtual device that stands in for all non-NAPI devices on the poll list.

A NAPI-capable device has its own receive queue for incoming packets, and its own napi_struct, which represents the device on poll_list. The device must also provide a poll function; when the device's turn comes around in the poll loop, poll is called to process its data. That is the basic logic; now the concrete flow.

Interrupt top half, non-NAPI case: the top-half entry point is netif_rx, in net/core/dev.c:

```c
int netif_rx(struct sk_buff *skb)
{
	int ret;

	/* if netpoll wants it, pretend we never saw it */
	if (netpoll_rx(skb))
		return NET_RX_DROP;

	/* check the timestamp */
	net_timestamp_check(netdev_tstamp_prequeue, skb);

	trace_netif_rx(skb);
#ifdef CONFIG_RPS
	if (static_key_false(&rps_needed)) {
		struct rps_dev_flow voidflow, *rflow = &voidflow;
		int cpu;

		/* disable preemption */
		preempt_disable();
		rcu_read_lock();

		cpu = get_rps_cpu(skb->dev, skb, &rflow);
		if (cpu < 0)
			cpu = smp_processor_id();

		/* enqueue the packet */
		ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);

		rcu_read_unlock();
		preempt_enable();
	} else
#endif
	{
		unsigned int qtail;
		ret = enqueue_to_backlog(skb, get_cpu(), &qtail);
		put_cpu();
	}
	return ret;
}
```

Leaving the RPS branch aside for now, this simply calls enqueue_to_backlog to put the skb on the CPU's global queue, input_pkt_queue:

```c
static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
			      unsigned int *qtail)
{
	struct softnet_data *sd;
	unsigned long flags;

	/* get this CPU's softnet_data */
	sd = &per_cpu(softnet_data, cpu);

	/* disable local interrupts */
	local_irq_save(flags);

	rps_lock(sd);
	/* input_pkt_queue is still under the limit, so we may queue */
	if (skb_queue_len(&sd->input_pkt_queue) <= netdev_max_backlog) {
		/* input_pkt_queue is non-empty, so the backlog device has
		 * already been scheduled; just enqueue the skb */
		if (skb_queue_len(&sd->input_pkt_queue)) {
enqueue:
			__skb_queue_tail(&sd->input_pkt_queue, skb);
			input_queue_tail_incr_save(sd, qtail);
			rps_unlock(sd);
			local_irq_restore(flags);
			return NET_RX_SUCCESS;
		}

		/* Schedule NAPI for backlog device
		 * We can use non atomic operation since we own the queue lock
		 */
		/* otherwise the backlog (virtual) device must be scheduled
		 * first, then the skb enqueued. If NAPI_STATE_SCHED is already
		 * set in napi_struct.state, the device is already scheduled
		 * and need not be scheduled again */
		if (!__test_and_set_bit(NAPI_STATE_SCHED, &sd->backlog.state)) {
			if (!rps_ipi_queued(sd))
				____napi_schedule(sd, &sd->backlog);
		}
		goto enqueue;
	}

	/* the queue is full: the packet must be dropped */
	sd->dropped++;
	rps_unlock(sd);

	local_irq_restore(flags);

	atomic_long_inc(&skb->dev->rx_dropped);
	kfree_skb(skb);
	return NET_RX_DROP;
}
```

The logic is straightforward; the point to note is that the device must be scheduled before packets can be queued to it. Scheduling is done by ____napi_schedule, which appends the device's napi_struct to the tail of softnet_data's poll_list and then raises the receive softirq, so the bottom half runs the next time softirqs are processed. The source:

```c
static inline void ____napi_schedule(struct softnet_data *sd,
				     struct napi_struct *napi)
{
	list_add_tail(&napi->poll_list, &sd->poll_list);
	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
}
```

NAPI case: the NAPI top half is much simpler than the non-NAPI one. The core of the e100 NIC's interrupt handler, e100_intr:

```c
if (likely(napi_schedule_prep(&nic->napi))) {
	e100_disable_irq(nic);        /* mask this device's interrupt */
	__napi_schedule(&nic->napi);  /* add the device to the poll list */
}
```

The if condition checks whether the device may be scheduled, mainly two things: 1. whether it is already scheduled, and 2. whether a NAPI disable is pending. If it qualifies, the handler masks the device's interrupt and calls __napi_schedule to add the device to the poll list, switching it into polling mode.

Analysis: comparing the two paths shows how they relate. softnet_data is the central structure; under NAPI it mainly maintains the poll list. Every NAPI device has its own napi_struct on that list. A non-NAPI device has no napi_struct of its own, so to reuse the NAPI machinery the backlog member of softnet_data is added to the poll list as a single shared virtual device. Likewise, non-NAPI devices have no per-device receive queue, so softnet_data's input_pkt_queue serves as the global receive queue; this makes legacy devices compatible with NAPI-style processing. One important difference remains: with NAPI, only the first packet arrives via an interrupt and subsequent packets are collected by polling, whereas without NAPI every packet is signalled by an interrupt.

Bottom half: as mentioned earlier, packet receive and transmit use two different softirqs; the receive softirq NET_RX_SOFTIRQ is handled by net_rx_action:

```c
static void net_rx_action(struct softirq_action *h)
{
	struct softnet_data *sd = &__get_cpu_var(softnet_data);
	unsigned long time_limit = jiffies + 2;
	int budget = netdev_budget;
	void *have;

	local_irq_disable();

	/* walk the poll list */
	while (!list_empty(&sd->poll_list)) {
		struct napi_struct *n;
		int work, weight;

		/* If softirq window is exhuasted then punt.
		 * Allow this to run for 2 jiffies since which will allow
		 * an average latency of 1.5/HZ.
		 */
		/* budget spent or time limit reached */
		if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit)))
			goto softnet_break;

		local_irq_enable();

		/* Even though interrupts have been re-enabled, this
		 * access is safe because interrupts can only add new
		 * entries to the tail of this list, and only ->poll()
		 * calls can remove this head entry from the list.
		 */
		/* take the first device on the list */
		n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list);

		have = netpoll_poll_lock(n);

		weight = n->weight;

		/* This NAPI_STATE_SCHED test is for avoiding a race
		 * with netpoll's poll_napi(). Only the entity which
		 * obtains the lock and sees NAPI_STATE_SCHED set will
		 * actually make the ->poll() call. Therefore we avoid
		 * accidentally calling ->poll() when NAPI is not scheduled.
		 */
		work = 0;
		/* if the device is scheduled, call its poll function;
		 * weight caps how much work it may do */
		if (test_bit(NAPI_STATE_SCHED, &n->state)) {
			work = n->poll(n, weight);
			trace_napi_poll(n);
		}

		WARN_ON_ONCE(work > weight);

		/* decrement the global budget */
		budget -= work;

		local_irq_disable();

		/* Drivers must not modify the NAPI state if they
		 * consume the entire weight. In such cases this code
		 * still "owns" the NAPI instance and therefore can
		 * move the instance around on the list at-will.
		 */
		/* work == weight: the device used its full allowance */
		if (unlikely(work == weight)) {
			if (unlikely(napi_disable_pending(n))) {
				local_irq_enable();
				napi_complete(n);
				local_irq_disable();
			} else {
				if (n->gro_list) {
					/* flush too old packets
					 * If HZ < 1000, flush all packets.
					 */
					local_irq_enable();
					napi_gro_flush(n, HZ >= 1000);
					local_irq_disable();
				}
				/* move the device to the list tail after
				 * each full run */
				list_move_tail(&n->poll_list, &sd->poll_list);
			}
		}

		netpoll_poll_unlock(have);
	}
out:
	net_rps_action_and_irq_enable(sd);

#ifdef CONFIG_NET_DMA
	/*
	 * There may not be any more sk_buffs coming right now, so push
	 * any pending DMA copies to hardware
	 */
	dma_issue_pending_all();
#endif

	return;

softnet_break:
	sd->time_squeeze++;
	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
	goto out;
}
```

The approach is direct: walk poll_list under two limits set up beforehand, budget (how many packets this run may process in total) and time_limit (how long it may run); processing continues only while both have headroom, and interrupts are enabled while it runs. Each iteration takes the device at the head of the list; if the device is scheduled, which in practice means the NAPI_STATE_SCHED bit is set, its poll function is called. After a poll, if the device has not finished (it consumed its whole weight) it is moved to the tail of the list; otherwise it is removed.

A NAPI device's poll function likewise ends up calling __netif_receive_skb to hand packets to the protocol stack; we will not analyze that here, but see e100's poll function e100_poll if interested. For non-NAPI devices the poll function is process_backlog:

```c
static int process_backlog(struct napi_struct *napi, int quota)
{
	int work = 0;
	struct softnet_data *sd = container_of(napi, struct softnet_data, backlog);

#ifdef CONFIG_RPS
	/* Check if we have pending ipi, its better to send them now,
	 * not waiting net_rx_action() end.
	 */
	if (sd->rps_ipi_list) {
		local_irq_disable();
		net_rps_action_and_irq_enable(sd);
	}
#endif
	napi->weight = weight_p;
	local_irq_disable();
	while (work < quota) {
		struct sk_buff *skb;
		unsigned int qlen;

		/* Two queues are involved, process_queue and input_pkt_queue:
		 * arriving packets fill input_pkt_queue, while processing
		 * consumes process_queue. By this logic process_queue must be
		 * empty on the first pass, so input_pkt_queue is checked; if
		 * it is non-empty, its packets are spliced onto process_queue
		 * and processing continues, reducing lock contention. */
		while ((skb = __skb_dequeue(&sd->process_queue))) {
			local_irq_enable();
			/* hand the packet to the protocol stack */
			__netif_receive_skb(skb);
			local_irq_disable();
			input_queue_head_incr(sd);
			if (++work >= quota) {
				local_irq_enable();
				return work;
			}
		}

		rps_lock(sd);
		qlen = skb_queue_len(&sd->input_pkt_queue);
		if (qlen)
			skb_queue_splice_tail_init(&sd->input_pkt_queue,
						   &sd->process_queue);

		if (qlen < quota - work) {
			/*
			 * Inline a custom version of __napi_complete().
			 * only current cpu owns and manipulates this napi,
			 * and NAPI_STATE_SCHED is the only possible flag set on backlog.
			 * we can use a plain write instead of clear_bit(),
			 * and we dont need an smp_mb() memory barrier.
			 */
			list_del(&napi->poll_list);
			napi->state = 0;

			quota = work + qlen;
		}
		rps_unlock(sd);
	}
	local_irq_enable();
	return work;
}
```

The function is fairly simple. Note that each call carries a quota: at most quota packets may be processed per call, and once the quota is spent the function must return even if packets remain, to keep CPU usage fair. The work happens in a while loop guarded by work < quota: skbs are dequeued from process_queue and handed to __netif_receive_skb, with work incremented each time; once ++work >= quota the function returns. If quota remains but process_queue is drained, input_pkt_queue is checked, because interrupts are enabled during processing and new packets may have arrived in the meantime; if the queue is non-empty, its packets are spliced onto process_queue with skb_queue_splice_tail_init. If the remaining quota suffices to finish those packets, the virtual device is removed from the poll list. One thing puzzles me here: why is quota adjusted at the end (quota = work + qlen) when the remaining quota is already enough for this data? In any case, by this flow the code simply moves packets between the two queues and then processes them.

To fix LVS ksoftirqd CPU usage hitting 100% and the NIC softirq packet loss it caused, my colleagues and I dug through a great deal of material to analyze the problem. Special thanks to the Meituan tech team, whose write-up helped us quickly organize our tuning ideas and settle on how to rebuild our RPS/RFS NIC multi-queue optimization script. I believe this is a problem many people will run into; the analysis and solution in this article are not necessarily optimal, and I welcome readers to share their own fixes.
2019-07-03 - first draft
Read the original - https://wsgzao.github.io/post/rps/
Further reading
Redis 高负载下的中断优化 - https://tech.meituan.com/2018/03/16/redis-high-concurrency-optimization.html
The problem we hit was an unplanned incident. The symptom: one product's online-user count suddenly dropped, and at the same time the LVS master fired a CPU high-load alert. Inspection showed the node's NICs repeatedly disconnecting, reconnecting, and dropping packets; an emergency failover to the LVS slave hit the same problem. After ruling out abnormal traffic and external attack, we switched DNS to point at the Nginx real servers behind the LVS, and service gradually recovered.
The post-mortem root cause: our RPS tuning script had not executed successfully during system initialization. The script dates back to the early DBA team, who had once seen high CPU load cause NIC problems; it has been handed down ever since, though nobody remembers any more why it was added. That it silently failed on most servers and went unnoticed clearly means our post-checks were not done properly. In the early days everyone operated with ad-hoc Bash scripts and no dedicated team owned them, so drift was easy; fortunately initialization and verification can now be driven by Ansible, which has relieved some of the operations burden.
While searching Google for background, we found quite a few people hitting similar problems. For example, one article on lvs/irq describes:
LVS performance problems: once softirqs saturate a single CPU core, processing hits its ceiling.
While exchanging notes with engineers at Huawei, they shared a diagram of how RSS relates to RPS; the content below also draws on the Meituan tech team's analysis.
Our situation: short of spare server capacity, we had temporarily mixed external user traffic and internal Codis Cache Cluster traffic on the same LVS. Although overall CPU and traffic pressure did not look high, the processing load may have concentrated on exactly the cores serving the external-facing Bond1 NIC, ultimately triggering the ksoftirqd softirq storm, while the internal Bond0 NIC recorded no packet loss at all. We did have irqbalance enabled, but, perhaps affected by cpupower performance and NUMA, it failed to prevent the incident. The final fix was mainly to enable RPS and RFS by hand, roughly as follows:
This document describes a set of complementary techniques in the Linux
networking stack to increase parallelism and improve performance for
multi-processor systems.
The following technologies are described:
https://www.kernel.org/doc/Documentation/networking/scaling.txt
RECEIVE PACKET STEERING (RPS)
Receive Packet Steering (RPS) is similar to RSS in that it is used to direct packets to specific CPUs for processing. However, RPS is implemented at the software level, and helps to prevent the hardware queue of a single network interface card from becoming a bottleneck in network traffic.
RPS has several advantages over hardware-based RSS:
RPS is configured per network device and receive queue, in the /sys/class/net/*device*/queues/*rx-queue*/rps_cpus file, where device is the name of the network device (such as eth0 ) and rx-queue is the name of the appropriate receive queue (such as rx-0 ).
The default value of the rps_cpus file is zero. This disables RPS, so the CPU that handles the network interrupt also processes the packet.
To enable RPS, configure the appropriate rps_cpus file with the CPUs that should process packets from the specified network device and receive queue.
The rps_cpus files use comma-delimited CPU bitmaps. Therefore, to allow a CPU to handle interrupts for the receive queue on an interface, set the value of their positions in the bitmap to 1. For example, to handle interrupts with CPUs 0, 1, 2, and 3, set the value of rps_cpus to 00001111 (1+2+4+8), or f (the hexadecimal value for 15).
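The bitmap arithmetic above can be sketched as a small helper. The function name is my own, not part of any standard tool, and this sketch only covers CPU ids that fit in a single word; hosts with more than 32 CPUs use comma-separated 32-bit groups, which it does not handle.

```shell
# cpus_to_mask: build the hex bitmap that rps_cpus expects from a list
# of CPU ids, by OR-ing together 1 << cpu for each id given.
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$((mask | (1 << cpu)))
    done
    printf '%x\n' "$mask"
}

# cpus_to_mask 0 1 2 3 prints "f" (1 + 2 + 4 + 8 = 15)
```

The result is what you would write into the rps_cpus file for the chosen queue.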
For network devices with single transmit queues, best performance can be achieved by configuring RPS to use CPUs in the same memory domain. On non-NUMA systems, this means that all available CPUs can be used. If the network interrupt rate is extremely high, excluding the CPU that handles network interrupts may also improve performance.
For network devices with multiple queues, there is typically no benefit to configuring both RPS and RSS, as RSS is configured to map a CPU to each receive queue by default. However, RPS may still be beneficial if there are fewer hardware queues than CPUs, and RPS is configured to use CPUs in the same memory domain.
RECEIVE FLOW STEERING (RFS)
Receive Flow Steering (RFS) extends RPS behavior to increase the CPU cache hit rate and thereby reduce network latency. Where RPS forwards packets based solely on queue length, RFS uses the RPS backend to calculate the most appropriate CPU, then forwards packets based on the location of the application consuming the packet. This increases CPU cache efficiency.
RFS is disabled by default. To enable RFS, you must edit two files:
/proc/sys/net/core/rps_sock_flow_entries
Set the value of this file to the maximum expected number of concurrently active connections. We recommend a value of 32768 for moderate server loads. All values entered are rounded up to the nearest power of 2 in practice.
/sys/class/net/*device*/queues/*rx-queue*/rps_flow_cnt
Replace device with the name of the network device you wish to configure (for example, eth0 ), and rx-queue with the receive queue you wish to configure (for example, rx-0 ).
Set the value of this file to the value of rps_sock_flow_entries divided by N, where N is the number of receive queues on the device. For example, if rps_sock_flow_entries is set to 32768 and there are 16 configured receive queues, rps_flow_cnt should be set to 2048. For single-queue devices, the value of rps_flow_cnt is the same as the value of rps_sock_flow_entries.
Data received from a single sender is not sent to more than one CPU. If the amount of data received from a single sender is greater than a single CPU can handle, configure a larger frame size to reduce the number of interrupts and therefore the amount of processing work for the CPU. Alternatively, consider NIC offload options or faster CPUs.
Consider using numactl or taskset in conjunction with RFS to pin applications to specific cores, sockets, or NUMA nodes. This can help prevent packets from being processed out of order.
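Putting the RPS and RFS settings above together, a minimal enable script might look like the sketch below. The interface name, CPU mask, and flow-entry count in the usage line are illustrative only, and the sysfs/procfs roots are parameters purely so the function can be exercised outside a live system; on real hardware it must run as root against /sys and /proc.

```shell
# enable_rps_rfs SYSFS_ROOT PROCFS_ROOT IFACE CPU_MASK SOCK_FLOW_ENTRIES
# Writes rps_cpus and rps_flow_cnt for every rx queue of IFACE, with
# rps_flow_cnt = SOCK_FLOW_ENTRIES / number-of-rx-queues, following the
# guidance quoted above.
enable_rps_rfs() {
    sysfs=$1 procfs=$2 iface=$3 mask=$4 entries=$5

    # count the device's receive queues
    nq=$(ls -d "$sysfs/class/net/$iface/queues/"rx-* | wc -l)
    flow_cnt=$((entries / nq))

    echo "$entries" > "$procfs/sys/net/core/rps_sock_flow_entries"
    for q in "$sysfs/class/net/$iface/queues/"rx-*; do
        echo "$mask"     > "$q/rps_cpus"
        echo "$flow_cnt" > "$q/rps_flow_cnt"
    done
}

# On a real machine (as root): enable_rps_rfs /sys /proc eth0 f 32768
```

Note the settings do not survive a reboot; they need to be reapplied by an init script, which is exactly the script whose silent failure caused our incident.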
Receiving a packet is a complex process involving many low-level details, but it takes roughly the following steps:
After the NIC receives a packet, the data must first be brought into the kernel; the bridge between the two is the rx ring buffer, a region shared by the NIC and the driver. The rx ring buffer does not actually hold packet data; it holds descriptors that point to where the packets are really stored.
When the driver cannot keep pace with the NIC's receive rate and fails to allocate buffers in time, packets the NIC receives cannot be written into sk_buff promptly and back up; once the NIC's internal buffer fills, it starts discarding data, causing packet loss. These drops are counted as rx_fifo_errors, visible as growth of the fifo field in /proc/net/dev and of the overruns counter in ifconfig.
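To watch for exactly this kind of drop, the fifo column of /proc/net/dev can be pulled out with a one-line filter. This is a sketch: the helper name is mine, and it assumes the usual /proc/net/dev layout where the receive-side fifo counter is the fifth number after the interface name.

```shell
# rx_fifo_count IFACE: read a /proc/net/dev-style table on stdin and print
# the receive fifo (overrun) counter for IFACE. The receive columns after
# "iface:" are: bytes packets errs drop fifo frame compressed multicast,
# followed by the transmit side.
rx_fifo_count() {
    awk -v ifc="$1:" '$1 == ifc { print $6 }'
}

# Live usage: rx_fifo_count eth0 < /proc/net/dev
```

A value that keeps climbing between samples means the NIC is overrunning its FIFO and dropping packets.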
At this point the packet has been moved into an sk_buff. As noted above, this is a buffer the driver allocated in memory and filled via DMA; since DMA writes the data to memory without involving the CPU, the kernel does not yet know that new data has arrived. How is it told? By an interrupt: the interrupt notifies the kernel that new data has come in and needs further processing.
Speaking of interrupts, there are hard interrupts and soft interrupts (softirqs); briefly, their difference:
After the NIC DMA-copies a packet into the kernel buffer sk_buff, it immediately raises a hardware interrupt. The CPU takes it and first runs the top half (the NIC's interrupt handler is part of the NIC driver), which then raises a softirq; the bottom half then consumes the data in sk_buff and hands it to the kernel protocol stack.
Interrupts give fast, timely response to the NIC, but under heavy traffic they arrive in floods and the CPU spends most of its time servicing them, which is very inefficient. To solve this, modern kernels and drivers use NAPI (New API), which can be loosely understood as interrupt + polling: under load, after a single interrupt the kernel polls and receives a batch of packets before returning, instead of taking an interrupt per packet.
An interrupt request (IRQ) is a signal, asynchronous from peripheral hardware (relative to the CPU and memory) or synchronous from software, that triggers the corresponding hardware or software handling.
To observe per-CPU usage and softirq load:
1. top, then press the 1 key to show each CPU separately
2. mpstat -P ALL 2
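Besides CPU usage, /proc/softirqs shows directly whether NET_RX work is skewed onto one core. A small filter sketch (the helper name is my own):

```shell
# net_rx_per_cpu: read a /proc/softirqs-style table on stdin and print the
# NET_RX counts, one value per CPU column, space separated.
net_rx_per_cpu() {
    awk '$1 == "NET_RX:" {
        for (i = 2; i <= NF; i++)
            printf "%s%s", $i, (i < NF ? " " : "\n")
    }'
}

# Live usage: net_rx_per_cpu < /proc/softirqs
```

If one column grows far faster than the others between two samples, receive softirqs are concentrating on that CPU, which is the ksoftirqd symptom described above.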
mpstat usage and output-field reference - https://wsgzao.github.io/post/mpstat/