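Background

Recently we have been working on a remote-control feature. In our design, the remote mouse cursor data stream and the desktop video stream are transmitted separately, so the viewing side has to merge the cursor image and the desktop image when rendering. The merge could be done at the data level: FFmpeg can combine two video streams into one. That approach probably suits live streaming, but video conferencing has strict real-time requirements, and users tolerate very little latency during remote control. For the sake of the experience, we therefore merge the two streams on the client at render time.

This leads to the topic of this article: on iOS, how do we use Metal to render two video streams into a single view?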

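1. How Metal renders graphics

Let's first look at how graphics get from the GPU to the screen. [Figure: simplified view of the CPU-to-GPU rendering flow] During rendering, the CPU sends rendering commands to the GPU and copies the data into GPU memory, and the GPU renders to the screen. In other words, the CPU assembles the commands while the GPU does the actual rendering. Metal wraps the GPU's graphics interface so that developers can conveniently use the GPU's capabilities. The overall process is similar to that of most graphics frameworks, such as OpenGL and Vulkan. Let's walk through it.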

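The GPU commands are configured in code, and this configuration runs on the CPU; see the official documentation if you are interested. [Figure: the kinds of command encoders available in Metal] The Blit Command Encoder handles data transfer: it lets developers efficiently copy data between buffers and textures on the GPU, generate mipmaps, and do similar resource-level work. The Compute Command Encoder encodes commands that run compute kernels on the GPU, for general-purpose work such as image processing, machine learning, or any parallelizable task. For graphics rendering, what we mainly care about is the Render Command Encoder, which needs vertex data, texture data, sampling information, and depth information configured before the work is sent to the GPU. As the encoder list shows, besides graphics, Metal can also be used for machine-learning related computation.
Next, the configuration of a render command. [Figure: creation of a render command] The Vertex Function and the Fragment Function are the programmable core of the GPU: by supplying a vertex shader and a fragment shader, you customize the rendering logic. Metal compiles and optimizes these functions and configures them on the GPU.
Finally, Metal executes the rendering. [Figure: stages of the render pipeline] These are the stages the GPU goes through to produce the final image, commonly called the render pipeline. The pipeline takes the data and custom rendering logic configured above, computes the pixels that are covered, runs the depth and stencil tests (essentially deciding from depth and stencil information which fragments are kept or discarded, before overlapping content is blended), writes the result to the framebuffer, and finally presents it at the screen's refresh rate.

That is the basic description of Metal's render pipeline; the official sample code is worth downloading if you are interested. From the configuration steps above, developing Metal rendering comes down to two things: configuring the GPU commands on the CPU, and writing custom shaders for the programmable pipeline stages so that the GPU performs customized rendering.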

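2. WebRTC's rendering architecture

Before writing the merging logic, let's look at how WebRTC, the framework we use for audio/video conferencing, renders with Metal. [Figure: class diagram of WebRTC's Metal rendering] The structure is fairly simple: decoded video frames are passed to RTCMTLVideoView, which uses a DisplayLink timer to set the rendering frame rate and then dispatches each frame to a different renderer according to the frame's pixel format.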
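
The dispatch step can be pictured with a small plain C++ sketch. This is not WebRTC code: the `FrameFormat` enum and `rendererFor` are hypothetical names made up for illustration, and the format-to-renderer mapping is an approximation of what the renderer classes in WebRTC's Metal directory (RTCMTLNV12Renderer, RTCMTLRGBRenderer, RTCMTLI420Renderer) are used for.

```cpp
#include <cassert>
#include <string>

// Hypothetical pixel-format tag; WebRTC inspects the frame buffer type instead.
enum class FrameFormat { NV12, BGRA, ARGB, I420 };

// Returns the name of the renderer class that would handle the format.
// The mapping is an assumption for illustration, not copied from WebRTC.
std::string rendererFor(FrameFormat format) {
    switch (format) {
        case FrameFormat::NV12: return "RTCMTLNV12Renderer";  // Y plane + interleaved CbCr plane
        case FrameFormat::BGRA:
        case FrameFormat::ARGB: return "RTCMTLRGBRenderer";   // single packed RGB texture
        case FrameFormat::I420: return "RTCMTLI420Renderer";  // three separate planes
    }
    assert(false && "unhandled frame format");
    return "";
}
```

The point of the split is that each format needs a different number of textures and a different fragment shader, so each renderer owns its own pipeline state.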

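Our cursor merging process can be described by the following flow chart. [Figure: flow chart of the cursor merging process]

The cursor logic in the flow chart is also simple: when rendering a video frame, if a cursor frame has arrived for the corresponding point in time, it is merged in through Metal and rendered to the screen. Below we look at the Metal rendering code in detail.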
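
The matching of cursor frames to video frames can be sketched as follows. This is a minimal sketch under assumptions, not our production code: `CursorFrame`, `selectCursorFrame`, and the rule "use the newest cursor frame whose timestamp is not later than the video frame" are all hypothetical choices made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical cursor frame: only the field needed for matching.
struct CursorFrame {
    std::int64_t timestampUs;  // capture time in microseconds
};

// Returns the index of the newest cursor frame captured at or before the
// video frame, or -1 when no cursor frame is available yet.
// Assumes `frames` is sorted by ascending timestamp.
int selectCursorFrame(const std::vector<CursorFrame>& frames,
                      std::int64_t videoTimestampUs) {
    int best = -1;
    for (std::size_t i = 0; i < frames.size(); ++i) {
        if (i > 0) {
            assert(frames[i - 1].timestampUs <= frames[i].timestampUs);
        }
        if (frames[i].timestampUs > videoTimestampUs) {
            break;  // this and all later cursor frames are in the future
        }
        best = static_cast<int>(i);
    }
    return best;
}
```

When the function returns -1, the renderer would simply draw the video frame without blending.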

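3. Metal rendering sample code

3.1 Configuring the render pipeline

To develop Metal rendering code, first look at how the CPU configures the render pipeline. Here is a sample.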

- (nonnull instancetype)initWithMetalKitView:(nonnull MTKView *)mtkView
{
    self = [super init];
    if(self)
    {
        _device = mtkView.device;

        NSURL *imageFileLocation = [[NSBundle mainBundle] URLForResource:@"background"
                                                           withExtension:@"tga"];
        
        _texture = [self loadTextureUsingAAPLImage: imageFileLocation];

        // Set up a simple MTLBuffer with vertices which include texture coordinates
        static const AAPLVertex quadVertices[] =
        {
            // Pixel positions, Texture coordinates
            { {  250,  -250 },  { 1.f, 1.f } },
            { { -250,  -250 },  { 0.f, 1.f } },
            { { -250,   250 },  { 0.f, 0.f } },

            { {  250,  -250 },  { 1.f, 1.f } },
            { { -250,   250 },  { 0.f, 0.f } },
            { {  250,   250 },  { 1.f, 0.f } },
        };

        // Create a vertex buffer, and initialize it with the quadVertices array
        _vertices = [_device newBufferWithBytes:quadVertices
                                         length:sizeof(quadVertices)
                                        options:MTLResourceStorageModeShared];

        // Calculate the number of vertices by dividing the byte length by the size of each vertex
        _numVertices = sizeof(quadVertices) / sizeof(AAPLVertex);

        // Create the render pipeline.
        // Load the shaders from the default library
        id<MTLLibrary> defaultLibrary = [_device newDefaultLibrary];
        id<MTLFunction> vertexFunction = [defaultLibrary newFunctionWithName:@"vertexShader"];
        id<MTLFunction> fragmentFunction = [defaultLibrary newFunctionWithName:@"samplingShader"];

        // Set up a descriptor for creating a pipeline state object
        MTLRenderPipelineDescriptor *pipelineStateDescriptor = [[MTLRenderPipelineDescriptor alloc] init];
        pipelineStateDescriptor.label = @"Texturing Pipeline";
        pipelineStateDescriptor.vertexFunction = vertexFunction;
        pipelineStateDescriptor.fragmentFunction = fragmentFunction;
        pipelineStateDescriptor.colorAttachments[0].pixelFormat = mtkView.colorPixelFormat;

        NSError *error = NULL;
        _pipelineState = [_device newRenderPipelineStateWithDescriptor:pipelineStateDescriptor
                                                                 error:&error];

        NSAssert(_pipelineState, @"Failed to create pipeline state: %@", error);

        _commandQueue = [_device newCommandQueue];
    }

    return self;
}

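As the code above shows, the main work is configuring the vertex array and the shader functions. The following figure shows how the vertex coordinates are defined. [Figure: vertex coordinates of the quad] The center point is (0,0). Our video frames only use 2D coordinates; for 3D, the coordinate system looks like this. [Figure: 3D coordinate system] Metal rendering uses a left-handed coordinate system; in the right-handed system conventionally used with OpenGL, the Z axis points the other way. Personally, I find the left-handed system more intuitive.

Note the primitive type used when drawing these vertices: MTLPrimitiveTypeTriangle. MTLPrimitiveTypeTriangleStrip and MTLPrimitiveTypeTriangle are enum values in the Metal framework that specify how vertices are assembled into primitives. The main difference between them is how the vertex data is consumed to form triangles.

MTLPrimitiveTypeTriangle represents independent triangles. When drawing, every three vertices form one triangle, so N triangles require 3 * N vertices. No vertices are shared between triangles; every triangle is independent.
MTLPrimitiveTypeTriangleStrip represents a strip of connected triangles that share vertices. When drawing, the first triangle is formed by the first three vertices, and each new vertex then forms a new triangle together with the previous two. Drawing N triangles this way needs only N + 2 vertices, which is more efficient because less vertex data is transferred.

In short, MTLPrimitiveTypeTriangle is for independent triangles, while MTLPrimitiveTypeTriangleStrip draws a series of connected triangles efficiently by sharing vertices. Which one to choose depends on your needs and data layout. The sample in this section uses the independent-triangle mode; as shown later, our real project uses MTLPrimitiveTypeTriangleStrip, because a video frame is a simple 2D plane and four vertices are enough to represent a rectangle.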
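
The vertex counts and the way a strip expands into triangles can be checked with a short plain C++ sketch. The function names are hypothetical, and swapping the first two indices of every odd triangle to keep a consistent winding is the common strip convention; consult the Metal documentation if winding order matters for your pipeline (for example with face culling enabled).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Vertices needed to draw n triangles with each primitive type.
std::size_t triangleListVertexCount(std::size_t n) { return 3 * n; }
std::size_t triangleStripVertexCount(std::size_t n) { return n == 0 ? 0 : n + 2; }

// Expands a strip of `vertexCount` vertices into the index triples of the
// triangles it describes. Odd triangles swap their first two indices so that
// every triangle keeps the same winding order.
std::vector<std::array<int, 3> > expandStrip(int vertexCount) {
    assert(vertexCount >= 0);
    std::vector<std::array<int, 3> > triangles;
    for (int i = 0; i + 2 < vertexCount; ++i) {
        std::array<int, 3> t;
        if (i % 2 == 0) {
            t[0] = i;     t[1] = i + 1; t[2] = i + 2;
        } else {
            t[0] = i + 1; t[1] = i;     t[2] = i + 2;
        }
        triangles.push_back(t);
    }
    return triangles;
}
```

For a rectangle, the list needs 6 vertices while the strip needs 4, and the strip's two triangles share the diagonal edge between vertices 1 and 2.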

3.2 Submitting render commands

The following sample shows how render commands are submitted to the GPU.

- (void)drawInMTKView:(nonnull MTKView *)view
{
    // Create a new command buffer for each render pass to the current drawable
    id<MTLCommandBuffer> commandBuffer = [_commandQueue commandBuffer];
    commandBuffer.label = @"MyCommand";

    // Obtain a renderPassDescriptor generated from the view's drawable textures
    MTLRenderPassDescriptor *renderPassDescriptor = view.currentRenderPassDescriptor;

    if(renderPassDescriptor != nil)
    {
        _viewportSize.x = view.bounds.size.width;
        _viewportSize.y = view.bounds.size.height;
        id<MTLRenderCommandEncoder> renderEncoder =
        [commandBuffer renderCommandEncoderWithDescriptor:renderPassDescriptor];
        renderEncoder.label = @"MyRenderEncoder";

        // Set the region of the drawable to draw into.
        [renderEncoder setViewport:(MTLViewport){0.0, 0.0, _viewportSize.x, _viewportSize.y, -1.0, 1.0 }];

        [renderEncoder setRenderPipelineState:_pipelineState];

        [renderEncoder setVertexBuffer:_vertices
                                offset:0
                              atIndex:0];

        [renderEncoder setVertexBytes:&_viewportSize
                               length:sizeof(_viewportSize)
                              atIndex:AAPLVertexInputIndexViewportSize];

        // Set the texture object. The AAPLTextureIndexBaseColor enum value corresponds
        // to the texture argument of the 'samplingShader' function because its
        // texture attribute qualifier uses the same index.
        [renderEncoder setFragmentTexture:_texture
                                  atIndex:AAPLTextureIndexBaseColor];

        // Draw the two triangles that make up the quad.
        [renderEncoder drawPrimitives:MTLPrimitiveTypeTriangle
                          vertexStart:0
                          vertexCount:_numVertices];

        [renderEncoder endEncoding];

        // Schedule a present once the framebuffer is complete using the current drawable
        [commandBuffer presentDrawable:view.currentDrawable];
    }

    // Finalize rendering here & push the command buffer to the GPU
    [commandBuffer commit];
}

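[renderEncoder drawPrimitives:MTLPrimitiveTypeTriangle vertexStart:0 vertexCount:_numVertices]; Here we can see that the primitive type submitted to the GPU is MTLPrimitiveTypeTriangle, so a square needs two triangles. One more point to note: because the vertex positions we pass are actual pixel values, we also have to pass _viewportSize. The positions used after the vertex shader are relative (normalized) positions, not pixel positions, which mainly makes it easier to compute texture positions during texture mapping. We come back to this later.

3.3 Passing data

A memory transfer such as [texture replaceRegion:region mipmapLevel:0 withBytes:image.data.bytes bytesPerRow:bytesPerRow]; supplies the data that the fragment shader reads. In essence, the CPU copies data from system memory into memory the GPU can access. Pay attention to the color format here: the pixel format must be agreed on in advance. MTLPixelFormatBGRA8Unorm denotes 32-bit texture data, with four 8-bit channels in blue, green, red, alpha order. If the format does not match, the GPU cannot interpret the copied data correctly.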

- (id<MTLTexture>)loadTextureUsingAAPLImage: (NSURL *) url {
    
    AAPLImage * image = [[AAPLImage alloc] initWithTGAFileAtLocation:url];
    NSAssert(image, @"Failed to create the image from %@", url.absoluteString);
    MTLTextureDescriptor *textureDescriptor = [[MTLTextureDescriptor alloc] init];
    // Indicate that each pixel has a blue, green, red, and alpha channel, where each channel is
    // an 8-bit unsigned normalized value (i.e. 0 maps to 0.0 and 255 maps to 1.0)
    textureDescriptor.pixelFormat = MTLPixelFormatBGRA8Unorm;
    
    // Set the pixel dimensions of the texture
    textureDescriptor.width = image.width;
    textureDescriptor.height = image.height;
    
    // Create the texture from the device by using the descriptor
    id<MTLTexture> texture = [_device newTextureWithDescriptor:textureDescriptor];
    
    // Calculate the number of bytes per row in the image.
    NSUInteger bytesPerRow = 4 * image.width;
    
    MTLRegion region = {
        { 0, 0, 0 },                   // MTLOrigin
        {image.width, image.height, 1} // MTLSize
    };
    
    // Copy the bytes from the data object into the texture
    [texture replaceRegion:region
                mipmapLevel:0
                  withBytes:image.data.bytes
                bytesPerRow:bytesPerRow];
    return texture;
}

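Above is the texture setup code. A word on texture coordinates: a texture is essentially an image mapped onto geometry, and since images are 2D, texture coordinates are generally defined as float2 vectors. float2 is a SIMD (single instruction, multiple data) type representing a 2D vector (x, y). [Figure: definition of texture coordinates] Texture coordinates are relative (normalized) positions, which makes them easy to combine with vertex positions during shading; the advantage of relative positions shows especially when scaling and translation are involved.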
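
To make the relationship between normalized texture coordinates, texels, and the BGRA byte layout concrete, here is a small CPU-side sketch in plain C++. It is an illustration under assumptions, not Metal API: all names are hypothetical, the image is assumed tightly packed (no row padding), and the lookup approximates nearest-neighbor sampling with clamp-to-edge addressing and a top-left origin.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

const std::size_t kBytesPerPixelBGRA8 = 4;  // four 8-bit channels per pixel

// Row size of a tightly packed BGRA8 image, as passed to bytesPerRow.
std::size_t bytesPerRowBGRA8(std::size_t width) { return kBytesPerPixelBGRA8 * width; }

// Converts tightly packed RGBA8 pixels to BGRA8 by swapping R and B.
std::vector<std::uint8_t> rgbaToBgra(const std::vector<std::uint8_t>& rgba) {
    assert(rgba.size() % kBytesPerPixelBGRA8 == 0);
    std::vector<std::uint8_t> bgra(rgba);
    for (std::size_t i = 0; i + 3 < bgra.size(); i += kBytesPerPixelBGRA8) {
        std::swap(bgra[i], bgra[i + 2]);
    }
    return bgra;
}

struct Texel { int x; int y; };

// Maps a normalized coordinate (u, v) in [0, 1] to a texel index.
Texel texelForCoordinate(float u, float v, int width, int height) {
    assert(width > 0 && height > 0);
    Texel t;
    t.x = std::max(0, std::min(static_cast<int>(u * static_cast<float>(width)), width - 1));
    t.y = std::max(0, std::min(static_cast<int>(v * static_cast<float>(height)), height - 1));
    return t;
}

// Byte offset of a texel inside the tightly packed BGRA8 buffer.
std::size_t byteOffsetForTexel(Texel t, std::size_t width) {
    return static_cast<std::size_t>(t.y) * bytesPerRowBGRA8(width) +
           static_cast<std::size_t>(t.x) * kBytesPerPixelBGRA8;
}
```

The same coordinate (0.5, 0.5) lands on texel (2, 2) of a 4x4 image and on texel (4, 4) of an 8x8 image, which is why relative coordinates survive scaling unchanged.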

3.4 Writing the shaders

Next come the vertex shader and the fragment shader.

Let's look at how the vertex shader works. Here is its code.

typedef struct {
      float2 position;
      float2 texcoord;
} Vertex;

typedef struct {
      float4 position[[position]];
      float2 texcoord;
} RasterizerData;

 vertex RasterizerData vertexPassthrough(const device Vertex * verticies[[buffer(0)]],
                                   // Assumed binding: a uint2 set at AAPLVertexInputIndexViewportSize
                                   // (index 1 in Apple's sample) on the CPU side.
                                   constant uint2 *viewportSizePointer [[buffer(1)]],
                                   unsigned int vid[[vertex_id]]) {
      RasterizerData out;
      const device Vertex &v = verticies[vid];

      // Vertex positions are given in pixels, relative to the center of the view.
      float2 pixelSpacePosition = v.position;

      // Get the viewport size and cast to float.
      float2 viewportSize = float2(*viewportSizePointer);

       // To convert from positions in pixel space to positions in clip-space,
       //  divide the pixel coordinates by half the size of the viewport.
       // Z is set to 0.0 and w to 1.0 because this is 2D sample.
      out.position = float4(0.0, 0.0, 0.0, 1.0);
      out.position.xy = pixelSpacePosition / (viewportSize / 2.0);
      out.texcoord = v.texcoord;

      return out;
 }

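First, a few concepts. float4 position [[position]]; uses attribute qualifier syntax: as the name suggests, the attribute specifies how the member is used. [[position]] marks the clip-space position that the vertex shader outputs for each vertex. How should we understand the clip-space position? The GPU assembles the output vertices into triangles, clips them against the viewport, and rasterizes them, interpolating the per-vertex outputs for every pixel the triangle covers. In the vertex function vertexPassthrough, the output structure RasterizerData out; is filled from const device Vertex &v = verticies[vid];. Here vid is the index of the vertex being processed ([[vertex_id]]), so the function runs once per vertex (six times for our quad), not once per screen pixel. It is the rasterizer that then walks the pixels covered by each triangle, for example across a view 320 px wide, and produces the interpolated values. Note also that the pixel positions are converted to relative positions here, for convenient later use.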
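
The pixel-to-clip-space conversion done in the vertex shader can be reproduced on the CPU to check the numbers. This is a plain C++ sketch of the same arithmetic; `Float2` and `pixelToClip` are hypothetical names, not Metal types.

```cpp
#include <cassert>

struct Float2 { float x; float y; };

// Converts a position in pixels (origin at the view's center) to clip space,
// where the visible range on each axis is [-1, 1]: divide by half the
// viewport size, exactly as the vertex shader does.
Float2 pixelToClip(Float2 pixel, Float2 viewportSize) {
    assert(viewportSize.x > 0.0f && viewportSize.y > 0.0f);
    Float2 clip;
    clip.x = pixel.x / (viewportSize.x / 2.0f);
    clip.y = pixel.y / (viewportSize.y / 2.0f);
    return clip;
}
```

With a 1000 x 1000 viewport, the quad corner at (250, -250) maps to (0.5, -0.5), so the 500 px quad covers the middle half of the view on each axis. With a 500 x 500 viewport the same corner maps to (1, -1) and the quad fills the view.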

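The output of the vertex shader is the input of the fragment shader. RasterizerData in [[stage_in]] indicates that the argument carries the vertex shader's output after the rasterizer has interpolated it, so that the fragment shader can assign a color to each fragment.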

fragment float4
samplingShader(RasterizerData in [[stage_in]],
               texture2d<half> colorTexture [[ texture(0) ]])
{
    constexpr sampler textureSampler (mag_filter::linear,
                                      min_filter::linear);

    // Sample the texture to obtain a color
    const half4 colorSample = colorTexture.sample(textureSampler, in.texcoord);

    float4 out = float4(colorSample);
    // return the color of the texture
    return out;
}

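The attribute qualifier in texture2d<half> colorTexture [[ texture(0) ]] refers to the index at which the texture was bound to the fragment stage when we configured the pipeline, for example: [renderEncoder setFragmentTexture:_texture atIndex:0]; This is how the shader obtains the texture we configured. It then reads the color value from the texture and renders it at the corresponding position. At this point the whole rendering process is complete.

4. The video stream blending process

With the basic rendering flow sorted out, let's see how the cursor texture is rendered onto the video. The core is setting the cursor's vertex coordinates and texture data, and blending the cursor texture with the video texture in the shader. First, here is WebRTC's definition of the video frame vertices.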

float values[16] = {
-coordX, -coordY, cropLeft, cropBottom,
coordX, -coordY, cropRight, cropBottom,
-coordX,  coordY, cropLeft, cropTop,
coordX,  coordY, cropRight, cropTop};

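Here coordX and coordY are the coordinates of the cropped video frame; in WebRTC these vertices are all expressed as relative positions. So when setting the cursor's position we also use relative positions, as follows: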
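
The layout of that 16-float array can be spelled out with a plain C++ sketch: four vertices in triangle-strip order, each holding a 2D position followed by a 2D texture coordinate. `makeQuadVertices` and `vertexAt` are hypothetical helpers for illustration, not WebRTC functions.

```cpp
#include <array>
#include <cassert>

// Builds the interleaved vertex array: {x, y, u, v} for each of 4 vertices,
// in the same order as the WebRTC snippet above
// (bottom-left, bottom-right, top-left, top-right).
std::array<float, 16> makeQuadVertices(float coordX, float coordY,
                                       float cropLeft, float cropRight,
                                       float cropTop, float cropBottom) {
    std::array<float, 16> v = {{
        -coordX, -coordY, cropLeft,  cropBottom,
         coordX, -coordY, cropRight, cropBottom,
        -coordX,  coordY, cropLeft,  cropTop,
         coordX,  coordY, cropRight, cropTop}};
    return v;
}

// Returns a pointer to the 4 floats {x, y, u, v} of vertex `index`.
const float* vertexAt(const std::array<float, 16>& vertices, int index) {
    assert(index >= 0 && index < 4);
    return &vertices[static_cast<std::size_t>(index) * 4];
}
```

With coordX = coordY = 1 and a crop of (0, 1, 0, 1), the quad spans the whole clip space and samples the whole texture; the bottom vertices use v = 1 because texture coordinates start at the top-left corner.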

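- (BOOL)setupTexturesForMouseFrame:(nonnull RTCMouseCursorFrame *)incomingFrame {
    // width, height, mouseFrameBuffer and left/top/right/bottom are presumably
    // derived from incomingFrame; that extraction is omitted here.
    MTLTextureDescriptor *textureDescriptor = [MTLTextureDescriptor new];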
    textureDescriptor.textureType = MTLTextureType2D;
    textureDescriptor.width = width;
    textureDescriptor.height = height;
    textureDescriptor.pixelFormat = MTLPixelFormatBGRA8Unorm;
    textureDescriptor.usage = MTLTextureUsageShaderRead;
    _cursorTexture = [_device newTextureWithDescriptor:textureDescriptor];
    if (!_cursorTexture) {
      RTCLogError(@"[RND]MTLRender:%p Failed to create cursor texture", self);
      return NO;
    }
    // Upload the cursor pixels into the texture
    [_cursorTexture replaceRegion:MTLRegionMake2D(0, 0, width, height)
                      mipmapLevel:0
                        withBytes:[mouseFrameBuffer.rgbaData bytes]
                      bytesPerRow:mouseFrameBuffer.stride];
    
    // Relative (normalized) bounds of the cursor: left, top, right, bottom
    float blendRect[4] = {left,top,right,bottom};
    memcpy((float *)_cursorBlendRectBuffer.contents, blendRect, sizeof(blendRect));
    return YES;
}
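
The snippet above does not show how left, top, right and bottom are computed. One plausible way, sketched here in plain C++ under stated assumptions, is to divide the cursor's pixel rectangle by the video frame size. The names `BlendRect` and `normalizedCursorRect` are hypothetical, and the sketch assumes the cursor position is given in video-frame pixels with a top-left origin, matching the texture coordinate convention.

```cpp
#include <cassert>

struct BlendRect { float left; float top; float right; float bottom; };

// Converts the cursor's pixel rectangle (x, y, width, height) into
// coordinates relative to the video frame, each in [0, 1] when the cursor
// lies fully inside the frame.
BlendRect normalizedCursorRect(float x, float y, float width, float height,
                               float frameWidth, float frameHeight) {
    assert(frameWidth > 0.0f && frameHeight > 0.0f);
    BlendRect r;
    r.left = x / frameWidth;
    r.top = y / frameHeight;
    r.right = (x + width) / frameWidth;
    r.bottom = (y + height) / frameHeight;
    return r;
}
```

For example, a 64 x 64 cursor at (256, 128) in a 1024 x 512 frame gives (0.25, 0.25, 0.3125, 0.375). Because the values are relative, they stay valid when the view scales the video.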

Next is the blending fragment shader.

fragment half4 fragmentColorBlend(
        Varyings in[[stage_in]], texture2d<float, access::sample> textureY[[texture(0)]],
        texture2d<float, access::sample> textureCbCr[[texture(1)]],
        texture2d<float, access::sample> textureBlend[[texture(3)]],
        constant float4 &blendRect [[buffer(2)]], // (left, top, right, bottom)
        constant int &enableBlend [[buffer(3)]]) {
      constexpr sampler s(address::clamp_to_edge, filter::linear);
      constexpr sampler blendSampler(address::clamp_to_edge,
                                             filter::linear);
      float y;
      float2 uv;
      y = textureY.sample(s, in.texcoord).r;
      uv = textureCbCr.sample(s, in.texcoord).rg - float2(0.5, 0.5);

      float4 video = float4(y + 1.403 * uv.y, y - 0.344 * uv.x - 0.714 * uv.y, y + 1.770 * uv.x, 1.0);
      float4 out;
      if (enableBlend == 1 && in.texcoord.x >= blendRect.x && in.texcoord.x <= blendRect.z && in.texcoord.y >= blendRect.y && in.texcoord.y <= blendRect.w) {
        float factorW = blendRect.z - blendRect.x;
        float factorH = blendRect.w - blendRect.y;
        float2 blendTextureCoordinate = float2((in.texcoord.x - blendRect.x)/factorW,(in.texcoord.y - blendRect.y)/factorH);
        float4 blendResult = textureBlend.sample(blendSampler, blendTextureCoordinate);
        out = float4(mix(video.rgb, blendResult.rgb, blendResult.a), 1.0);
      } else {
        out = video;
      }
      return half4(out);
  }

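The blending approach is simple. We use a small trick: the cursor's relative rectangle is stored directly in a buffer bound to the fragment shader. When the fragment shader runs, it reads the cursor's left, top, right and bottom bounds from blendRect. If the texture coordinate of the current fragment falls inside the blendRect region, the shader samples the corresponding color from the cursor texture textureBlend and returns the blended color through vector arithmetic. Note that texture coordinates are 2D vectors, while sampled colors and the fragment shader's output are 4D vectors representing (r, g, b, a); pay attention to the vector types to avoid assignment errors.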
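
The per-fragment logic can be mirrored on the CPU for testing. The following plain C++ sketch reproduces the shader's arithmetic with the same coefficients; the type and function names are hypothetical, and `mixf` follows the definition of Metal's mix(x, y, a) = x + (y - x) * a.

```cpp
#include <cassert>

struct Float2 { float x; float y; };
struct Rgb { float r; float g; float b; };
struct BlendRect { float left; float top; float right; float bottom; };

// YCbCr to RGB with the shader's coefficients; y, cb, cr are in [0, 1].
Rgb yuvToRgb(float y, float cb, float cr) {
    const float u = cb - 0.5f;
    const float v = cr - 0.5f;
    Rgb out = {y + 1.403f * v, y - 0.344f * u - 0.714f * v, y + 1.770f * u};
    return out;
}

// True when the fragment's texture coordinate lies inside the cursor rect.
bool contains(const BlendRect& r, Float2 p) {
    return p.x >= r.left && p.x <= r.right && p.y >= r.top && p.y <= r.bottom;
}

// Remaps a video texture coordinate into the cursor texture's [0, 1] range.
Float2 toCursorCoordinate(const BlendRect& r, Float2 p) {
    assert(r.right > r.left && r.bottom > r.top);  // avoid division by zero
    Float2 out = {(p.x - r.left) / (r.right - r.left),
                  (p.y - r.top) / (r.bottom - r.top)};
    return out;
}

float mixf(float a, float b, float t) { return a + (b - a) * t; }

// Blends the cursor color over the video color using the cursor's alpha.
Rgb blendOver(Rgb video, Rgb cursor, float alpha) {
    Rgb out = {mixf(video.r, cursor.r, alpha),
               mixf(video.g, cursor.g, alpha),
               mixf(video.b, cursor.b, alpha)};
    return out;
}
```

A fully transparent cursor pixel (alpha 0) leaves the video untouched and a fully opaque one (alpha 1) replaces it, so the cursor's own alpha channel decides its shape on screen. The guard in `toCursorCoordinate` is worth noting: a zero-size rectangle would divide by zero, so on the CPU side it is safer to disable blending when the cursor rectangle is empty.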

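5. Summary

The walkthrough above should give a basic picture of how to render a frame to the screen with Metal. It is the Metal framework's encapsulation that lets us use GPU resources for rendering so easily. Although Metal rendering for video processing is usually 2D, understanding how it works is also a useful reference for 3D rendering. Of course, 3D rendering involves a lot more work, such as configuring vertex normals, lighting, and the blending of texture materials with depth. That work is usually handed to a 3D engine: the modeler only needs to prepare the model, and after loading it the engine parses the data and configures the GPU.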

How to merge and render video frames with Metal
