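Computer Vision Image Creative Analysis Using the Google Vision API

Introduction

Computer vision can be used to extract useful information from images and videos. It lets a computer see and understand what can be gathered from visual input. After receiving visual input, it can collect valuable information from the image and determine the next step to take.

The Google Vision API is a Google Cloud service that uses computer vision to extract valuable information from image input. As a beginner, you can use this service to gain meaningful insights into images. The figure below shows how the Google Vision API works.

The figure above shows what the Google Vision API can do. It can recognize facial expressions, text, and dominant colors in an advertising image. The facial expression result clearly shows a person's joyful expression, the text result captures the words "LEARN MORE", and the dominant color result lists the top 10 dominant colors in the image.

As we can see, the Google Vision API lets us draw many insights from an image. For example, suppose we want to know which factors in an advertising image lead customers to click and view our ad. The Google Vision API can help uncover them.

This article focuses on how to extract insight factors from an image and what insights we can get from a particular image. We will not use advertising images as examples, because they cannot be published for reasons of company confidentiality. Instead, we will use product images from a Kaggle dataset that is available for data analysis.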

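Dataset

The images for this project come from the Fashion Product Images Dataset on Kaggle. Because the dataset contains a large number of product images from an e-commerce website, we take only a small subset of images for our creative analysis. The dataset license allows you to copy, modify, distribute, and perform the work.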

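Setting Up the Google Cloud Vision API

Before starting, we must first configure the Vision API service in Google Cloud. Google provides step-by-step instructions in its documentation. To make things easier, however, we will walk through how to set up the API from Google Cloud one step at a time.

(Note: You must configure this API from your own Google Cloud account; this tutorial does not provide a file containing a confidential Google Cloud key.)

Step 1: Log in to your Google Cloud project, then select "Go to APIs overview" from the home page.

Step 2: Select "Enable APIs and Services", then search for and enable the Cloud Vision API.

Step 3: Go to Credentials, click "Create Credentials", then click Service account.

Step 4: Enter your service account details (you can skip the optional parts), then click "Done".

Step 5: Navigate to the service account you created. Go to KEYS, then "ADD KEY" and "Create new key".

Step 6: Choose the JSON key type, then download the JSON file and place it in the working directory of your Python script.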
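
Before moving on, it can help to confirm that the downloaded key file is where the script expects it and looks like a service account key. The sketch below is only a sanity check using the standard library; the function name check_service_account_key and the list of required fields are assumptions made for illustration, not part of any Google library.

```python
import json
import os

# Fields a service account JSON key normally contains (an assumed, partial list).
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(path):
    """Return a list of problems found in the key file (empty if it looks fine)."""
    if not os.path.isfile(path):
        return ["file not found: {}".format(path)]
    try:
        with open(path, "r", encoding="utf-8") as key_file:
            key = json.load(key_file)
    except ValueError:
        return ["file is not valid JSON"]
    if not isinstance(key, dict):
        return ["file does not contain a JSON object"]
    problems = ["missing field: {}".format(name)
                for name in REQUIRED_FIELDS if name not in key]
    if "type" in key and key["type"] != "service_account":
        problems.append('"type" is not "service_account"')
    return problems


# "vision-api.json" is the key file name used later in this article.
print(check_service_account_key("vision-api.json") or "Key file looks usable")
```

An empty result only means the file has the expected shape; it does not prove that the key is valid or that the Cloud Vision API is enabled for the project.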

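Installing the Necessary Libraries

Before we begin the computer vision modeling, we must first install the required libraries. The first is google-cloud-vision, the client library used for computer vision detection. We can use this library once we have access to the Google Cloud Vision API.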

pip install google-cloud-vision

The next library is webcolors, which is useful when we need to convert the hex color values from color detection into the closest color name we know.

pip install webcolors

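Importing the Necessary Libraries

After installing the necessary libraries, we import them into the script. We import vision from the Google Cloud library for the vision detection. For data preprocessing, other libraries such as IPython, io, and pandas are used.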

from IPython.display import Image, display
from google.cloud import vision_v1 as vision
import io
import pandas as pd
import os

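webcolors is used to convert the hex color format into familiar color names. KDTree is used to find the closest color match in the CSS3 color table. A KDTree indexes a set of k-dimensional points and can be used to quickly find the nearest neighbor of any point.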

from scipy.spatial import KDTree
from webcolors import hex_to_rgb
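# Note: newer webcolors releases may no longer export CSS3_HEX_TO_NAMES; an older release may be needed
from webcolors import CSS3_HEX_TO_NAMES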

Activating the Google Vision API in the Python Script

After placing the JSON file in the directory, we must activate the Google Cloud Vision API service in the Python script.

# Activate the Google Vision API using the service account key
client = vision.ImageAnnotatorClient.from_service_account_json("vision-api.json")

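Label Detection

Label detection can be used to detect any labels in an image. LabelAnnotation can identify general objects, locations, activities, products, and other things in the image.

The code below shows how to extract label information from an image in the fashion dataset.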

# Import the picture
pics = ["pic/label.png"]

for pic in pics:
    print("=" * 79)
    print("File:", pic)
    display(Image(pic, width = 500))
    
    with io.open(pic, "rb") as image_file:
        parse = image_file.read()
    
    query = {"image": {"content": parse},
             "features": [{"type_": "LABEL_DETECTION"}]}
    
    response = client.annotate_image(query)

    # Label detection
    labels = response.label_annotations
    print("Labels:")
    if labels:
        for index, label in enumerate(labels):
            if index != len(labels) - 1:
                print(label.description, end = ", ")
            else:
                print(label.description, end = "\n\n")
    else:
        print("[None]", end = "\n\n")

===============================================================================
File: pic/label.png

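In this image, the Google Vision API detected several generic labels, such as:

Facial expression (smile)
Human body (face, joint, skin, arm, shoulder, leg, human body, sleeve)
Object (shoe)

Although Vision identified many labels, some general objects were misidentified or not mentioned at all. Vision mistook the sandals for shoes. It also failed to recognize the clothing, the leafy plant, the cup, and the chair in the image above.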
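
Each label returned by the API carries a score alongside its description, so one way to reduce the impact of weak or wrong labels is to filter by score before printing. The sketch below illustrates the idea with stand-in objects rather than a live API response; the Label tuple, the format_labels helper, and the sample scores are all made up for illustration.

```python
from collections import namedtuple

# Stand-in for the API's label objects; only the two fields used here.
Label = namedtuple("Label", ["description", "score"])


def format_labels(labels, min_score=0.0):
    """Join the descriptions of labels scoring at least min_score into one line."""
    kept = [label.description for label in labels if label.score >= min_score]
    return ", ".join(kept) if kept else "[None]"


# Illustrative values only, not taken from a real response.
sample = [Label("Smile", 0.95), Label("Shoe", 0.72), Label("Sleeve", 0.55)]
print(format_labels(sample))        # Smile, Shoe, Sleeve
print(format_labels(sample, 0.7))   # Smile, Shoe
```

Using ", ".join also replaces the index bookkeeping needed to avoid a trailing comma. The right threshold depends on the images and would need to be tuned by hand.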

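Object Detection

Object detection can be used to detect any objects in an image. Unlike label detection, object detection is mainly concerned with the confidence level of each detection. LocalizedObjectAnnotation scans the image for multiple objects and reports each object's location as a rectangular boundary.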

# Import the picture
pics = ["pic/object.png"]

for pic in pics:
    print("=" * 79)
    print("File:", pic)
    display(Image(pic, width = 500))
    
    with io.open(pic, "rb") as image_file:
        parse = image_file.read()
    
    query = {"image": {"content": parse},
             "features": [{"type_": "OBJECT_LOCALIZATION"}]}
    
    response = client.annotate_image(query)

    # Object Localization detection
    objects = response.localized_object_annotations
    if objects:
        print("Number of objects found: {}".format(len(objects)))
        for obj in objects:
            print("{} (Confidence: {})".format(obj.name, obj.score))
            print("  Normalized bounding polygon vertices: ")
            for vertex in obj.bounding_poly.normalized_vertices:
                print("  • ({}, {})".format(vertex.x, vertex.y))
    else:
        print("[None]")

===============================================================================
File: pic/object.png

Number of objects found: 7
Sunglasses (Confidence: 0.9075868129730225)
  Normalized bounding polygon vertices: 
  • (0.7152940630912781, 0.03025873191654682)
  • (0.9912452101707458, 0.03025873191654682)
  • (0.9912452101707458, 0.18952979147434235)
  • (0.7152940630912781, 0.18952979147434235)
Necklace (Confidence: 0.827012300491333)
  Normalized bounding polygon vertices: 
  • (0.7888254523277283, 0.23976582288742065)
  • (0.8789485096931458, 0.23976582288742065)
  • (0.8789485096931458, 0.5786653757095337)
  • (0.7888254523277283, 0.5786653757095337)
Necklace (Confidence: 0.765210747718811)
  Normalized bounding polygon vertices: 
  • (0.6977918744087219, 0.6217525601387024)
  • (0.9729085564613342, 0.6217525601387024)
  • (0.9729085564613342, 0.8596880435943604)
  • (0.6977918744087219, 0.8596880435943604)
Miniskirt (Confidence: 0.7599698305130005)
  Normalized bounding polygon vertices: 
  • (0.08722742646932602, 0.5060120224952698)
  • (0.5405347347259521, 0.5060120224952698)
  • (0.5405347347259521, 0.9896653294563293)
  • (0.08722742646932602, 0.9896653294563293)
Shirt (Confidence: 0.7503857612609863)
  Normalized bounding polygon vertices: 
  • (0.026824740692973137, 0.01114667858928442)
  • (0.639977216720581, 0.01114667858928442)
  • (0.639977216720581, 0.5381563901901245)
  • (0.026824740692973137, 0.5381563901901245)
Clothing (Confidence: 0.70070880651474)
  Normalized bounding polygon vertices: 
  • (0.014639005064964294, 0.006572074722498655)
  • (0.6691749095916748, 0.006572074722498655)
  • (0.6691749095916748, 0.9973958134651184)
  • (0.014639005064964294, 0.9973958134651184)
Necklace (Confidence: 0.5183205604553223)
  Normalized bounding polygon vertices: 
  • (0.23560701310634613, 0.0359032079577446)
  • (0.42432689666748047, 0.0359032079577446)
  • (0.42432689666748047, 0.25690552592277527)
  • (0.23560701310634613, 0.25690552592277527)

In this image, the Google Vision API detected the following objects:

Sunglasses (Confidence: 91%)

Necklace 1 (Confidence: 83%)

Necklace 2 (Confidence: 77%)

Miniskirt (Confidence: 76%)

Shirt (Confidence: 75%)

Clothing (Confidence: 70%)

Necklace 3 (Confidence: 52%)

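Vision recognized most of the items in the image above as clothing and accessories: the sunglasses, Necklace 1, Necklace 2, the miniskirt, the shirt, and the general Clothing category. Necklace 3 has the lowest confidence, because Vision treated the item in the lower right corner of the image as a necklace as well. Since the Necklace 3 object looks more like a bracelet than a necklace, its confidence is lower than that of the other objects.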
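
The vertices in the output above are normalized to the 0 to 1 range, so they have to be scaled by the image size before they can be used to crop or draw a box. The sketch below shows that conversion; the helper name to_pixel_box and the 1000 x 1000 pixel image size are assumptions for illustration, since the real image dimensions are not given in this article.

```python
def to_pixel_box(normalized_vertices, width, height):
    """Convert (x, y) pairs in the 0-1 range to a (left, top, right, bottom) pixel box."""
    xs = [x * width for x, _ in normalized_vertices]
    ys = [y * height for _, y in normalized_vertices]
    return (round(min(xs)), round(min(ys)), round(max(xs)), round(max(ys)))


# Vertices of the Sunglasses object, copied from the output above.
sunglasses = [
    (0.7152940630912781, 0.03025873191654682),
    (0.9912452101707458, 0.03025873191654682),
    (0.9912452101707458, 0.18952979147434235),
    (0.7152940630912781, 0.18952979147434235),
]

# The 1000 x 1000 size is hypothetical; use the real image dimensions in practice.
print(to_pixel_box(sunglasses, 1000, 1000))   # (715, 30, 991, 190)
```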

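Facial Expression Detection

Face detection can detect any faces in an image along with their emotions. FaceAnnotation locates each face in the image and, while doing so, also estimates a range of facial expressions.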

# Import the picture
pics = ["pic/face_expression.png"]

for pic in pics:
    print("=" * 79)
    print("File:", pic)
    display(Image(pic, width = 500))
    
    with io.open(pic, "rb") as image_file:
        parse = image_file.read()
    
    query = {"image": {"content": parse},
             "features": [{"type_": "FACE_DETECTION"}]}
    
    response = client.annotate_image(query)

    # Face expression detection
    faces = response.face_annotations
    likelihood_name = ("UNKNOWN", "VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY")
    if faces:
        for index, face in enumerate(faces):
            print("Face", index + 1)
            print("Joy: {}".format(likelihood_name[face.joy_likelihood]))
            print("Sorrow: {}".format(likelihood_name[face.sorrow_likelihood]))
            print("Anger: {}".format(likelihood_name[face.anger_likelihood]))
            print("Surprise: {}".format(likelihood_name[face.surprise_likelihood]))
            vertices = (["({},{})".format(vertex.x, vertex.y)
                        for vertex in face.bounding_poly.vertices])
            print("Face Bounds: {}".format(", ".join(vertices)), end = "\n\n")

===============================================================================
File: pic/face_expression.png

Face 1
Joy: VERY_LIKELY
Sorrow: VERY_UNLIKELY
Anger: VERY_UNLIKELY
Surprise: VERY_UNLIKELY
Face Bounds: (115,0), (240,0), (240,141), (115,141)

From the image above, the Google Vision API detected the following expression likelihoods on the face:

Joy: VERY_LIKELY

Sorrow: VERY_UNLIKELY

Anger: VERY_UNLIKELY

Surprise: VERY_UNLIKELY

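The expression in the image above is a smile, which the Vision API recognized as joy. The other expressions (sorrow, anger, and surprise) do not match the image, because the person is not showing those emotions.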
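
The likelihood values are ordered from UNKNOWN (0) to VERY_LIKELY (5), so the strongest expression of a face can be picked by comparing the numbers. The sketch below works on a plain dictionary rather than a live FaceAnnotation; the strongest_emotion helper and its default threshold of POSSIBLE are assumptions chosen for illustration.

```python
LIKELIHOOD_NAMES = ("UNKNOWN", "VERY_UNLIKELY", "UNLIKELY",
                    "POSSIBLE", "LIKELY", "VERY_LIKELY")


def strongest_emotion(likelihoods, threshold=3):
    """Return (emotion, likelihood name) for the highest value, or None below threshold.

    likelihoods maps an emotion name to an integer likelihood from 0 to 5.
    """
    emotion, value = max(likelihoods.items(), key=lambda item: item[1])
    if value < threshold:
        return None
    return emotion, LIKELIHOOD_NAMES[value]


# Face 1 from the output above, written as integer likelihoods.
face_1 = {"joy": 5, "sorrow": 1, "anger": 1, "surprise": 1}
print(strongest_emotion(face_1))   # ('joy', 'VERY_LIKELY')
```

Returning None when nothing reaches the threshold makes a face with no clear expression easy to tell apart from a confident result.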

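Text Detection

TextAnnotation can be used to detect and extract text from an image. The extracted text includes individual words and sentences, each with its rectangular bounding box.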

# Import the picture
pics = ["pic/text.png"]

for pic in pics:
    print("=" * 79)
    print("File:", pic)
    display(Image(pic, width = 500))
    
    with io.open(pic, "rb") as image_file:
        parse = image_file.read()
    
    query = {"image": {"content": parse},
             "features": [{"type_": "TEXT_DETECTION"}]}
    
    response = client.annotate_image(query)

    # Text detection
    texts = response.text_annotations
    print("Texts:")
    if texts:
        for text in texts:
            print("\"{}\"".format(text.description))
            vertices = (["({},{})".format(vertex.x, vertex.y)
                        for vertex in text.bounding_poly.vertices])
            print("  • Text Bounds: {}".format(", ".join(vertices)))
        print("\n")
    else:
        print("[None]", end = "\n\n")

===============================================================================
File: pic/text.png

Texts:
"文化
THISIS
WHAT
awesome
LOOKS LIKE"
  • Text Bounds: (52,101), (249,101), (249,423), (52,423)
"文化"
  • Text Bounds: (67,101), (73,127), (59,130), (53,104)
"THISIS"
  • Text Bounds: (240,356), (248,399), (231,402), (223,359)
"WHAT"
  • Text Bounds: (223,360), (229,397), (214,400), (207,363)
"awesome"
  • Text Bounds: (204,354), (214,410), (205,412), (195,356)
"LOOKS"
  • Text Bounds: (189,342), (199,392), (182,395), (172,345)
"LIKE"
  • Text Bounds: (199,392), (204,419), (188,423), (182,395)

In this image, the Google Vision API detected several pieces of text, including:

文化

THISIS

WHAT

awesome

LOOKS

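For some reason, Vision identified a Chinese word, 文化.

The results show that Vision detects both uppercase and lowercase words. It also returned the word "THISIS", which should have been "THIS IS". A limitation of the Vision API, then, is that it reads "THIS IS" as "THISIS" when the words are spaced too closely together.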
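
In the output above, the first entry holds the whole text block and the remaining entries are the individual words. The sketch below separates the two and drops words that contain non-ASCII characters, which is one simple way to discard a stray result such as 文化 when only English text is expected. The helper names are hypothetical, and the ASCII filter is an assumption that would not suit images with genuinely multilingual text.

```python
def split_annotations(descriptions):
    """Split text descriptions into (full text block, list of individual words)."""
    if not descriptions:
        return "", []
    return descriptions[0], list(descriptions[1:])


def ascii_words(words):
    """Keep only words made entirely of ASCII characters."""
    return [word for word in words if word.isascii()]


# Descriptions copied from the output above.
descriptions = ["文化\nTHISIS\nWHAT\nawesome\nLOOKS LIKE",
                "文化", "THISIS", "WHAT", "awesome", "LOOKS", "LIKE"]

full_text, words = split_annotations(descriptions)
print(ascii_words(words))   # ['THISIS', 'WHAT', 'awesome', 'LOOKS', 'LIKE']
```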

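Dominant Color Detection

Dominant color detection is one of the features of image properties annotation. It can detect the top ten dominant colors in an image, along with their scores and pixel fractions.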

# Import the picture
pics = ["pic/color.png"]

# Convert a 0-255 color channel value (returned as a float) to a two-digit hex string
def float2hex(val):
    return "{:02X}".format(int(val))

for pic in pics:
    print("=" * 79)
    print("File:", pic)
    display(Image(pic, width = 500))
    
    with io.open(pic, "rb") as image_file:
        parse = image_file.read()
    
    query = {"image": {"content": parse},
             "features": [{"type_": "IMAGE_PROPERTIES"}]}
    
    response = client.annotate_image(query)

    # Color detection        
    props = response.image_properties_annotation
    if props.dominant_colors.colors:
        for index, color in enumerate(props.dominant_colors.colors):
            print("Color", index, "Fraction: {}".format(color.pixel_fraction))
            print("Hex Color: ",float2hex(color.color.red) + float2hex(color.color.green)+float2hex(color.color.blue))
        print("")
    else:
        print("[None]", end = "\n\n")

===============================================================================
File: pic/color.png

Color 0 Fraction: 0.16305483877658844
Hex Color:  141313
Color 1 Fraction: 0.010103825479745865
Hex Color:  A41B24
Color 2 Fraction: 0.010382551699876785
Hex Color:  68191E
Color 3 Fraction: 0.0016026757657527924
Hex Color:  CD2533
Color 4 Fraction: 0.25266531109809875
Hex Color:  EDE9E8
Color 5 Fraction: 0.004111211746931076
Hex Color:  B79082
Color 6 Fraction: 0.009755417704582214
Hex Color:  9D2735
Color 7 Fraction: 0.00689847394824028
Hex Color:  883339
Color 8 Fraction: 0.0002090446650981903
Hex Color:  153019
Color 9 Fraction: 0.0018814019858837128
Hex Color:  BE182A

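The Google Vision API detected ten different dominant colors, shown above in hex format. To get real color names, we must convert the hex values to color names using the CSS3 color table. We then use KDTree to find the closest familiar color in that table.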

def convert_rgb_to_names(rgb_tuple):
    # a dictionary of all the hex and their respective names in css3
    css3_db = CSS3_HEX_TO_NAMES
    names = []
    rgb_values = []
    for color_hex, color_name in css3_db.items():
        names.append(color_name)
        rgb_values.append(hex_to_rgb(color_hex))
    
    kdt_db = KDTree(rgb_values)
    distance, index = kdt_db.query(rgb_tuple)
    return f'{names[index]}'

# Detect the hex color to the real color name
h = 'A41B24'
print(h)
print('RGB =', tuple(int(h[i:i+2], 16) for i in (0, 2, 4)))
j = tuple(int(h[i:i+2], 16) for i in (0, 2, 4))
print(convert_rgb_to_names(j))

A41B24
RGB = (164, 27, 36)
firebrick

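We use the hex color A41B24, the second dominant color, as an example. Using the function above, we find that the closest color in the CSS3 table is firebrick. This matches the red of the sneakers in the image above.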
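
As a cross-check that does not depend on webcolors or SciPy, the same nearest-color idea can be written as a brute-force search over squared RGB distance. The sketch below uses a small hand-picked subset of CSS3 named colors, so it is only an illustration: with the full table the closest name for other inputs may differ. The names CSS3_SUBSET, hex_to_rgb_tuple, and nearest_color_name are hypothetical.

```python
# A few CSS3 named colors and their RGB values (the full table is much larger).
CSS3_SUBSET = {
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "red": (255, 0, 0),
    "maroon": (128, 0, 0),
    "darkred": (139, 0, 0),
    "brown": (165, 42, 42),
    "firebrick": (178, 34, 34),
    "crimson": (220, 20, 60),
}


def hex_to_rgb_tuple(hex_color):
    """Convert 'A41B24' or '#A41B24' to an (r, g, b) tuple of integers."""
    hex_color = hex_color.lstrip("#")
    return tuple(int(hex_color[i:i + 2], 16) for i in (0, 2, 4))


def nearest_color_name(rgb, palette=CSS3_SUBSET):
    """Return the palette name with the smallest squared RGB distance to rgb."""
    def squared_distance(name):
        return sum((a - b) ** 2 for a, b in zip(rgb, palette[name]))
    return min(palette, key=squared_distance)


print(hex_to_rgb_tuple("A41B24"))                       # (164, 27, 36)
print(nearest_color_name(hex_to_rgb_tuple("A41B24")))   # firebrick
```

For a table of this size a linear scan is fast enough; a KDTree mainly pays off when many lookups are made against a large set of points.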

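Conclusion

We have carried out computer vision modeling using label, object, facial expression, text, and dominant color detection in the creative analysis above. After running the Vision API, we found that each detection annotation still has a number of limitations.

Label detection: It can detect many general objects and facial expressions in a picture, but some objects are misidentified (for example, in our analysis it misidentified sandals as shoes).
Object detection: Misidentification also occurs here, but it can be anticipated by checking the detection confidence level (for example, in our analysis it misidentified a bracelet as another necklace).
Facial expression detection: The image clearly shows the boy's happy expression. However, if you use an image in which the expression is not visible, all expressions are classified as very unlikely, because Vision cannot determine which expression it is.
Text detection: Text can be detected using text annotation, but one unwanted piece of text was included in our results (for example, in our analysis it detected the Chinese word "文化" even though there is no Chinese text in the picture).
Dominant color detection: It can detect multiple dominant colors in the image above, but the result is only available in RGB or hex format. To get familiar color names, you must add a function that converts hex colors to color names.

If you would like more details on the code used in this modeling approach, you can check out the GitHub repository.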
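
All five detections in this article send the same kind of request and differ only in the feature type, so the request dictionary can be built by one helper that lists several features at once. The sketch below only constructs the dictionary, in the same shape as the queries used above; build_query is a hypothetical helper, and passing its result to client.annotate_image with several features has not been run against the API here.

```python
# The five feature types used in this article.
FEATURE_TYPES = ("LABEL_DETECTION", "OBJECT_LOCALIZATION", "FACE_DETECTION",
                 "TEXT_DETECTION", "IMAGE_PROPERTIES")


def build_query(image_bytes, feature_types=FEATURE_TYPES):
    """Build a request dict in the same shape as the queries used in this article."""
    return {"image": {"content": image_bytes},
            "features": [{"type_": name} for name in feature_types]}


# Placeholder bytes; in practice this would be the content read from an image file.
query = build_query(b"placeholder image bytes")
print([feature["type_"] for feature in query["features"]])
```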

https://github.com/nugrahazikry/Simple-Computer-Vision-Image-Creative-Analysis-using-Google-Vision-API

(Note: The Google Cloud key is not available in the repository; you must create your own using the steps above.)