Import HTML annotations

How to import annotations on HTML data and sample import formats.

You can use the Python SDK to import annotations on HTML data.

This page shows how to declare different annotation types (as Python dictionaries and NDJSON objects) and demonstrates the import process.

A Python notebook demonstrates these steps and can be run directly in Google Colab.

Supported annotations

To import annotations in Labelbox, you need to create the annotations payload. This section shows how to declare annotations for each supported annotation type.

You can declare annotations as Python SDK annotation types (preferred) or as NDJSON objects.

Classification: Radio (single-choice)

radio_annotation = lb_types.ClassificationAnnotation(
    name="radio_html",
    value=lb_types.Radio(
        answer=lb_types.ClassificationAnswer(name="second_radio_answer")
    )
)
radio_annotation_ndjson = {
    'name': 'radio_html',
    'answer': {'name': 'second_radio_answer'}
}

Classification: Checklist (multi-choice)

checklist_annotation = lb_types.ClassificationAnnotation(
    name="checklist_html",  # must match your ontology feature's name
    value=lb_types.Checklist(
        answer=[
            lb_types.ClassificationAnswer(name="first_checklist_answer"),
            lb_types.ClassificationAnswer(name="second_checklist_answer")
        ]
    )
)
checklist_annotation_ndjson = {
    'name': 'checklist_html',
    'answers': [
        {'name': 'first_checklist_answer'},
        {'name': 'second_checklist_answer'}
    ]
}

Classification: Free-form text

text_annotation = lb_types.ClassificationAnnotation(
    name="text_html",
    value=lb_types.Text(answer="sample text")
)
text_annotation_ndjson = {
    'name': 'text_html',
    'answer': 'sample text',
}
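
The NDJSON objects above are ordinary Python dictionaries. As a minimal sketch of what "NDJSON" means in practice, the snippet below serializes such dictionaries as one JSON object per line using only the standard library. The `to_ndjson` helper is a hypothetical name introduced for illustration and is not part of the Labelbox SDK; the dictionaries are copies of the ones declared in this section.

```python
import json

# Copies of the NDJSON objects declared above, repeated so the sketch runs on its own.
text_annotation_ndjson = {"name": "text_html", "answer": "sample text"}
checklist_annotation_ndjson = {
    "name": "checklist_html",
    "answers": [
        {"name": "first_checklist_answer"},
        {"name": "second_checklist_answer"},
    ],
}


def to_ndjson(rows):
    """Hypothetical helper: serialize dictionaries as one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)


ndjson_text = to_ndjson([text_annotation_ndjson, checklist_annotation_ndjson])
print(ndjson_text)
```

Each line of the result parses back into the original dictionary with `json.loads`, which makes it easy to inspect a payload before importing it.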

Example: Import prelabels or ground truth

The steps for importing annotations as prelabels (model-assisted labeling) closely match those for importing ground truth labels. The two scenarios differ in Steps 5 and 6, which describe what changes for each.

Before you start

You will need to import these libraries to use the code examples in this section.

import labelbox as lb
import uuid
import labelbox.types as lb_types

Replace API key

API_KEY = ""
client = lb.Client(API_KEY)

Step 1: Import data rows

The data row must be uploaded to Catalog before attaching annotations.

This example shows how to create an HTML data row.

global_key = "sample_html_1.html"

asset = {
    "row_data": "https://storage.googleapis.com/labelbox-datasets/html_sample_data/sample_html_1.html",
    "global_key": global_key
}

dataset = client.create_dataset(
    name="html_annotation_import_demo_dataset", 
    iam_integration=None # Removing this argument will default to the organization's default IAM integration
) 
task = dataset.create_data_rows([asset])
task.wait_till_done()
print("Errors:", task.errors)
print("Failed data rows: ", task.failed_data_rows)

Step 2: Create an ontology

Your project ontology should include all tools and classifications required by your annotations. So that each annotation can be matched to a schema feature, the tool and classification names in the ontology must match the name fields in your annotations.

To illustrate, suppose you set name to text_html when you created the text annotation. When creating the ontology, the same value is used in the name field of the text classification. The same process must be followed for each tool and classification created in the ontology.

ontology_builder = lb.OntologyBuilder(
  classifications=[ 
    lb.Classification( 
      class_type=lb.Classification.Type.TEXT,
      name="text_html"), 
    lb.Classification( 
      class_type=lb.Classification.Type.CHECKLIST,                   
      name="checklist_html", 
      options=[
        lb.Option(value="first_checklist_answer"),
        lb.Option(value="second_checklist_answer")            
      ]
    ), 
    lb.Classification( 
      class_type=lb.Classification.Type.RADIO, 
      name="radio_html", 
      options=[
        lb.Option(value="first_radio_answer"),
        lb.Option(value="second_radio_answer")
      ]
    )
  ]
)

ontology = client.create_ontology("Ontology HTML Annotations", ontology_builder.asdict(), media_type=lb.MediaType.Html)
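
Because a mismatched name is easy to miss, it can help to compare annotation names against the ontology names before importing. The following is a minimal sketch that works on plain data and makes no SDK calls; `find_unmatched_names` is a hypothetical helper, the ontology names are written out by hand from the builder above, and the third annotation contains a deliberate typo to show what a mismatch looks like.

```python
# Classification names declared in the ontology above, listed by hand.
ontology_names = {"text_html", "checklist_html", "radio_html"}


def find_unmatched_names(annotations, ontology_names):
    """Hypothetical helper: return annotation names with no same-named ontology feature."""
    return sorted({a["name"] for a in annotations} - set(ontology_names))


annotations = [
    {"name": "text_html", "answer": "sample text"},
    {"name": "checklist_html", "answers": [{"name": "first_checklist_answer"}]},
    {"name": "radio_htm", "answer": {"name": "second_radio_answer"}},  # deliberate typo
]

unmatched = find_unmatched_names(annotations, ontology_names)
print("Names missing from the ontology:", unmatched)
```

An empty list means every annotation name has a counterpart in the ontology. This sketch only checks top-level names and does not validate answer options.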

Step 3: Create labeling project

Connect the ontology to the labeling project.

project = client.create_project(
    name="html_project",
    media_type=lb.MediaType.Html
)

# Set up your ontology
project.setup_editor(ontology)

Step 4: Send data rows to project

batch = project.create_batch(
  "first-batch-html-demo", # Each batch in a project must have a unique name
  global_keys=[global_key], # List of global keys identifying the data rows to send
  priority=5 # Priority from 1 (highest) to 5 (lowest)
)

print("Batch: ", batch)

Step 5: Create annotation payloads

Use the earlier examples for help creating annotation payloads.

The following code composes the annotations declared earlier into labels attached to the data row, first as Python annotation types and then as NDJSON objects.

label = []
label.append(
  lb_types.Label(
    data=lb_types.HTMLData(
      global_key=global_key
    ),
    annotations=[
      text_annotation,
      checklist_annotation,
      radio_annotation
    ]
  )
)
label_ndjson = []
for annotations in [text_annotation_ndjson,
                    checklist_annotation_ndjson,
                    radio_annotation_ndjson]:
  annotations.update({
      'dataRow': {
          'globalKey': global_key
      }
  })
  label_ndjson.append(annotations)
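
Note that the NDJSON loop above calls `update` on the original dictionaries, so each one ends up tied to a single data row. If the same annotation needs to be attached to several data rows, one option is to build copies instead. This is a minimal sketch under that assumption; `attach_data_row` is a hypothetical helper and the second global key is made up for illustration.

```python
def attach_data_row(annotation, global_key):
    """Hypothetical helper: return a shallow copy of the annotation with a dataRow reference."""
    return {**annotation, "dataRow": {"globalKey": global_key}}


text_annotation_ndjson = {"name": "text_html", "answer": "sample text"}

# The second key is hypothetical; it stands in for another uploaded data row.
global_keys = ["sample_html_1.html", "sample_html_2.html"]

rows = [attach_data_row(text_annotation_ndjson, key) for key in global_keys]
for row in rows:
    print(row)
```

The copy is shallow, so nested values such as checklist answers are still shared between rows; that is fine as long as they are not modified afterwards.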

Step 6: Import annotation payload

Whether you're uploading annotations as prelabels (model-assisted labeling) or as ground truth labels, pass your annotation payload as the value of the predictions or labels parameter, respectively.

Option A: Upload as prelabels (model-assisted labeling)

upload_job = lb.MALPredictionImport.create_from_objects(
    client = client, 
    project_id = project.uid, 
    name=f"mal_job-{str(uuid.uuid4())}", 
    predictions=label)

upload_job.wait_until_done()
print("Errors:", upload_job.errors)
print("Status of uploads: ", upload_job.statuses)

Option B: Upload as ground truth

upload_job = lb.LabelImport.create_from_objects(
    client = client,
    project_id = project.uid,
    name="label_import_job"+str(uuid.uuid4()),
    labels=label_ndjson)

upload_job.wait_until_done()
print("Errors:", upload_job.errors)
print("Status of uploads: ", upload_job.statuses)
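
For larger imports, printing the raw results can be hard to read. The sketch below summarizes a list of status entries, assuming each entry is a dictionary with a `status` key and, on failure, an `errors` list. That shape is an assumption and should be checked against the output of your SDK version; `summarize_statuses` is a hypothetical helper, and the sample entries are invented for illustration rather than real import output.

```python
from collections import Counter


def summarize_statuses(statuses):
    """Hypothetical helper: count entries per status and collect those carrying errors."""
    counts = Counter(entry.get("status", "UNKNOWN") for entry in statuses)
    failures = [entry for entry in statuses if entry.get("errors")]
    return counts, failures


# Invented sample shaped like the assumed status entries; not real import output.
sample_statuses = [
    {"uuid": "a", "status": "SUCCESS"},
    {"uuid": "b", "status": "FAILURE", "errors": [{"message": "hypothetical error"}]},
]

counts, failures = summarize_statuses(sample_statuses)
print(dict(counts))
for entry in failures:
    print(entry["uuid"], entry["errors"])
```

With a real job, the list passed in would be `upload_job.statuses` after the job has finished.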