{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "I start by gathering all the **IMPORTS** needed to run the notebook, for clarity." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# ESSENTIAL IMPORTS\n", "\n", "# For parsing the XML -- standard-library module; the more general lxml package offers a largely compatible API\n", "import xml.etree.ElementTree as ET\n", "# Utilities for reading/writing CSV files\n", "import csv\n", "# Utilities for handling character encodings\n", "import unicodedata\n", "# Ordered dictionaries\n", "from collections import OrderedDict\n", "\n", "\n", "# OPTIONAL IMPORTS\n", "\n", "# To estimate the speed of the various instructions\n", "from datetime import datetime\n", "# Random number generator -- always handy when testing\n", "from random import *\n", "# May be useful for some tests\n", "import sys" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# FUNCTIONS\n", "\n", "**ElementTree** has a built-in method, **iter**, that walks (very quickly) over all the 'nodes' of the data tree representing the XML. Unfortunately, *iter* does not keep track of the 'parent' nodes.\n", "\n", "I therefore extended the library with my own version of *iter*, **'traceElems'**, which should give us everything we need.\n", "\n", "*traceElems* walks all the nodes in the tree while keeping track of their 'parents', and returns every node for which the 'condition' function argument returns True. It does **NOT** descend into the **children** of the nodes it returns." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# The BASE function: traceElems\n", "def traceElems(node: ET.Element, condition, parents: list = [], coords: list = []):\n", "    res = []\n", "    jj = 0\n", "    for child in node:\n", "        if condition(child):\n", "            res.append({'a_par': parents+[node],\n", "                        'coords': coords+[jj], 'child': child})\n", "        else:\n", "            res = res + traceElems(child, condition, parents+[node], coords+[jj])\n", "        jj = jj+1\n", "    return res\n", "\n", "# Base predicate for stopping traceElems\n", "def isLeafOrC(aa: ET.Element):\n", "    if(aa.tag=='c' or len(aa)==0):\n", "        return True\n", "    else:\n", "        return False" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Utility functions that only serve to display the data more clearly in the notebook.\n", "def shownode(node: ET.Element):\n", "    return (node.tag, node.attrib, node.text.replace('\\t','').replace('\\n','').strip() \\\n", "            if type(node.text) is str else '')\n", "\n", "def shownodelist(el: ET.Element):\n", "    return list(map(shownode, el))\n", "\n", "\n", "# Utility copied from the internet -- a 'multi-match' version of the str.index method:\n", "def indices(lst, element):\n", "    result = []\n", "    offset = -1\n", "    while True:\n", "        try:\n", "            offset = lst.index(element, offset+1)\n", "        except ValueError:\n", "            return result\n", "        result.append(offset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# LET'S GET TO WORK" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**CHANGE THESE TO MATCH YOUR COMPUTER**: input and output directories" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import_dir = '/Users/federicaspinelli/TEAMOVI/Parser/DATA/ASPO/XML/'\n", "export_dir = '/Users/federicaspinelli/TEAMOVI/Parser/DATA/ASPO/DATE/DATINI/'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I import the Datini XML file, timing how long it takes" ] }, { 
"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "34.2486310005188\n" ] } ], "source": [ "ts1 = datetime.timestamp(datetime.now())\n", "\n", "treeDatini = ET.parse(import_dir + 'export_aspoSt001--datini.xml')\n", "rootDatini = treeDatini.getroot()\n", "\n", "ts2 = datetime.timestamp(datetime.now())\n", "print(ts2 - ts1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I use *iter* to find all the nodes tagged **'c'** in the Datini file and read the\n", "value of their **'level'** attribute; I store all the *levels* in the variable **cLevs**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'otherlevel', 'series', 'subfonds', 'file', 'item', 'collection', 'fonds', 'subseries'}\n" ] } ], "source": [ "cLevs = set(map(lambda a : a.attrib['level'], rootDatini.iter('c')))\n", "print(cLevs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point I put the **traceElems** function to work: I record ALL the **'c'** nodes, grouping them by their **'level'** attribute, and print the number of elements per level and the elapsed time.\n", "\n", "**WATCH OUT:** because of how it is built, this routine does not look inside the levels it returns -- so it misses any sub-levels carrying the same label as the ones found during the first scan. The presence of such sub-levels has to be checked separately." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# of \"c\" tags, level otherlevel, first pass: 10\n", "# of \"c\" tags, level series, first pass: 35\n", "# of \"c\" tags, level subfonds, first pass: 14\n", "# of \"c\" tags, level file, first pass: 15449\n", "# of \"c\" tags, level item, first pass: 149085\n", "# of \"c\" tags, level collection, first pass: 1\n", "# of \"c\" tags, level fonds, first pass: 1\n", "# of \"c\" tags, level subseries, first pass: 1365\n", "\n", "Elapsed time: 12.053628206253052\n" ] } ], "source": [ "ts1 = datetime.timestamp(datetime.now())\n", "\n", "allCs = {}\n", "\n", "for label in cLevs:\n", "    def tempFilt(aa: ET.Element):\n", "        if(aa.tag=='c' and aa.attrib['level']==label):\n", "            return True\n", "        else:\n", "            return False\n", "\n", "    allCs[label] = traceElems(rootDatini, tempFilt)\n", "    print('# of \"c\" tags, level ' + label + ', first pass:', len(allCs[label]))\n", "\n", "print()\n", "print('Elapsed time:', datetime.timestamp(datetime.now()) - ts1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the processing is fairly fast (on my laptop) despite the size of the file.\n", "\n", "The problem of levels nested inside levels of the same name remains. Let's tackle it." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# of \"c\" tags, level otherlevel, first pass: 10\n", "# of \"c\" tags, level otherlevel, total: 10\n", "# of \"c\" tags, level series, first pass: 35\n", "# of \"c\" tags, level series, total: 61\n", "# of \"c\" tags, level subfonds, first pass: 14\n", "# of \"c\" tags, level subfonds, total: 14\n", "# of \"c\" tags, level file, first pass: 15449\n", "# of \"c\" tags, level file, total: 15449\n", "# of \"c\" tags, level item, first pass: 149085\n", "# of \"c\" tags, level item, total: 149085\n", "# of \"c\" tags, level collection, first pass: 1\n", "# of \"c\" tags, level collection, total: 4\n", "# of \"c\" tags, level fonds, first pass: 1\n", "# of \"c\" tags, level fonds, total: 1\n", "# of \"c\" tags, level subseries, first pass: 1365\n", "# of \"c\" tags, level subseries, total: 1477\n", "\n", "Elapsed time: 27.76724910736084\n" ] } ], "source": [ "ts1 = datetime.timestamp(datetime.now())\n", "\n", "allCs2 = {}\n", "\n", "for label in cLevs:\n", "    partial = allCs[label]\n", "    print('# of \"c\" tags, level ' + label + ', first pass:', len(partial))\n", "    allCs2[label] = partial\n", "    partialUpdate = []\n", "    while True:\n", "        def tempFilt(aa: ET.Element):\n", "            if(aa.tag=='c' and aa.attrib['level']==label):\n", "                return True\n", "            else:\n", "                return False\n", "        for node in partial:\n", "            partialUpdate = partialUpdate + traceElems(node['child'], tempFilt)\n", "        #print(len(partialUpdate))\n", "        partial = partialUpdate\n", "        if(len(partialUpdate)==0):\n", "            break\n", "        allCs2[label] = allCs2[label] + partial\n", "        partialUpdate = []\n", "\n", "    print('# of \"c\" tags, level ' + label + ', total:', len(allCs2[label]))\n", "\n", "print()\n", "print('Elapsed time:', datetime.timestamp(datetime.now()) - ts1)" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "At this point it becomes easy to display all the details of the various **'c'** elements, at any level; an example is given in the next cell. To display a different element, change the values of the variables *ii* and *level*." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "Level: otherlevel\n", "#: 1\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('did', {}, '')]\n", "[0, 0]\n", "('unittitle', {'encodinganalog': 'ISAD 1 - 2 title'}, 'Busta 1167')\n", "# of children: 0\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('processinfo', {}, ''), ('list', {}, ''), ('item', {}, '')]\n", "[1, 0, 0, 0]\n", "('date', {}, '19/03/2013')\n", "# of children: 0\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('processinfo', {}, ''), ('list', {}, ''), ('item', {}, '')]\n", "[1, 0, 0, 1]\n", "('persname', {}, 'Admi xDams - ope source')\n", "# of children: 0\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 0]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164307', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 1]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164308', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 2]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164309', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', 
{'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 3]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164310', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 4]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164311', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 5]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164312', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 6]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164313', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 7]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164314', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 8]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164315', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 9]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164316', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 10]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164317', 'level': 'item'}, '')\n", "# of children: 
4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 11]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164318', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 12]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164319', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 13]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164320', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 14]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164321', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 15]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164322', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 16]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164323', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 17]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164324', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 18]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164325', 'level': 'item'}, 
'')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 19]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164326', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 20]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164327', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 21]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164328', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 22]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164329', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 23]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164330', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 24]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164331', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 25]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164332', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 26]\n", "('c', {'audience': 'internal', 'id': 
'ASPO00164333', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 27]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164334', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 28]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164335', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 29]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164336', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 30]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164337', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 31]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164338', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 32]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164339', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 33]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164340', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 34]\n", "('c', 
{'audience': 'internal', 'id': 'ASPO00164341', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 35]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164342', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 36]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164343', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 37]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164344', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 38]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164345', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 39]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164346', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 40]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164347', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 41]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164348', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, 
'')]\n", "[2, 42]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164349', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 43]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164350', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 44]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164351', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 45]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164352', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 46]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164353', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 47]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164354', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 48]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164355', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 49]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164356', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 
'container'}, ''), ('dsc', {}, '')]\n", "[2, 50]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164357', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 51]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164358', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 52]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164359', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 53]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164360', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 54]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164361', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 55]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164362', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 56]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164363', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 57]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164364', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 
'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 58]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164365', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 59]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164366', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 60]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164367', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 61]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164368', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 62]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164369', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 63]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164370', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 64]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164371', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 65]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164372', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 
'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 66]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164373', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 67]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164374', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 68]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164375', 'level': 'item'}, '')\n", "# of children: 3\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 69]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164376', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 70]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164377', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 71]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164378', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 72]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164379', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 73]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164380', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', 
{'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 74]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164381', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 75]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164382', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 76]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164383', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 77]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164384', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 78]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164385', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 79]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164386', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 80]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164387', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 81]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164388', 'level': 'item'}, '')\n", "# of 
children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 82]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164389', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 83]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164390', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 84]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164391', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 85]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164392', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 86]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164393', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 87]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164394', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 88]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164395', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 89]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164396', 
'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 90]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164397', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 91]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164398', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 92]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164399', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 93]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164400', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 94]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164401', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 95]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164402', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 96]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164403', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 97]\n", "('c', {'audience': 
'internal', 'id': 'ASPO00164404', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 98]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164405', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 99]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164406', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 100]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164407', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 101]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164408', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 102]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164409', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 103]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164410', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 104]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164411', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", 
"[2, 105]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164412', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 106]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164413', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 107]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164414', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 108]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164415', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 109]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164416', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 110]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164417', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 111]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164418', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 112]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164419', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 
'container'}, ''), ('dsc', {}, '')]\n", "[2, 113]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164420', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 114]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164421', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 115]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164422', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 116]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164423', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 117]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164424', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 118]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164425', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 119]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164426', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 120]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164427', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 
'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 121]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164428', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 122]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164429', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 123]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164430', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 124]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164431', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 125]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164432', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 126]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164433', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 127]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164434', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 128]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164435', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 
'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 129]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164436', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 130]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164437', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 131]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164438', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 132]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164439', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 133]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164440', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 134]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164441', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 135]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164442', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 136]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164443', 'level': 'item'}, '')\n", "# of children: 
4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 137]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164444', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 138]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164445', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 139]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164446', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 140]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164447', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 141]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164448', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 142]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164449', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 143]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164450', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 144]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164451', 'level': 
'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 145]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164452', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 146]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164453', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 147]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164454', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 148]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164455', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 149]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164456', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 150]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164457', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 151]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164458', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 152]\n", "('c', {'audience': 
'internal', 'id': 'ASPO00164459', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 153]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164460', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 154]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164461', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 155]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164462', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 156]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164463', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 157]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164464', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 158]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164465', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 159]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164466', 'level': 'item'}, '')\n", "# of children: 5\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", 
"[2, 160]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164467', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 161]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164468', 'level': 'item'}, '')\n", "# of children: 4\n", "\n", "[('c', {'audience': 'internal', 'id': 'ASPO00164306', 'level': 'otherlevel', 'otherlevel': 'container'}, ''), ('dsc', {}, '')]\n", "[2, 162]\n", "('c', {'audience': 'internal', 'id': 'ASPO00164469', 'level': 'item'}, '')\n", "# of children: 5\n", "\n" ] } ], "source": [ "ii = 1\n", "level = 'otherlevel'\n", "test = allCs2[level][ii]\n", "toProc = traceElems(test['child'], isLeafOrC)\n", "\n", "# Commentare/scommentare per stampare qui / su file\n", "# (vedi anche in fondo alla cella)\n", "#provaFileName = 'out.txt'\n", "#orig_stdout = sys.stdout\n", "#fp = open(export_dir + provaFileName, 'w')\n", "#sys.stdout = fp\n", "# fino qui + in fondo\n", "\n", "print()\n", "print()\n", "print('Level:', level)\n", "print('#:', ii)\n", "print()\n", "for node in toProc:\n", " print(shownodelist(node['a_par']))\n", " print(node['coords'])\n", " print(shownode(node['child']))\n", " print('# of children:', len(node['child']))\n", " print()\n", "\n", "\n", "# Commentare/scommentare per stampare qui / su file\n", "#sys.stdout = orig_stdout\n", "#fp.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(*NOTA X ME:* **'did' = 'Descriptive IDentification'**)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A questo punto, quello che devo fare è scrivere un **traduttore** -- una funzione che scorra l'output degli elementi esaminati e trasformi le info in modo da poterle esportare in formato csv (o in qualunque altro formato vogliamo).\n", "\n", "La mia attuale versione di **traduttore per gli item** è data nella prossima cella; accetta come argomento un nodo (che è supposto essere di 
tipo item) e restituisce un dict." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "def traduttoreItem(elem):\n", " # Variabile che contiene l'output della traduzione:\n", " csvProt = {}\n", "\n", " # Processo i nodi-parent di 'elem'\n", " par_tags = list(map(lambda a: a.tag, elem['a_par']))\n", " par_attributes = list(map(lambda a: a.attrib, elem['a_par']))\n", " \n", " # e0: Le varie id dei nodi parent\n", " for ii in indices(par_tags, 'c'):\n", " key = 'id_' + par_attributes[ii]['level']\n", " csvProt[key] = par_attributes[ii]['id']\n", "\n", " # Processo i nodi-child di 'elem'\n", " toProc = traceElems(elem['child'], isLeafOrC)\n", " first = True\n", " for node in toProc:\n", " tags = list(map(lambda a: a.tag, node['a_par'])) + [node['child'].tag]\n", " attributes = list(map(lambda a: a.attrib, node['a_par'])) + [node['child'].attrib]\n", " content = node['child'].text\n", "\n", " # Da controllare solo per il primo nodo\n", " # (informazioni a livello del nodo, uguali per tutti i figli)\n", " if(first):\n", " # e1 ID della item\n", " csvProt['id'] = attributes[tags.index('c')]['id']\n", " # e2 Audience: external o internal\n", " try:\n", " csvProt['audience'] = attributes[tags.index('c')]['audience']\n", " except:\n", " pass\n", " # e3 Otherlevel\n", " try:\n", " csvProt['altro_livello'] = attributes[tags.index('c')]['otherlevel']\n", " except:\n", " pass\n", " first = False\n", "\n", " # La 'ciccia': si processa il contenuto vero e proprio\n", " # e4 Repository (qui dovrebbe essere sempre l'Archivio di Prato)\n", " if('repository' in tags):\n", " csvProt['repository'] = content \n", "\n", " \n", " # e8 Tipologia\n", " try:\n", " ii = tags.index('materialspec')\n", " if(attributes[ii]['label']=='tipologia'):\n", " csvProt['tipologia'] = content\n", " except:\n", " pass\n", " \n", " # e9 Segnature buste e registri 
Datini\n", " try:\n", " ii = tags.index('container')\n", " type1 = attributes[ii]['type']\n", " if(type1.find('numero un')>=0):\n", " csvProt['segnatura_registri_1'] = content\n", " elif(type1.find('numero sott')>=0):\n", " csvProt['segnatura_registri_2'] = content\n", " elif(type1=='busta'):\n", " csvProt['segnatura_busta'] = content\n", " elif(type1=='inserto'):\n", " csvProt['segnatura_inserto'] = content\n", " except:\n", " pass\n", " # e9 Code shelfmark\n", " try:\n", " ii = tags.index('num')\n", " type1 = attributes[ii]['type']\n", " if(type1=='chiave'):\n", " csvProt['segnatura_codice'] = content\n", " except:\n", " pass\n", "\n", " # e9 Subseries shelfmarks\n", " if (attributes[tags.index('c')]['level']=='subseries' or attributes[tags.index('c')]['level']=='file'):\n", " try:\n", " ii = tags.index('unitid')\n", " if(attributes[ii]['type']=='segnatura'):\n", " csvProt['segnatura_parent'] = content\n", " except:\n", " pass\n", " \n", " # e10 Title\n", " if('title' in tags):\n", " csvProt['titolo_originale'] = content \n", "\n", " # e12 Scope-content head & body\n", " if('scopecontent' in tags):\n", " if('list' not in tags and 'head' in tags):\n", " csvProt['scope-content_head'] = content\n", " else:\n", " if('p' in tags and 'num' not in tags):\n", " csvProt['scope-content_body'] = content\n", " if('num' in tags):\n", " try:\n", " ii = tags.index('num')\n", " if(attributes[ii]['type']):\n", " key = 'num_'+ attributes[ii]['type']\n", " csvProt[key] = content\n", " except:\n", " pass \n", " # e12 list of goods\n", " if('scopecontent' in tags):\n", " if('list' in tags):\n", " try:\n", " ii = tags.index('list')\n", " try:\n", " csvProt['lista'] = csvProt['lista'] + ' | ' + content\n", " except:\n", " csvProt['lista'] = content\n", " except:\n", " pass\n", " \n", " # e13 Origin\n", " try:\n", " ii = tags.index('origination')\n", " csvProt['origine'] = attributes[ii]['label'] + ': ' + content\n", " except:\n", " pass \n", " # e14 Company name\n", " if 
('unittitle' in tags):\n", " if ('corpname' in tags):\n", " try:\n", " ii = tags.index('corpname') \n", " if (attributes[ii].get('authfilenumber') or attributes[ii].get('role')=='compagnia'):\n", " try:\n", " authId = attributes[ii]['authfilenumber']\n", " csvProt['compagnia'] = '{nome: ' + content + ', authID: ' + authId + '}'\n", " except:\n", " csvProt['compagnia'] = '{nome: ' + content + '}'\n", " except:\n", " pass \n", " # e15 Subject \n", " try:\n", " aa = csvProt['soggetto']\n", " except:\n", " try:\n", " ii = tags.index('subject')\n", " try:\n", " csvProt['soggetto'] = str(node['a_par'][ii].text).replace('\\t','').replace('\\n','').strip()\n", " except:\n", " csvProt['soggetto'] = str(content).replace('\\t','').replace('\\n','').strip()\n", " except:\n", " pass\n", " \n", " \n", " # e17 Miscellaneous dates: all those with a defined 'type' + notes + normalization\n", " if ('unittitle' in tags):\n", " try:\n", " ii = tags.index('date')\n", " key = 'data_' + attributes[ii]['type']\n", " aa = csvProt[key]\n", " except:\n", " try:\n", " ii = tags.index('date')\n", " try:\n", " csvProt[key] = str(node['a_par'][ii].text).replace('\\t','').replace('\\n','').strip()\n", " try:\n", " ii = tags.index('emph')\n", " csvProt[key+'_note'] = content\n", " except:\n", " pass\n", " try:\n", " ii = tags.index('date')\n", " key = 'data_' + attributes[ii]['type'] + '_normalizzata'\n", " try:\n", " norm = attributes[ii]['normal']\n", " csvProt[key] = norm\n", " except:\n", " csvProt[key] = 'NOTNORMAL'\n", " except:\n", " pass\n", " except:\n", " csvProt[key] = str(content).replace('\\t','').replace('\\n','').strip()\n", " try:\n", " ii = tags.index('date')\n", " key = 'data_' + attributes[ii]['type'] + '_normalizzata'\n", " try:\n", " norm = attributes[ii]['normal']\n", " csvProt[key] = norm\n", " except:\n", " csvProt[key] = 'NOTNORMAL'\n", " except:\n", " pass\n", " except:\n", " pass \n", "\n", " # e18 Date 1: period + notes + normalization\n", " if('unitdate' in tags):\n", " 
csvProt['data_periodo'] = str(content).replace('\\t','').replace('\\n','').strip()\n", " try:\n", " ii = tags.index('emph')\n", " csvProt['data_periodo_note'] = content\n", " except:\n", " pass\n", " \n", " if('unitdate' in tags):\n", " try:\n", " ii = tags.index('unitdate')\n", " key = 'data_periodo_normalizzata'\n", " try:\n", " norm = attributes[ii]['normal']\n", " csvProt[key] = norm\n", " except:\n", " csvProt[key] = 'NOTNORMAL'\n", " except:\n", " pass\n", " \n", " # e21 Physdesc for subfonds\n", " if (attributes[tags.index('c')]['level']=='subfonds'):\n", " try:\n", " ii = tags.index('extent')\n", " try:\n", " csvProt['numero'] = csvProt['numero'] + ' | ' + content\n", " except:\n", " csvProt['numero'] = content\n", " except:\n", " pass\n", " try:\n", " ii = tags.index('genreform')\n", " try:\n", " csvProt['genere'] = csvProt['genere'] + ' | ' + content\n", " except:\n", " csvProt['genere'] = content\n", " except:\n", " pass \n", " # e21 Container (counts as extent only for subseries) \n", " if (attributes[tags.index('c')]['level']=='subseries'):\n", " try:\n", " ii = tags.index('container')\n", " csvProt['extent'] = content\n", " except:\n", " pass\n", "\n", " # e11 Title from unittitle\n", " if(attributes[tags.index('c')]['level']=='item'):\n", " if('unittitle' in tags): \n", " try:\n", " aa = csvProt['titolo_aspo']\n", " except:\n", " try:\n", " ii = tags.index('unittitle') \n", " try:\n", " for chi in node['a_par'][ii]:\n", " tails = \"\" + chi.tail \n", " csvProt['titolo_aspo'] = (node['a_par'][ii].text + tails).replace('\\t','').replace('\\n','').strip()\n", " except:\n", " pass \n", " except:\n", " pass\n", " \n", " # e11 Title from unittitle for SERIES\n", " if (attributes[tags.index('c')]['level']=='series'):\n", " if('unittitle' in tags): \n", " csvProt['titolo_aspo'] = content\n", "\n", " # e11 Title from unittitle for SUBSERIES\n", " if 
(attributes[tags.index('c')]['level']=='subseries'):\n", " if('unittitle' in tags):\n", " try:\n", " aa = csvProt['titolo_aspo']\n", " except:\n", " try:\n", " ii = tags.index('unittitle') \n", " try:\n", " for chi in node['a_par'][ii]:\n", " tails = \"\" + chi.tail \n", " csvProt['titolo_aspo'] = (node['a_par'][ii].text + tails).replace('\\t','').replace('\\n','').strip()\n", " except:\n", " csvProt['titolo_aspo'] = content \n", " except:\n", " pass\n", " \n", " # e11 Title from unittitle for OTHERLEVEL\n", " if (attributes[tags.index('c')]['level']=='otherlevel'):\n", " if('unittitle' in tags): \n", " csvProt['titolo_aspo'] = content \n", " \n", " # e11 Title from unittitle for FONDS\n", " if (attributes[tags.index('c')]['level']=='fonds'):\n", " if('unittitle' in tags): \n", " csvProt['titolo_aspo'] = content \n", " \n", " # e11 Title from unittitle for FILE\n", " if (attributes[tags.index('c')]['level']=='file'):\n", " if('unittitle' in tags): \n", " csvProt['titolo_aspo'] = content \n", " \n", " # e11 Title from unittitle for COLLECTION\n", " if (attributes[tags.index('c')]['level']=='collection'):\n", " if('unittitle' in tags): \n", " csvProt['titolo_aspo'] = content\n", " \n", " \n", " # e11 Title from unittitle for SUBFONDS\n", " if (attributes[tags.index('c')]['level']=='subfonds'):\n", " if('unittitle' in tags):\n", " try:\n", " aa = csvProt['titolo_aspo']\n", " except:\n", " try:\n", " ii = tags.index('unittitle') \n", " try:\n", " for chi in node['a_par'][ii]:\n", " tails = \"\" + chi.tail \n", " csvProt['titolo_aspo'] = (node['a_par'][ii].text + tails).replace('\\t','').replace('\\n','').strip()\n", " except:\n", " csvProt['titolo_aspo'] = content \n", " except:\n", " pass \n", "\n", "\n", " \n", " return csvProt\n", "\n", "\n", "# Alongside the function, I define a dict containing all the headers;\n", "# it will be used for the CSV.\n", "itemHeader = OrderedDict()\n", "\n", "# e1 ID of the entity\n", "itemHeader.update({'id': ''})\n", "\n", "# 
e2 Audience: external or internal\n", "itemHeader.update({'audience': ''})\n", "\n", "# e3 Otherlevel\n", "itemHeader.update({'altro_livello': ''})\n", "\n", "# e4 Repository (here it should always be the Archivio di Prato)\n", "itemHeader.update({'repository': '#'})\n", "\n", "# e5 Bioghist\n", "itemHeader.update({'bioghist': ''})\n", "\n", "# e6 Arrangement\n", "itemHeader.update({'arrangement': ''})\n", "\n", "# e7 Related Material\n", "itemHeader.update({'relatedmaterial': ''})\n", "\n", "# e8 Typology\n", "itemHeader.update({'tipologia': '#'})\n", "\n", "# e9 Shelfmarks for Datini folders (buste) and registers\n", "itemHeader.update(\n", "{'segnatura_registri_1': '#',\n", " 'segnatura_registri_2': '#',\n", " 'segnatura_inserto': '#',\n", " 'segnatura_busta': '#'})\n", "\n", "# e9 Code shelfmark\n", "itemHeader.update({'segnatura_codice': '#'})\n", "\n", "# e9 Subseries shelfmark\n", "itemHeader.update(\n", "{'segnatura_parent': '#'})\n", "\n", "# e10 Original title\n", "itemHeader.update({'titolo_originale': '#'})\n", "\n", "# e11 ASPO title\n", "itemHeader.update({'titolo_aspo': '<unittitle>#'})\n", "\n", "# e12 Scope content, head & body, and number of attachments\n", "itemHeader.update(\n", "{'scope-content_head': '<scopecontent><head>#',\n", " 'scope-content_body': '<scopecontent><p>#', \n", " 'num_allegati': '<num>#'})\n", "# e12 list of goods\n", "itemHeader.update({'lista': '<list>#'})\n", "\n", "# e13 Origin\n", "itemHeader.update({'origine': '<origination label=#1>#2, #1 - #2'})\n", "\n", "# e14 Company name\n", "itemHeader.update({'compagnia': '<corpname>#'})\n", "\n", "# e15 Subject\n", "itemHeader.update({'soggetto': '<subject>#'})\n", "itemHeader.update({'tipologia_soggetto': '<emph>#'})\n", "\n", "# e16 Person + role\n", "itemHeader.update(\n", "{'persona_tenutario': '<persname role=\"tenutario\">#', \n", " 'persona_destinatario': '<persname role=\"destinatario\">#',\n", " 'persona_mittente': '<persname role=\"mittente\">#',\n", " 'persona_indirizzata': 
'<persname role=\"indirizzata\">#',\n", " 'persona_mano': '<persname role=\"mano\">#',})\n", "\n", "# e17 Miscellaneous dates: all those with a defined 'type' + notes\n", "itemHeader.update({'data_inizio': '<date type=\"inizio\">#', 'data_inizio_note': '<emph>#', 'data_inizio_normalizzata': '<date>#',\n", " 'data_fine': '<date type=\"fine\">#', 'data_fine_note': '<emph>#', 'data_fine_normalizzata': '<date>#',\n", " 'data_chiusura': '<date type=\"chiusura\">#', 'data_chiusura_note': '<emph>#' , 'data_chiusura_normalizzata': '<date>#' })\n", "\n", "# e18 Date 1: period\n", "itemHeader.update({'data_periodo': '<unitdate>#' , 'data_periodo_note': '<unitdate>#' ,'data_periodo_normalizzata': '<unitdate>#' })\n", "\n", "# e19 Place + 'role'\n", "itemHeader.update(\n", "{\"luogo_partenza\": '<geogname role=\"partenza\">#',\n", " \"luogo_arrivo\": '<geogname role=\"arrivo\">#',\n", " \"luogo_luogo\": '<geogname role=\"luogo\">#'})\n", "\n", "# e20 Physical medium\n", "itemHeader.update({'supporto': '<physfacet type=\"supporto\">#'})\n", "\n", "# e21 physical description\n", "itemHeader.update({'descrizione_fisica': '<physdesc>#'})\n", "itemHeader.update({'numero': '<extent>#'})\n", "# container counts as extent only for subseries\n", "itemHeader.update({'extent': '<container>#'})\n", "itemHeader.update({'genere': '<genreform>#'})\n", "\n", "# e22 phystech\n", "itemHeader.update({'conservazione': '<phystech>#'})\n", "\n", "# e23 Size (consistenza)\n", "itemHeader.update({'consistenza': '<extent unit=#1>#2, #1: #2'})\n", "\n", "# e24 Notes\n", "itemHeader.update({'nota': '<note>#'})\n", "\n", "# e25 Odd\n", "itemHeader.update({'altre_informazioni': '<odd>#'})\n", "\n", "# e26 Attached digital object (name)\n", "itemHeader.update({'oggetto_digitale': '<daoloc title=#>'})\n", "\n", "# e0: The ids of the parent nodes\n", "itemHeader.update(\n", "{'id_subfonds': '<c level=\"subfonds\" id=#>',\n", " 'id_fonds': '<c level=\"fonds\" id=#>',\n", " 'id_series': '<c level=\"series\" id=#>',\n", " 
'id_subseries': '<c level=\"subseries\" id=#>',\n", " 'id_file': '<c level=\"file\" id=#>',\n", " 'id_otherlevel': '<c level=\"otherlevel\" id=#>',\n", " 'id_collection': '<c level=\"collection\" id=#>'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Testing the translator function\n", "\n", "**NB:** I defined it with the items in mind, but it seems to work reasonably well on the other levels too!" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "\n", "id_fonds: ASPO00000000\n", "\n", "id: ASPO00000001\n", "\n", "audience: external\n", "\n", "data_periodo: 1363 - 1416\n", "\n", "data_periodo_normalizzata: 13630101-14161231\n", "\n", "titolo_aspo: Fondaco di Avignone\n", "\n", "numero: 1506 | 284\n", "\n", "genere: lettere | documenti contabili\n", "\n" ] } ], "source": [ "# randint is inclusive at both ends, hence the -1\n", "jj = randint(0, len(allCs2['subfonds']) - 1)\n", "# Pin the index so that the test is reproducible\n", "jj = 0\n", "print(jj)\n", "print()\n", "\n", "test = allCs2['subfonds'][jj]\n", "toShow = traduttoreItem(test)\n", "for key in toShow.keys():\n", "    print(key + ': ' + str(toShow[key]))\n", "    print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Export\n", "\n", "We produce the CSV for the items, timing the run as usual." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tempo trascorso: 46.13993215560913\n" ] } ], "source": [ "# Do it! 
CSV export - items.\n", "\n", "ts1 = datetime.timestamp(datetime.now())\n", "\n", "# Open the file for the export\n", "with open(export_dir + \"data_item_data.csv\", \"w\", newline=\"\") as csv_file:\n", "    # Set up the DictWriter that drives the export\n", "    writer = csv.DictWriter(csv_file, fieldnames=list(itemHeader.keys()))\n", "    # Write the header row\n", "    writer.writeheader()\n", "    # Write the second, explanatory row\n", "    writer.writerow(itemHeader)\n", "    # Write the translated items, one by one\n", "    for ii in range(len(allCs2['item'])):\n", "        test = allCs2['item'][ii]\n", "        writer.writerow(traduttoreItem(test))\n", "\n", "print('Tempo trascorso:', datetime.timestamp(datetime.now()) - ts1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Other levels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I define a reduced dictionary for the *subseries* header, then export -- for the moment with the same translator used for the *items*" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tempo trascorso: 0.8612298965454102\n" ] } ], "source": [ "ts1 = datetime.timestamp(datetime.now())\n", "\n", "subSeriesKeys = set()\n", "for ii in range(len(allCs2['subseries'])):\n", "    test = allCs2['subseries'][ii]\n", "    subSeriesKeys = subSeriesKeys.union( traduttoreItem(test).keys() )\n", "\n", "subSeriesHeader = OrderedDict()\n", "for key in itemHeader:\n", "    if(key in subSeriesKeys):\n", "        subSeriesHeader[key] = itemHeader[key]\n", "\n", "\n", "with open(export_dir + \"data_subseries_data.csv\", \"w\", newline=\"\") as csv_file:\n", "    writer = csv.DictWriter(csv_file, fieldnames=list(subSeriesHeader.keys()))\n", "    writer.writeheader()\n", "    writer.writerow(subSeriesHeader)\n", "    for ii in range(len(allCs2['subseries'])):\n", "        test = allCs2['subseries'][ii]\n", "        writer.writerow(traduttoreItem(test))\n", "\n", "print('Tempo trascorso:', 
datetime.timestamp(datetime.now()) - ts1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Rinse & Repeat* with the remaining levels: *series*, *subfonds*, *fonds*, *file*, *collection* and *otherlevel*" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tempo trascorso: 0.1117551326751709\n" ] } ], "source": [ "ts1 = datetime.timestamp(datetime.now())\n", "\n", "seriesKeys = set()\n", "for ii in range(len(allCs2['series'])):\n", "    test = allCs2['series'][ii]\n", "    seriesKeys = seriesKeys.union( traduttoreItem(test).keys() )\n", "\n", "seriesHeader = OrderedDict()\n", "for key in itemHeader:\n", "    if(key in seriesKeys):\n", "        seriesHeader[key] = itemHeader[key]\n", "\n", "\n", "with open(export_dir + \"data_series_data.csv\", \"w\", newline=\"\") as csv_file:\n", "    writer = csv.DictWriter(csv_file, fieldnames=list(seriesHeader.keys()))\n", "    writer.writeheader()\n", "    writer.writerow(seriesHeader)\n", "    for ii in range(len(allCs2['series'])):\n", "        test = allCs2['series'][ii]\n", "        writer.writerow(traduttoreItem(test))\n", "\n", "print('Tempo trascorso:', datetime.timestamp(datetime.now()) - ts1)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tempo trascorso: 0.021094083786010742\n" ] } ], "source": [ "ts1 = datetime.timestamp(datetime.now())\n", "\n", "subfondsKeys = set()\n", "for ii in range(len(allCs2['subfonds'])):\n", "    test = allCs2['subfonds'][ii]\n", "    subfondsKeys = subfondsKeys.union( traduttoreItem(test).keys() )\n", "\n", "subfondsHeader = OrderedDict()\n", "for key in itemHeader:\n", "    if(key in subfondsKeys):\n", "        subfondsHeader[key] = itemHeader[key]\n", "\n", "\n", "with open(export_dir + \"data_subfonds_data.csv\", \"w\", newline=\"\") as csv_file:\n", "    writer = csv.DictWriter(csv_file, fieldnames=list(subfondsHeader.keys()))\n", "    writer.writeheader()\n", "    writer.writerow(subfondsHeader)\n", "    for ii in 
range(len(allCs2['subfonds'])):\n", "        test = allCs2['subfonds'][ii]\n", "        writer.writerow(traduttoreItem(test))\n", "\n", "print('Tempo trascorso:', datetime.timestamp(datetime.now()) - ts1)" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tempo trascorso: 0.00766301155090332\n" ] } ], "source": [ "ts1 = datetime.timestamp(datetime.now())\n", "\n", "fondsKeys = set()\n", "for ii in range(len(allCs2['fonds'])):\n", "    test = allCs2['fonds'][ii]\n", "    fondsKeys = fondsKeys.union( traduttoreItem(test).keys() )\n", "\n", "fondsHeader = OrderedDict()\n", "for key in itemHeader:\n", "    if(key in fondsKeys):\n", "        fondsHeader[key] = itemHeader[key]\n", "\n", "\n", "with open(export_dir + \"data_fonds.csv\", \"w\", newline=\"\") as csv_file:\n", "    writer = csv.DictWriter(csv_file, fieldnames=list(fondsHeader.keys()))\n", "    writer.writeheader()\n", "    writer.writerow(fondsHeader)\n", "    for ii in range(len(allCs2['fonds'])):\n", "        test = allCs2['fonds'][ii]\n", "        writer.writerow(traduttoreItem(test))\n", "\n", "print('Tempo trascorso:', datetime.timestamp(datetime.now()) - ts1)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tempo trascorso: 16.090816020965576\n" ] } ], "source": [ "ts1 = datetime.timestamp(datetime.now())\n", "\n", "fileKeys = set()\n", "for ii in range(len(allCs2['file'])):\n", "    test = allCs2['file'][ii]\n", "    fileKeys = fileKeys.union( traduttoreItem(test).keys() )\n", "\n", "fileHeader = OrderedDict()\n", "for key in itemHeader:\n", "    if(key in fileKeys):\n", "        fileHeader[key] = itemHeader[key]\n", "\n", "\n", "with open(export_dir + \"data_file_data.csv\", \"w\", newline=\"\") as csv_file:\n", "    writer = csv.DictWriter(csv_file, fieldnames=list(fileHeader.keys()))\n", "    writer.writeheader()\n", "    writer.writerow(fileHeader)\n", "    for ii in range(len(allCs2['file'])):\n", "        test = 
allCs2['file'][ii]\n", "        writer.writerow(traduttoreItem(test))\n", "\n", "print('Tempo trascorso:', datetime.timestamp(datetime.now()) - ts1)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tempo trascorso: 0.03026413917541504\n" ] } ], "source": [ "ts1 = datetime.timestamp(datetime.now())\n", "\n", "collectionKeys = set()\n", "for ii in range(len(allCs2['collection'])):\n", "    test = allCs2['collection'][ii]\n", "    collectionKeys = collectionKeys.union( traduttoreItem(test).keys() )\n", "\n", "collectionHeader = OrderedDict()\n", "for key in itemHeader:\n", "    if(key in collectionKeys):\n", "        collectionHeader[key] = itemHeader[key]\n", "\n", "\n", "with open(export_dir + \"data_collection_data.csv\", \"w\", newline=\"\") as csv_file:\n", "    writer = csv.DictWriter(csv_file, fieldnames=list(collectionHeader.keys()))\n", "    writer.writeheader()\n", "    writer.writerow(collectionHeader)\n", "    for ii in range(len(allCs2['collection'])):\n", "        test = allCs2['collection'][ii]\n", "        writer.writerow(traduttoreItem(test))\n", "\n", "print('Tempo trascorso:', datetime.timestamp(datetime.now()) - ts1)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tempo trascorso: 0.2395188808441162\n" ] } ], "source": [ "ts1 = datetime.timestamp(datetime.now())\n", "\n", "otherlevelKeys = set()\n", "for ii in range(len(allCs2['otherlevel'])):\n", "    test = allCs2['otherlevel'][ii]\n", "    otherlevelKeys = otherlevelKeys.union( traduttoreItem(test).keys() )\n", "\n", "otherlevelHeader = OrderedDict()\n", "for key in itemHeader:\n", "    if(key in otherlevelKeys):\n", "        otherlevelHeader[key] = itemHeader[key]\n", "\n", "\n", "with open(export_dir + \"data_otherlevel_date.csv\", \"w\", newline=\"\") as csv_file:\n", "    writer = csv.DictWriter(csv_file, fieldnames=list(otherlevelHeader.keys()))\n", "    writer.writeheader()\n", "    
writer.writerow(otherlevelHeader)\n", "    for ii in range(len(allCs2['otherlevel'])):\n", "        test = allCs2['otherlevel'][ii]\n", "        writer.writerow(traduttoreItem(test))\n", "\n", "print('Tempo trascorso:', datetime.timestamp(datetime.now()) - ts1)" ] } ], "metadata": { "interpreter": { "hash": "397704579725e15f5c7cb49fe5f0341eb7531c82d19f2c29d197e8b64ab5776b" }, "kernelspec": { "display_name": "Python 3.9.0 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.0" }, "metadata": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 4 }