Fonduer

Latest version: v0.8.3

0.8.3
-------------------

Added
^^^^^
* `YasushiMiyata`_: Add :func:`get_max_row_num` to ``fonduer.utils.data_model_utils.tabular``.
(`469 <https://github.com/HazyResearch/fonduer/issues/469>`_)
(`480 <https://github.com/HazyResearch/fonduer/pull/480>`_)
* `HiromuHota`_: Add get_bbox() to :class:`Sentence` and :class:`SpanMention`.
(`429 <https://github.com/HazyResearch/fonduer/pull/429>`_)
* `HiromuHota`_: Add a custom MLflow model that allows you to package a Fonduer model.
See `here <../user/packaging.html>`_ for how to use it.
(`259 <https://github.com/HazyResearch/fonduer/issues/259>`_)
(`407 <https://github.com/HazyResearch/fonduer/pull/407>`_)
* `HiromuHota`_: Support spaCy v2.2.
(`384 <https://github.com/HazyResearch/fonduer/issues/384>`_)
(`432 <https://github.com/HazyResearch/fonduer/pull/432>`_)
* `wajdikhattel`_: Add multinary candidates.
(`455 <https://github.com/HazyResearch/fonduer/issues/455>`_)
(`456 <https://github.com/HazyResearch/fonduer/pull/456>`_)
* `HiromuHota`_: Add ``nullables`` to :func:`candidate_subclass()` to allow NULL mention in a candidate.
(`496 <https://github.com/HazyResearch/fonduer/issues/496>`_)
(`497 <https://github.com/HazyResearch/fonduer/pull/497>`_)
* `HiromuHota`_: Copy textual functions in :mod:`data_model_utils.tabular` to :mod:`data_model_utils.textual`.
(`503 <https://github.com/HazyResearch/fonduer/issues/503>`_)
(`505 <https://github.com/HazyResearch/fonduer/pull/505>`_)

Changed
^^^^^^^
* `YasushiMiyata`_: Enable `RegexMatchSpan` to match against words concatenated with a separator, set by the sep="(separator)" option.
(`270 <https://github.com/HazyResearch/fonduer/issues/270>`_)
(`492 <https://github.com/HazyResearch/fonduer/pull/492>`_)
* `HiromuHota`_: Enabled "Type hints (PEP 484) support for the Sphinx autodoc extension."
(`421 <https://github.com/HazyResearch/fonduer/pull/421>`_)
* `HiromuHota`_: Switched the Cython wrapper for Mecab from mecab-python3 to fugashi.
Since the Japanese tokenizer remains the same, there should be no impact on users.
(`384 <https://github.com/HazyResearch/fonduer/issues/384>`_)
(`432 <https://github.com/HazyResearch/fonduer/pull/432>`_)
* `HiromuHota`_: Log a stack trace on parsing error for better debug experience.
(`478 <https://github.com/HazyResearch/fonduer/issues/478>`_)
(`479 <https://github.com/HazyResearch/fonduer/pull/479>`_)
* `HiromuHota`_: :func:`get_cell_ngrams` and :func:`get_neighbor_cell_ngrams` yield
nothing when the mention is not tabular.
(`471 <https://github.com/HazyResearch/fonduer/issues/471>`_)
(`504 <https://github.com/HazyResearch/fonduer/pull/504>`_)
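
The ``sep`` option described above can be pictured as joining the words of a span with a separator before the regular expression is applied. The following is a minimal sketch of that idea in plain Python, not Fonduer's implementation; the helper name ``matches_joined_words`` and its parameters are hypothetical.

```python
import re
from typing import Sequence


def matches_joined_words(words: Sequence[str], pattern: str, sep: str = "") -> bool:
    """Return True if the words, joined by `sep`, fully match `pattern`.

    Hypothetical helper for illustration only; it is not part of Fonduer.
    """
    joined = sep.join(words)
    return re.fullmatch(pattern, joined) is not None


words = ["BC", "546"]
# Joined with a space, the text is "BC 546", which matches the pattern.
print(matches_joined_words(words, r"BC \d+", sep=" "))
# Joined without a separator, the text is "BC546", which does not.
print(matches_joined_words(words, r"BC \d+", sep=""))
```

The choice of separator decides whether a pattern written with spaces can match a span that covers several words.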

Deprecated
^^^^^^^^^^
* `HiromuHota`_: Deprecated :func:`bbox_from_span` and :func:`bbox_from_sentence`.
(`429 <https://github.com/HazyResearch/fonduer/pull/429>`_)
* `HiromuHota`_: Deprecated :func:`visualizer.get_box` in favor of :func:`span.get_bbox()`.
(`445 <https://github.com/HazyResearch/fonduer/issues/445>`_)
(`446 <https://github.com/HazyResearch/fonduer/pull/446>`_)
* `HiromuHota`_: Deprecate textual functions in :mod:`data_model_utils.tabular`.
(`503 <https://github.com/HazyResearch/fonduer/issues/503>`_)
(`505 <https://github.com/HazyResearch/fonduer/pull/505>`_)
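
A deprecation like the ones listed above is commonly implemented by keeping the old function as a thin wrapper that warns and delegates to its replacement. The sketch below shows that general pattern with the standard ``warnings`` module; the ``Span`` class here is a stand-in, and nothing in it is taken from Fonduer's source.

```python
import warnings


class Span:
    """Stand-in for a span-like object; not Fonduer's class."""

    def __init__(self, bbox):
        self._bbox = bbox

    def get_bbox(self):
        """New-style accessor that callers should migrate to."""
        return self._bbox


def get_box(span):
    """Old-style helper: emit a DeprecationWarning, then delegate."""
    warnings.warn(
        "get_box(span) is deprecated; use span.get_bbox() instead.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return span.get_bbox()
```

Existing callers keep working during the deprecation period, while the warning tells them which call to switch to.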

Fixed
^^^^^
* `senwu`_: Fix the issue that pdf_path could not be given without a trailing slash.
(`442 <https://github.com/HazyResearch/fonduer/issues/442>`_)
(`459 <https://github.com/HazyResearch/fonduer/pull/459>`_)
* `kaikun213`_: Fix bug in table range difference calculations.
(`420 <https://github.com/HazyResearch/fonduer/pull/420>`_)
* `HiromuHota`_: mention_extractor.apply with clear=True now works even if it's not the first run.
(`424 <https://github.com/HazyResearch/fonduer/pull/424>`_)
* `HiromuHota`_: Fix :func:`get_horz_ngrams` and :func:`get_vert_ngrams` so that they
work even when the input mention is not tabular.
(`425 <https://github.com/HazyResearch/fonduer/issues/425>`_)
(`426 <https://github.com/HazyResearch/fonduer/pull/426>`_)
* `HiromuHota`_: Fix the order of args to Bbox.
(`443 <https://github.com/HazyResearch/fonduer/issues/443>`_)
(`444 <https://github.com/HazyResearch/fonduer/pull/444>`_)
* `HiromuHota`_: Fix the non-deterministic behavior in VisualLinker.
(`412 <https://github.com/HazyResearch/fonduer/issues/412>`_)
(`458 <https://github.com/HazyResearch/fonduer/pull/458>`_)
* `HiromuHota`_: Fix an issue that the progress bar shows no progress on preprocessing
by executing preprocessing and parsing in parallel.
(`439 <https://github.com/HazyResearch/fonduer/pull/439>`_)
* `HiromuHota`_: Adapt to mlflow>=1.9.0.
(`461 <https://github.com/HazyResearch/fonduer/issues/461>`_)
(`463 <https://github.com/HazyResearch/fonduer/pull/463>`_)
* `HiromuHota`_: Correct the entity type for NumberMatcher from "NUMBER" to "CARDINAL".
(`473 <https://github.com/HazyResearch/fonduer/issues/473>`_)
(`477 <https://github.com/HazyResearch/fonduer/pull/477>`_)
* `HiromuHota`_: Fix :func:`_get_axis_ngrams` not to return ``None`` when the input is not tabular.
(`481 <https://github.com/HazyResearch/fonduer/pull/481>`_)
* `HiromuHota`_: Fix :func:`Visualizer.display_candidates` not to draw rectangles on wrong pages.
(`488 <https://github.com/HazyResearch/fonduer/pull/488>`_)
* `HiromuHota`_: Persist doc only when no error happens during parsing.
(`489 <https://github.com/HazyResearch/fonduer/issues/489>`_)
(`490 <https://github.com/HazyResearch/fonduer/pull/490>`_)
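
The trailing-slash fix for ``pdf_path`` above reflects a common pitfall: building file paths by string concatenation only works when the directory ends with a separator. The sketch below shows one way to avoid that, assuming ``pdf_path`` names a directory; the helper ``pdf_file_for`` is hypothetical and is not Fonduer's code.

```python
import os


def pdf_file_for(pdf_path: str, document_name: str) -> str:
    """Build the PDF file path for a document stored under `pdf_path`.

    Hypothetical helper: os.path.join adds a separator only when one is
    needed, and os.path.normpath makes the result uniform.
    """
    return os.path.normpath(os.path.join(pdf_path, document_name + ".pdf"))


# Both spellings of the directory resolve to the same file path.
print(pdf_file_for("data/pdf", "doc1"))
print(pdf_file_for("data/pdf/", "doc1"))
```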

0.8.2
-------------------

Deprecated
^^^^^^^^^^

* `HiromuHota`_: Use of undecorated labeling functions is deprecated and will not be supported as of v0.9.0.
Please decorate them with ``snorkel.labeling.labeling_function``.

Fixed
^^^^^
* `HiromuHota`_: Labeling functions can now be decorated with ``snorkel.labeling.labeling_function``.
(`400 <https://github.com/HazyResearch/fonduer/issues/400>`_)
(`401 <https://github.com/HazyResearch/fonduer/pull/401>`_)

0.8.1
-------------------

Added
^^^^^
* `senwu`_: Add `mode` argument in create_task to support `STL` and `MTL`.

.. note::
    Fonduer has a new `mode` argument to support switching between different learning modes
    (e.g., STL or MTL). Example usage:

    .. code:: python

        # Create task for each relation.
        tasks = create_task(
            task_names=TASK_NAMES,
            n_arities=N_ARITIES,
            n_features=N_FEATURES,
            n_classes=N_CLASSES,
            emb_layer=EMB_LAYER,
            model="LogisticRegression",
            mode=MODE,
        )

0.8.0
-------------------

Changed
^^^^^^^
* `senwu`_: Switch to Emmental as the default learning engine.

.. note::
    Rather than maintaining a separate learning engine, we switch to Emmental,
    a deep learning framework for multi-task learning. Switching to a more general
    learning framework allows Fonduer to support more applications and
    multi-task learning. Example usage:

    .. code:: python

        # With Emmental, you need to do the following steps to perform learning:
        # 1. Create a task for each relation and an EmmentalModel to learn those tasks.
        # 2. Wrap candidates into EmmentalDataLoader for training.
        # 3. Training and inference (prediction).

        import emmental
        import numpy as np
        from emmental.data import EmmentalDataLoader
        from emmental.learner import EmmentalLearner
        from emmental.model import EmmentalModel
        from emmental.modules.embedding_module import EmbeddingModule

        import fonduer
        from fonduer.learning.dataset import FonduerDataset
        from fonduer.learning.task import create_task
        from fonduer.learning.utils import collect_word_counter

        # Collect the word counter from candidates, which is used in the LSTM model.
        word_counter = collect_word_counter(train_cands)

        # Initialize Emmental. To customize Emmental, see:
        # https://emmental.readthedocs.io/en/latest/user/config.html
        emmental.init(fonduer.Meta.log_path)

        # 1. Create a task for each relation and an EmmentalModel to learn those tasks.

        # Generate special tokens which are used for the LSTM model to locate mentions.
        # In the LSTM model, we pad the sentence with special tokens to help the LSTM
        # learn those mentions. Example:
        # Original sentence: Then Barack married Michelle.
        # -> Then ~~[[1 Barack 1]]~~ married ~~[[2 Michelle 2]]~~.
        arity = 2
        special_tokens = []
        for i in range(arity):
            special_tokens += [f"~~[[{i}", f"{i}]]~~"]

        # Generate the word embedding module for the LSTM.
        emb_layer = EmbeddingModule(
            word_counter=word_counter, word_dim=300, specials=special_tokens
        )

        # Create a task for each relation.
        tasks = create_task(
            ATTRIBUTE,
            2,
            F_train[0].shape[1],
            2,
            emb_layer,
            model="LogisticRegression",
        )

        # Create an Emmental model to learn the tasks.
        model = EmmentalModel(name=f"{ATTRIBUTE}_task")

        # Add the tasks to the model.
        for task in tasks:
            model.add_task(task)

        # 2. Wrap candidates into EmmentalDataLoader for training.

        # Use only the samples that have labels: filter out the samples whose
        # marginals are not significant.
        diffs = train_marginals.max(axis=1) - train_marginals.min(axis=1)
        train_idxs = np.where(diffs > 1e-6)[0]

        # Create a dataloader with weakly supervised samples to learn the model.
        train_dataloader = EmmentalDataLoader(
            task_to_label_dict={ATTRIBUTE: "labels"},
            dataset=FonduerDataset(
                ATTRIBUTE,
                train_cands[0],
                F_train[0],
                emb_layer.word2id,
                train_marginals,
                train_idxs,
            ),
            split="train",
            batch_size=100,
            shuffle=True,
        )

        # Create the test dataloader to do prediction.
        test_dataloader = EmmentalDataLoader(
            task_to_label_dict={ATTRIBUTE: "labels"},
            dataset=FonduerDataset(
                ATTRIBUTE, test_cands[0], F_test[0], emb_layer.word2id, 2
            ),
            split="test",
            batch_size=100,
            shuffle=False,
        )

        # 3. Training and inference (prediction).

        # Learn the tasks.
        emmental_learner = EmmentalLearner()
        emmental_learner.learn(model, [train_dataloader])

        # Predict based on the learned model.
        test_preds = model.predict(test_dataloader, return_preds=True)
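
The filtering step in the example above keeps only the training samples whose marginals are not flat, since a row with equal probability for every class carries no label signal. The following is a minimal sketch of that rule in plain Python, with lists in place of a NumPy array; the helper name ``confident_indices`` and the sample data are hypothetical.

```python
from typing import List, Sequence


def confident_indices(
    marginals: Sequence[Sequence[float]], eps: float = 1e-6
) -> List[int]:
    """Return indices of rows whose largest and smallest values differ by more than eps."""
    return [i for i, row in enumerate(marginals) if max(row) - min(row) > eps]


train_marginals = [
    [0.9, 0.1],  # clear preference for class 0
    [0.5, 0.5],  # uniform: no label signal, so it is dropped
    [0.2, 0.8],  # clear preference for class 1
]
print(confident_indices(train_marginals))  # [0, 2]
```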

* `HiromuHota`_: Change ABSTAIN to -1 to be compatible with Snorkel of 0.9.X.
Accordingly, user-defined labels should now be 0-indexed (used to be
1-indexed).
(`310 <https://github.com/HazyResearch/fonduer/issues/310>`_)
(`320 <https://github.com/HazyResearch/fonduer/pull/320>`_)
* `HiromuHota`_: Use executemany_mode="batch" instead of deprecated use_batch_mode=True.
(`358 <https://github.com/HazyResearch/fonduer/issues/358>`_)
* `HiromuHota`_: Use tqdm.notebook.tqdm instead of deprecated tqdm.tqdm_notebook.
(`360 <https://github.com/HazyResearch/fonduer/issues/360>`_)
* `HiromuHota`_: To support ImageMagick7, expand the version range of Wand.
(`373 <https://github.com/HazyResearch/fonduer/pull/373>`_)
* `HiromuHota`_: Comply with PEP 561 for type-checking code that uses Fonduer.
* `HiromuHota`_: Make UDF.apply of all child classes unaware of the database backend,
meaning PostgreSQL is not required if UDF.apply is directly used instead of UDFRunner.apply.
(`316 <https://github.com/HazyResearch/fonduer/issues/316>`_)
(`368 <https://github.com/HazyResearch/fonduer/pull/368>`_)
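
The ABSTAIN change above means that labels written for earlier versions have to be shifted. The sketch below shows one possible conversion, under the assumption (not stated in this changelog) that the old convention used 0 for abstain with classes starting at 1; the constant and function names are hypothetical.

```python
from typing import List

OLD_ABSTAIN = 0   # assumed pre-0.8.0 convention: 0 = abstain, classes start at 1
NEW_ABSTAIN = -1  # convention from 0.8.0 on: -1 = abstain, classes start at 0


def to_zero_indexed(old_labels: List[int]) -> List[int]:
    """Map old-style labels to the new convention."""
    return [NEW_ABSTAIN if y == OLD_ABSTAIN else y - 1 for y in old_labels]


print(to_zero_indexed([0, 1, 2, 2, 0]))  # [-1, 0, 1, 1, -1]
```

Labeling functions themselves need the same shift: any constant they return for a class or for abstaining has to follow the new numbering.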

Fixed
^^^^^
* `senwu`_: Fix mention extraction to return mention classes instead of data model
classes.

0.7.1
-------------------

Added
^^^^^
* `senwu`_: Refactor `Featurization` to support user-defined feature
extractors and rename the existing feature extractors to match the paper.

.. note::

    Rather than using a fixed multimodal feature library alone, we have added an
    interface for users to provide customized feature extractors. Please see our
    full documentation for details.

    .. code:: python

        from fonduer.features import Featurizer, FeatureExtractor

        # Example feature extractor
        def feat_ext(candidates):
            for candidate in candidates:
                yield candidate.id, f"{candidate.id}", 1

        feature_extractors = FeatureExtractor(customize_feature_funcs=[feat_ext])
        featurizer = Featurizer(session, [PartTemp], feature_extractors=feature_extractors)

    Rather than:

    .. code:: python

        from fonduer.features import Featurizer

        featurizer = Featurizer(session, [PartTemp])
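
The example extractor above is a generator that yields one triple per candidate. Read from the example, the triple appears to be (candidate id, feature name, feature value). The sketch below runs the same function against a stand-in candidate class so the output shape can be seen without a database; ``FakeCandidate`` is hypothetical and models only the ``id`` attribute.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class FakeCandidate:
    """Stand-in for a Fonduer candidate; only `id` is modeled here."""

    id: int


def feat_ext(candidates: Iterable[FakeCandidate]) -> Iterator[Tuple[int, str, int]]:
    """Yield one (candidate id, feature name, value) triple per candidate."""
    for candidate in candidates:
        yield candidate.id, f"{candidate.id}", 1


rows: List[Tuple[int, str, int]] = list(feat_ext([FakeCandidate(1), FakeCandidate(2)]))
print(rows)  # [(1, '1', 1), (2, '2', 1)]
```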

* `HiromuHota`_: Add page argument to get_pdf_dim in case pages have different dimensions.
* `HiromuHota`_: Add Labeler#upsert_keys.
* `HiromuHota`_: Add `vizlink` as an argument to `Parser` to be able to plug a custom visual linker.
Unless otherwise specified, `VisualLinker` will be used by default.

.. note::

    Example usage:

    .. code:: python

        from typing import Iterable

        from fonduer.parser.models import Sentence
        from fonduer.parser.visual_linker import VisualLinker

        class CustomVisualLinker(VisualLinker):
            def __init__(self):
                """Your code"""

            def link(self, document_name: str, sentences: Iterable[Sentence], pdf_path: str) -> Iterable[Sentence]:
                """Your code"""

            def is_linkable(self, filename: str) -> bool:
                """Your code"""

        from fonduer.parser import Parser
        parser = Parser(session, vizlink=CustomVisualLinker())

* `HiromuHota`_: Add `LingualParser`, which any lingual parser like `Spacy` should inherit from,
and add `lingual_parser` as an argument to `Parser` to be able to plug a custom lingual parser.
* `HiromuHota`_: Annotate types to some of the classes incl. preprocessors and parser/models.
* `HiromuHota`_: Add table argument to ``Labeler.apply`` (and ``Labeler.update``), which can now be used to annotate gold labels.

.. note::

    Example usage:

    .. code:: python

        # Define a LF for gold labels
        def gold(c: Candidate) -> int:
            # `some_condition` is a placeholder for your own check on the candidate.
            if some_condition(c):
                return TRUE
            else:
                return FALSE

        labeler = Labeler(session, [PartTemp, PartVolt])
        # Annotate gold labels
        labeler.apply(docs=docs, lfs=[[gold], [gold]], table=GoldLabel, train=True)
        # A label matrix can be obtained using the name of the annotator, "gold" in this case
        L_train_gold = labeler.get_gold_labels(train_cands, annotator="gold")
        # Annotate (noisy) labels
        labeler.apply(split=0, lfs=[[LF1, LF2, LF3], [LF4, LF5]], train=True)

    Note that the function name, "gold" in this example, is referred to as the annotator.

Changed
^^^^^^^
* `HiromuHota`_: Load a spaCy model if possible during `Spacy#__init__`.
* `HiromuHota`_: Rename Spacy to SpacyParser.
* `HiromuHota`_: Rename SimpleTokenizer into SimpleParser and let it inherit LingualParser.
* `HiromuHota`_: Move all lingual parsers into lingual_parser folder.
* `HiromuHota`_: Make load_lang_model private as a model is internally loaded during init.
* `HiromuHota`_: Add a unit test for ``Parser`` with tabular=False.
(`261 <https://github.com/HazyResearch/fonduer/pull/261>`_)
* `HiromuHota`_: Now ``longest_match_only`` of ``Union``, ``Intersect``, and ``Inverse`` override that of child matchers.
* `HiromuHota`_: Use the official name "beautifulsoup4" instead of an alias "bs4".
(`306 <https://github.com/HazyResearch/fonduer/issues/306>`_)
* `HiromuHota`_: Pin PyTorch to 1.1.0 to align with Snorkel of 0.9.X.
* `HiromuHota`_: Depend on psycopg2 instead of psycopg2-binary as the latter is not recommended for production.
* `HiromuHota`_: Change the default value for ``delim`` of ``SimpleParser`` from "<NB>" to ".".
(`272 <https://github.com/HazyResearch/fonduer/pull/272>`_)

Deprecated
^^^^^^^^^^
* `HiromuHota`_: Classifier and its subclass disc_models are deprecated, and in v0.8.0 they will be removed.

Removed
^^^^^^^
* `HiromuHota`_: Remove __repr__ from each mixin class as the referenced attributes are not available.
* `HiromuHota`_: Remove the dependency on nltk, but ``PorterStemmer()`` can still be used,
if it is provided as ``DictionaryMatch(stemmer=PorterStemmer())``.
* `HiromuHota`_: Remove ``_NgramMatcher`` and ``_FigureMatcher`` as they are no longer needed.
* `HiromuHota`_: Remove the dependency on Pandas and visual_linker._display_links.

Fixed
^^^^^
* `senwu`_: Fix legacy code bug in ``SymbolTable``.
* `HiromuHota`_: Fix the type of max_docs.
* `HiromuHota`_: Associate sentence with section and paragraph no matter what tabular is.
(`261 <https://github.com/HazyResearch/fonduer/pull/261>`_)
* `HiromuHota`_: Add a safeguard that prevents accessing Meta.engine before it is assigned.
Also this change allows creating a mention/candidate subclass even before Meta is initialized.
* `HiromuHota`_: Create an Engine and open a connection in each child process.
(`323 <https://github.com/HazyResearch/fonduer/issues/323>`_)
* `HiromuHota`_: Fix the issue that ``featurizer.apply(docs=train_docs)`` fails on clearing.
(`250 <https://github.com/HazyResearch/fonduer/issues/250>`_)
* `HiromuHota`_: Correct abs_char_offsets to make it absolute.
(`332 <https://github.com/HazyResearch/fonduer/issues/332>`_)
* `HiromuHota`_: Fix deadlock error during Labeler.apply and Featurizer.apply.
(`328 <https://github.com/HazyResearch/fonduer/issues/328>`_)
* `HiromuHota`_: Avoid networkx 2.4 so that snorkel-metal does not use the removed API.
* `HiromuHota`_: Fix the issue that Labeler.apply with docs instead of split fails.
(`340 <https://github.com/HazyResearch/fonduer/pull/340>`_)
* `HiromuHota`_: Make mention/candidate_subclasses and their objects picklable.
* `HiromuHota`_: Make Visualizer#display_candidates mention-type agnostic.
* `HiromuHota`_: Ensure labels get updated when LFs are updated.
(`336 <https://github.com/HazyResearch/fonduer/issues/336>`_)

0.7.0
-------------------

Added
^^^^^
* `HiromuHota`_: Add notes about the current implementation of data models.
* `HiromuHota`_: Add Featurizer#upsert_keys.
* `HiromuHota`_: Update the doc for OS X about an external dependency on libomp.
* `HiromuHota`_: Add test_classifier.py to unit test Classifier and its subclasses.
* `senwu`_: Add test_simple_tokenizer.py to unit test simple_tokenizer.
* `HiromuHota`_: Add test_spacy_parser.py to unit test spacy_parser.

Changed
^^^^^^^
* `HiromuHota`_: Assign a section for mention spaces.
* `HiromuHota`_: Incorporate entity_confusion_matrix as a first-class citizen and
rename it to confusion_matrix because it can be used at both the entity level
and the mention level.
* `HiromuHota`_: Separate Spacy#_split_sentences_by_char_limit to test itself.
* `HiromuHota`_: Refactor the custom sentence_boundary_detector for readability
and efficiency.
* `HiromuHota`_: Remove a redundant argument, document, from Spacy#split_sentences.
* `HiromuHota`_: Refactor TokenPreservingTokenizer for readability.

Removed
^^^^^^^
* `HiromuHota`_: Remove ``data_model_utils.tabular.same_document``, which
always returns True because a candidate can only have mentions from the same
document under the current implementation of ``CandidateExtractorUDF``.

Fixed
^^^^^
* `senwu`_: Fix the doc about the PostgreSQL version requirement.
